[ovs-dev] [PATCHv8] netdev-afxdp: add new netdev type for AF_XDP.

Ilya Maximets i.maximets at samsung.com
Thu May 16 06:58:44 UTC 2019


On 16.05.2019 2:20, William Tu wrote:
>>> +   The OVS AF_XDP netdev uses the userspace datapath, the same datapath
>>> +   as used by OVS-DPDK.  So it requires --disable-system for ovs-vswitchd
>>> +   and datapath_type=netdev when adding a new bridge.
>>
>> I don't think that '--disable-system' is needed. It doesn't affect anything.
>>
> 
> Thanks I will remove it.
> 
>> <snip>
>>
>>> +int
>>> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
>>> +                              struct dp_packet_batch *batch)
>>> +{
>>
>> One important issue here: netdev_linux_send() is thread-safe, because
>> all the syscalls and memory allocations there are thread-safe.
>> However, the xsk_ring_* APIs are not thread-safe, and if two
>> threads try to send packets to the same tx queue they might
>> corrupt the rings.  So, it's necessary to start using the
>> 'concurrent_txq' flag with per-queue locks.
>> Note that 'concurrent_txq' == 'false' only if 'n_txq' > 'n_pmd_threads'.
>>
> 
> Thanks!
> I have one question.  For example, if I have n_txq=4 and n_pmd_threads=2,
> then concurrent_txq = false.
> 
> Assume pmd1 processes rx queue0 on port1 and pmd2 processes rx queue0 on port2.
> What if both pmd1 and pmd2 try to send AF_XDP packets to tx queue0 on port2?
> Then both pmd threads would be calling the send function on port2 queue0
> concurrently.
> Does that mean I have to unconditionally add a per-queue lock?

No. You don't need that. dpif-netdev manages Tx queues in a way that
two threads will never use the same Tx queue of the same netdev without
'concurrent_txq' set to 'true'.

In your case above, 'n_txq' > 'n_pmd_threads', so dpif-netdev will use
static Tx queue ids, i.e. pmd1 will always use Tx queue #0, pmd2 will
always use Tx queue #1, and the main thread will always use Tx queue #2.

If 'n_txq' <= 'n_pmd_threads', dpif-netdev will use dynamic Tx queue ids
with an XPS-like mechanism, i.e. threads will allocate a Tx queue id
before sending.  XPS tries to avoid sharing, but it doesn't guarantee that
another thread will not use the same queue, so 'concurrent_txq' will
always be 'true' for this port.

You may use 'netdev_dpdk_send__()' as a reference.
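
For illustration, here is a rough sketch (not the actual patch code) of how
a per-queue lock could wrap the batch send when 'concurrent_txq' is true,
following the same structure as netdev_dpdk_send__().  The structure name
'netdev_afxdp_sketch' and its fields ('n_txq', 'tx_locks', 'xsks') are
hypothetical names used only for this example:

    #include "dp-packet.h"
    #include "ovs-thread.h"
    #include "util.h"

    /* Sketch only: this structure and its fields are illustrative,
     * not the actual patch structures. */
    struct netdev_afxdp_sketch {
        int n_txq;                      /* Number of Tx queues. */
        struct ovs_mutex *tx_locks;     /* One lock per Tx queue. */
        struct xsk_socket_info **xsks;  /* One AF_XDP socket per Tx queue. */
    };

    static int
    afxdp_send_sketch(struct netdev_afxdp_sketch *dev, int qid,
                      struct dp_packet_batch *batch, bool concurrent_txq)
    {
        int error;

        if (OVS_UNLIKELY(concurrent_txq)) {
            /* With dynamic (XPS) queue ids two threads may pick the same
             * queue, so remap the id into range and serialize all the
             * xsk_ring_* calls for that queue. */
            qid = qid % dev->n_txq;
            ovs_mutex_lock(&dev->tx_locks[qid]);
        }

        error = netdev_linux_afxdp_batch_send(dev->xsks[qid], batch);

        if (OVS_UNLIKELY(concurrent_txq)) {
            ovs_mutex_unlock(&dev->tx_locks[qid]);
        }

        return error;
    }

netdev_dpdk_send__() does the same thing with an rte_spinlock per Tx queue;
any per-queue lock that protects the xsk rings would work here.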

> 
> Regards,
> William
> 
>>> +    struct umem_elem *elems_pop[BATCH_SIZE];
>>> +    struct umem_elem *elems_push[BATCH_SIZE];
>>> +    uint32_t tx_done, idx_cq = 0;
>>> +    struct dp_packet *packet;
>>> +    uint32_t idx = 0;
>>> +    int j, ret, retry_count = 0;
>>> +    const int max_retry = 4;
>>> +
>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    /* Make sure we have enough TX descs */
>>> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
>>> +    if (OVS_UNLIKELY(ret == 0)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t index;
>>> +
>>> +        elem = elems_pop[i];
>>> +        /* Copy the packet into the umem element we just popped from
>>> +         * the umem pool.  We could avoid this copy if the packet
>>> +         * already resided in the same umem.
>>> +         */
>>> +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
>>> +
>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
>>> +            = dp_packet_size(packet);
>>> +    }
>>> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
>>> +    xsk->outstanding_tx += batch->count;
>>> +
>>> +    ret = kick_tx(xsk);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void **)elems_pop);
>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
>>> +                     ovs_strerror(ret));
>>> +        return ret;
>>> +    }
>>> +
>>> +retry:
>>> +    /* Process CQ */
>>> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq);
>>> +    if (tx_done > 0) {
>>> +        xsk->outstanding_tx -= tx_done;
>>> +        xsk->tx_npkts += tx_done;
>>> +    }
>>> +
>>> +    /* Recycle back to umem pool */
>>> +    for (j = 0; j < tx_done; j++) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t addr;
>>> +
>>> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
>>> +
>>> +        elem = ALIGNED_CAST(struct umem_elem *,
>>> +                            (char *)xsk->umem->buffer + addr);
>>> +        elems_push[j] = elem;
>>> +    }
>>> +
>>> +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void **)elems_push);
>>> +    ovs_assert(ret == 0);
>>> +
>>> +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
>>> +
>>> +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) {
>>> +        /* If there are still a lot not transmitted, try harder. */
>>> +        if (retry_count++ > max_retry) {
>>> +            return 0;
>>> +        }
>>> +        goto retry;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
> 
> 

