[ovs-dev] [PATCHv8] netdev-afxdp: add new netdev type for AF_XDP.

Eelco Chaudron echaudro at redhat.com
Mon May 20 13:37:49 UTC 2019


On 18 May 2019, at 3:16, William Tu wrote:

> Hi Eelco,
>
> Thanks for all the feedbacks. There are some issues in driver, some
> in libbpf, and some in my implementation. I will work on it ASAP.

My pleasure, see answers to your questions below…
>
> On Fri, May 17, 2019 at 3:23 AM Eelco Chaudron <echaudro at redhat.com> 
> wrote:
>>
>> Hi William,
>>
>> First a list of issues I found during some basic testing...
>>
>> - When I restart or stop OVS (using the systemctl interface as found 
>> in
>> RHEL) it does not clean up the BFP program causing the restart to 
>> fail:
>>
>>    2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1
>> reconfig fails
>>    2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set
>> interface eno1 new configuration
>>
>>    I need to manually run "ip link set dev eno1 xdp off" to make it
>> recover.
>
> I think this is a bug in libbpf, see
> [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
>
>>
>>
>> - When I remove a bridge, I get an emer in the revalidator:
>>
>>    2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp 
>> program
>>    2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111:
>> assertion !fd != !wevent failed in poll_create_node()
>>
>>    Easy to replicate with this:
>>
>>      $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0
>> datapath_type=netdev
>>      $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1
>> type="afxdp" options:xdpmode=drv
>>      $ ovs-vsctl del-br ovs_pvp_br0
>>
> Thanks I can reproduce it. Will make sure to fix it .
>
>>
>> - High pmd usage on the statistics, even with no packets is this
>> expected?
>>
>>    $ ovs-appctl dpif-netdev/pmd-rxq-show
>>    pmd thread numa_id 0 core_id 1:
>>      isolated : false
>>      port: dpdk0             queue-id:  0  pmd usage:  0 %
>>      port: eno1              queue-id:  0  pmd usage: 49 %
>>
>>    It goes up slowly and gets stuck at 49%
>>
>>
>> - When doing the PVP testing I noticed that the physical port has 
>> odd/no
>>    tx statistics:
>>
>>    $ ovs-ofctl dump-ports ovs_pvp_br0
>>    OFPST_PORT reply (xid=0x2): 3 ports
>>      port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0,
>> crc=0
>>               tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>>      port  eno1: rx pkts=103256197, bytes=6195630508, drop=0, errs=0,
>> frame=0, over=0, crc=0
>>               tx pkts=0, bytes=19789272440056, drop=0, errs=0, coll=0
>>      port  tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0, 
>> frame=0,
>> over=0, crc=0
>>               tx pkts=4058, bytes=502504, drop=0, errs=0, coll=0
>>
> I think the ixgbe driver has some issue.

Cool, this is the driver I’m using…

> If you run skb-mode, I think the stats are correct.
> See patch [1/2] ixgbe: fix AF_XDP tx byte count
>
>>
>> - Packets larger than 1028 bytes are dropped. Guess this needs to be
>> fixed, and we need to state that jumbo frames are not supported. Are 
>> you
>> planning on adding this?
>>
>>    Currently I can find not mentioning of MTU limitation in the
>> documentation, or any code to prevent it from being changed above the
>> supported limit.
>>
>>
>> - ovs-vswitchd is still crashing or stops forwarding packets when 
>> trying
>> to do
>>    PVP testing with Qemu that has a TAP interface doing XDP and 
>> running
>> packets
>>    at wire speed to the 10G interface.
>>
>>    When trying with lower volume packets it seems to work, so with 1%
>> traffic
>>    rate, it forwards packets without any problems (148,771 pps). If I 
>> go
>> to
>>    10% the first couple of packet pass, then it stops forwarding. If
>> it's not
>>    crashing I still see packets being received by eno1 flow rules, 
>> but
>> no
>>    packets make it to the VM.
>>
>>      Program terminated with signal SIGSEGV, Segmentation fault.
>>      #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send 
>> (xsk=0x0,
>> batch=batch at entry=0x7fc928005570) at lib/netdev-afxdp.c:654
>>      654            ret = umem_elem_pop_n(&xsk->umem->mpool, 
>> batch->count,
>> (void **)elems_pop);
>>      [Current thread is 1 (Thread 0x7fc95e734700 (LWP 3926))]
>>      Missing separate debuginfos, use: dnf debuginfo-install
>> openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
>>      (gdb) bt
>>      #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send 
>> (xsk=0x0,
>> batch=batch at entry=0x7fc928005570) at lib/netdev-afxdp.c:654
>>      #1  0x00000000009a1850 in netdev_linux_send (netdev_=0x2f7f540,
>> qid=<optimized out>, batch=0x7fc928005570, concurrent_txq=<optimized
>> out>) at lib/netdev-linux.c:1486
>>      #2  0x0000000000906051 in netdev_send (netdev=<optimized out>,
>> qid=qid at entry=0, batch=batch at entry=0x7fc928005570,
>> concurrent_txq=concurrent_txq at entry=true)
>>          at lib/netdev.c:797
>>      #3  0x00000000008d2c94 in dp_netdev_pmd_flush_output_on_port
>> (pmd=pmd at entry=0x7fc95e735010, p=p at entry=0x7fc928005540) at
>> lib/dpif-netdev.c:4185
>>      #4  0x00000000008d2faf in dp_netdev_pmd_flush_output_packets
>> (pmd=pmd at entry=0x7fc95e735010, force=force at entry=false) at
>> lib/dpif-netdev.c:4225
>>      #5  0x00000000008db317 in dp_netdev_pmd_flush_output_packets
>> (force=false, pmd=0x7fc95e735010) at lib/dpif-netdev.c:4280
>>      #6  dp_netdev_process_rxq_port (pmd=pmd at entry=0x7fc95e735010,
>> rxq=0x2f36c50, port_no=1) at lib/dpif-netdev.c:4280
>>      #7  0x00000000008db67d in pmd_thread_main (f_=<optimized out>) 
>> at
>> lib/dpif-netdev.c:5446
>>      #8  0x000000000095c96d in ovsthread_wrapper (aux_=<optimized 
>> out>)
>> at lib/ovs-thread.c:352
>>      #9  0x00007fc9789d62de in start_thread () from
>> /lib64/libpthread.so.0
>>      #10 0x00007fc97817ba63 in clone () from /lib64/libc.so.6
>>
>>
>> - make check-afxpd is failing for me, however, make check-kernel 
>> works
>> fine.
>>    Did not dive into it too much, but it fails here for all test 
>> cases,
>> this is the same build I use for testing.
>>
>>    ./system-afxdp-traffic.at:4: ovs-vsctl -- add-br br0 -- set Bridge
>> br0 datapath_type=netdev
>> protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15
>> fail-mode=secure  --
>>    --- /dev/null        2019-05-16 09:09:33.445562692 -0400
>>    +++
>> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/at-groups/1/stderr 
>>        2019-05-17
>> 05:46:20.506814939 -0400
>>    @@ -0,0 +1,2 @@
>>    +ovs-vsctl: Error detected while setting up 'br0'.  See 
>> ovs-vswitchd
>> log for details.
>>    +ovs-vsctl: The default log directory is
>> "/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01".
>>    ovsdb-server.log:
>>   > 2019-05-17T09:46:20.437Z|00001|vlog|INFO|opened log file
>> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovsdb-server.log
>>   > 2019-05-17T09:46:20.441Z|00002|ovsdb_server|INFO|ovsdb-server 
>> (Open
>> vSwitch) 2.11.90
>>    ovs-vswitchd.log:
>>   > 2019-05-17T09:46:20.461Z|00001|vlog|INFO|opened log file
>> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovs-vswitchd.log
>>   > 2019-05-17T09:46:20.462Z|00002|ovs_numa|INFO|Discovered 28 CPU 
>> cores
>> on NUMA node 0
>>   > 2019-05-17T09:46:20.462Z|00003|ovs_numa|INFO|Discovered 1 NUMA 
>> nodes
>> and 28 CPU cores
>>   >
>> 2019-05-17T09:46:20.462Z|00004|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
>> connecting...
>>   >
>> 2019-05-17T09:46:20.462Z|00005|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
>> connected
>>   > 2019-05-17T09:46:20.465Z|00006|bridge|INFO|ovs-vswitchd (Open
>> vSwitch) 2.11.90
>>   > 2019-05-17T09:46:20.505Z|00007|netdev_linux|WARN|ovs-netdev:
>> creating tap device failed: Device or resource busy
>
> I think the tap device is not cleared from previous settings.
> Or do
> #ip link del dev ovs-netdev
> then check again.

If after a reboot the first thing I do is the check, it seems to work. 
If OVS is running its failing all tests, which was my scenario.

>>   > 2019-05-17T09:46:20.508Z|00008|dpif|WARN|datapath ovs-netdev 
>> already
>> exists but cannot be opened: No such device
>>   > 2019-05-17T09:46:20.508Z|00009|ofproto_dpif|ERR|failed to open
>> datapath of type netdev: No such device
>>   > 2019-05-17T09:46:20.508Z|00010|ofproto|ERR|failed to open 
>> datapath
>> br0: No such device
>>   > 2019-05-17T09:46:20.508Z|00011|bridge|ERR|failed to create bridge
>> br0: No such device
>>    1. system-afxdp-traffic.at:3:  FAILED (system-afxdp-traffic.at:4)
>>
>>
>>
>>
>> The following might be useful when combining DPDK and AF_XDP:
>>
>>    Currently, DPDK and AF_XDP polling can be combined on a single PMD
>> thread, it
>>    might be nice to have an option to not do this, i.e. have separate
>> PMD
>>    threads for each type. I know we can do this with assigning 
>> specific
>> PMDs to
>>    queues, but this will disable auto-balancing. This will also help
>> later if
>>    we would add poll() mode support for AF_XDP.
>>
>>
>> Other review comments see inline below. I reviewed the code, not the
>> unit tests or automake changes.
>>
> <snip>
>
>>> +.. note::
>>> +   OVS AF_XDP netdev is using the userspace datapath, the same
>>> datapath
>>> +   as used by OVS-DPDK.  So it requires --disable-system for
>>> ovs-vswitchd
>>> +   and datapath_type=netdev when adding a new bridge.
>>
>> As mentioned earlier offline I think --disable-system can be removed 
>> as
>> the Kernel and userspace datapath can be run at the same time.
>>
> Yes, thanks
>
>>> +
>>> +Make sure your device driver support AF_XDP, and to use 1 PMD (on
>>> core 4)
>>> +on 1 queue (queue 0) device, configure these options: 
>>> **pmd-cpu-mask,
>>> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or
>>> "skb"::
>>
>> Wondering how options:xdpmode should operate without it being 
>> specified?
>> I would prefer that if the option is not specified it would try drv, 
>> and
>> if it fails fallback to skb.
>>
> by default it is using skb mode. I prefer skb-mode as default since
> it has less driver-related issues.

I see it from the other way around, we should use HW support as much as 
possible.
We get better performance, and as this is experimental we will find the 
driver issues earlier.

>> We need to add these new options to the vswitch.xml file
>
> OK
>>
>>> +
>>> +  ethtool -L enp2s0 combined 1
>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 
>>> type="afxdp"
>>> \
>>> +    options:n_rxq=1 options:xdpmode=drv \
>>> +    other_config:pmd-rxq-affinity="0:4"
>>> +
>>> +Or, use 4 pmds/cores and 4 queues by doing::
>>> +
>>> +  ethtool -L enp2s0 combined 4
>>> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
>>> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 
>>> type="afxdp"
>>> \
>>> +    options:n_rxq=4 options:xdpmode=drv \
>>> +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
>>> +
>>
>> Add some text that pmd-rxq-affinity is not a requirement, the system
>> will auto (re)assign.
>> Also, note that cores used by pmd-rxq-affinity are not shared/used by
>> floating PMDs.
>
> Good point, thanks
>>
>>> +To validate that the bridge has successfully instantiated, you can
>>> use the::
>>> +
>>> +  ovs-vsctl show
>>> +
>>> +should show something like::
>>> +
>>> +  Port "ens802f0"
>>> +   Interface "ens802f0"
>>> +      type: afxdp
>>> +      options: {n_rxq="1", xdpmode=drv}
>>> +
>>> +Otherwise, enable debug by::
>>> +
>>> +  ovs-appctl vlog/set netdev_afxdp::dbg
>>> +
>>> +References
>>> +----------
>>> +Most of the design details are described in the paper presented at
>>> +Linux Plumber 2018, "Bringing the Power of eBPF to Open 
>>> vSwitch"[1],
>>> +section 4, and slides[2][4].
>>> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good
>>> introduction
>>> +about AF_XDP current and future work.
>>> +
>>> +
>>> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
>>> +
>>> +[2]
>>> http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
>>> +
>>> +[3]
>>> http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
>>> +
>>> +[4]
>>> https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
>>> +
>>> +
>>> +Performance Tuning
>>> +------------------
>>> +The name of the game is to keep your CPU running in userspace,
>>> allowing PMD
>>> +to keep polling the AF_XDP queues without any interferences from
>>> kernel.
>>> +
>>> +#. Make sure everything is in the same NUMA node (memory used by
>>> AF_XDP, pmd
>>> +   running cores, device plug-in slot)
>>
>> How can you do this? The code is not taking care of NUMA, and memory 
>> is
>> allocated with posix_memalign so no idea which NUMA node it gets
>> allocated.
>
> right... I'm hoping that users can be aware of this and run 
> ovs-vswitchd
> using taskset -p <cpu mask>
> So running the process on the correct NUMA node.

Not too familiar with the posix_memalign backend, will this force to use 
the correct NUMA memory.

This will not work in practice as ovs-vswitchd gets started, for 
example, by systemd. This will also not work if you have NICs on 
different NUMAs.

>>
>>> +#. Isolate your CPU by doing isolcpu at grub configure.
>>> +
>>> +#. IRQ should not set to pmd running core.
>>> +
>>> +#. The Spectre and Meltdown fixes increase the overhead of system
>>> calls.
>>> +
>>
>> Maybe be more consistent, either one or two newlines before a 
>> heading?
> OK
>
> <snip>
>
>>> +
>>> +Create OpenFlow rules::
>>> +
>>> +  ovs-vsctl add-port br0 tap0
>>
>> Maybe add tap as XDP or else it will be an AF_PACKET interface 
>> polling
>> in the main thread.
>
> I think it should work (XDP on tap). Let me try.
> What's your concern here about polling AF_PACKET?

So when mixing native kernel drivers with the userspace datapath. The 
kernel interfaces are serviced by an AF_PACKET socket the main OVS 
thread, and not in the PMD threads. This could delay general tasks, or 
general tasks will delay packet processing. If you use an XDP version of 
the TAP it’s polled in the PMD thread.

>
>>
>>> +  ovs-ofctl del-flows br0
>>> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
>>> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
>>> +
>>> +
>>> +Attach the veth port to br0 (linux kernel mode)::
>>> +
>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>> +    set interface afxdp-p0 options:n_rxq=1 options:xdpmode=skb
>>> +
>>
>> Remove the xdpmode=skb above... Also, see above on the PF_PACKET
>> interface in the bridge_run(),
>> I would advise against using this, and you might want to remove it.
>
> OK
>>
>>> +
>>> +Or, use AF_XDP with skb mode::
>>> +
>>> +  ovs-vsctl add-port br0 afxdp-p0 -- \
>>> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1
>>> options:xdpmode=skb
>>> +
>>> +Setup the OpenFlow rules::
>>> +
> <snip>
>>> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
>>> index 0976a35e758b..7d086dc5e860 100644
>>> --- a/lib/dp-packet.c
>>> +++ b/lib/dp-packet.c
>>> @@ -22,6 +22,9 @@
>>>  #include "netdev-dpdk.h"
>>>  #include "openvswitch/dynamic-string.h"
>>>  #include "util.h"
>>> +#ifdef HAVE_AF_XDP
>>> +#include "netdev-afxdp.h"
>>> +#endif
>>
>> Why the protection above? You do not do this in netdev-linux.c.
>> Maybe you should move the #ifdef HAVE_AF_XDP inside the include file?
>>
>
> OK will fix it
>
>>>  static void
>>>  dp_packet_init__(struct dp_packet *b, size_t allocated, enum
>>> dp_packet_source source)
>>> @@ -59,6 +62,27 @@ dp_packet_use(struct dp_packet *b, void *base,
>>> size_t allocated)
>>>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>>>  }
>>>
>
> <snip>
>
>>> --- /dev/null
>>> +++ b/lib/netdev-afxdp.c
>>> @@ -0,0 +1,727 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing,
>>> software
>>> + * distributed under the License is distributed on an "AS IS" 
>>> BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied.
>>> + * See the License for the specific language governing permissions
>>> and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#if !defined(__i386__) && !defined(__x86_64__)
>>> +#error AF_XDP supported only for Linux on x86 or x86_64
>>
>> Any reason why we do not support PPC and ARM?
>>
>>> +#endif
>>> +
>>> +#include <config.h>
>>> +
>>> +#include "netdev-linux-private.h"
>>> +#include "netdev-linux.h"
>>
>> Swap the two above, see comment in netdev-linux-private.h
>>
>>> +#include "netdev-afxdp.h"
>>> +
>>> +#include <arpa/inet.h>
>>> +#include <errno.h>
>>> +#include <fcntl.h>
>>> +#include <inttypes.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +#include <linux/rtnetlink.h>
>>> +#include <linux/sockios.h>
>>> +#include <linux/if_xdp.h>
>>> +#include <net/if.h>
>>> +#include <net/if_arp.h>
>>> +#include <net/route.h>
>>> +#include <netinet/in.h>
>>> +#include <netpacket/packet.h>
>>> +#include <poll.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/ioctl.h>
>>> +#include <sys/types.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/utsname.h>
>>> +#include <unistd.h>
>>> +
>>
>> Some of these includes are included by netdev-linux(-private).h 
>> already
>> so why not remove them?
>
> OK will remove them.
>>
>>> +#include "coverage.h"
>>> +#include "dp-packet.h"
>>> +#include "dpif-netlink.h"
>>> +#include "dpif-netdev.h"
>>> +#include "fatal-signal.h"
>>> +#include "hash.h"
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "netlink-notifier.h"
>>> +#include "netlink-socket.h"
>>> +#include "netlink.h"
>>> +#include "netnsid.h"
>>> +#include "openflow/openflow.h"
>>> +#include "openvswitch/dynamic-string.h"
>>> +#include "openvswitch/hmap.h"
>>> +#include "openvswitch/ofpbuf.h"
>>> +#include "openvswitch/poll-loop.h"
>>> +#include "openvswitch/vlog.h"
>>> +#include "openvswitch/shash.h"
>>> +#include "ovs-atomic.h"
>>> +#include "packets.h"
>>> +#include "rtnetlink.h"
>>> +#include "socket-util.h"
>>> +#include "sset.h"
>>> +#include "tc.h"
>>> +#include "timer.h"
>>> +#include "unaligned.h"
>>> +#include "util.h"
>>> +#include "xdpsock.h"
>>> +
>>> +#ifndef SOL_XDP
>>> +#define SOL_XDP 283
>>> +#endif
>>> +#ifndef AF_XDP
>>> +#define AF_XDP 44
>>> +#endif
>>> +#ifndef PF_XDP
>>> +#define PF_XDP AF_XDP
>>> +#endif
>>
>> Do we really need to include the above? Or should we update the 
>> install
>> instruction to move them over from the kernel headers?
>>
> I think we only need SOL_XDP.
>
>>> +
>>> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
>>> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>> +
>>> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char
>>> *)base))
>>> +#define UMEM2XPKT(base, i) \
>>> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char 
>>> *)base
>>> + \
>>> +                               i * sizeof(struct dp_packet_afxdp))
>>> +
>
> <snip>
> Some comments here about umem and queue are discussed later with Ilya.
> Will address them together.
>
>>> +int
>>> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
>>> +                              struct dp_packet_batch *batch)
>>> +{
>>
>> See Ilya's comment on thread safety on the ring APIs.
>>
>>> +    struct umem_elem *elems_pop[BATCH_SIZE];
>>> +    struct umem_elem *elems_push[BATCH_SIZE];
>>> +    uint32_t tx_done, idx_cq = 0;
>>> +    struct dp_packet *packet;
>>> +    uint32_t idx = 0;
>>> +    int j, ret, retry_count = 0;
>>> +    const int max_retry = 4;
>>> +
>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void
>>> **)elems_pop);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    /* Make sure we have enough TX descs */
>>> +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
>>> +    if (OVS_UNLIKELY(ret == 0)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
>>> **)elems_pop);
>>> +        return EAGAIN;
>>> +    }
>>> +
>>> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t index;
>>> +
>>> +        elem = elems_pop[i];
>>> +        /* Copy the packet to the umem we just pop from umem pool.
>>> +         * We can avoid this copy if the packet and the pop umem
>>> +         * are located in the same umem.
>>> +         */
>>
>> The comment mentions the copy can be avoided, but it's not 
>> implemented
>> in the code, is this correct or was something removed?
>>
>>> +        memcpy(elem, dp_packet_data(packet), 
>>> dp_packet_size(packet));
>
> it's not implemented yet. now it's always making a copy
>>> +
>>> +        index = (uint64_t)((char *)elem - (char 
>>> *)xsk->umem->buffer);
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
>>> +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
>>> +            = dp_packet_size(packet);
>>> +    }
>>> +    xsk_ring_prod__submit(&xsk->tx, batch->count);
>>> +    xsk->outstanding_tx += batch->count;
>>> +
>>> +    ret = kick_tx(xsk);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
>>> **)elems_pop);
>>> +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
>>> +                     ovs_strerror(ret));
>>> +        return ret;
>>
>> I think we should still try to recover the CQ below, even on failure.
>>
>>> +    }
>>> +
>>> +retry:
>>> +    /* Process CQ */
>>> +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count,
>>> &idx_cq);
>>> +    if (tx_done > 0) {
>>> +        xsk->outstanding_tx -= tx_done;
>>> +        xsk->tx_npkts += tx_done;
>>> +    }
>>> +
>>> +    /* Recycle back to umem pool */
>>> +    for (j = 0; j < tx_done; j++) {
>>> +        struct umem_elem *elem;
>>> +        uint64_t addr;
>>> +
>>> +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
>>> +
>>> +        elem = ALIGNED_CAST(struct umem_elem *,
>>> +                            (char *)xsk->umem->buffer + addr);
>>> +        elems_push[j] = elem;
>>> +    }
>>> +
>>> +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void
>>> **)elems_push);
>>> +    ovs_assert(ret == 0);
>>> +
>>> +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
>>> +
>>> +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 
>>> 2))
>>> {
>>> +        /* If there are still a lot not transmitted, try harder. */
>>> +        if (retry_count++ > max_retry) {
>>> +            return 0;
>>> +        }
>>> +        goto retry;
>>> +    }
>>> +
>>
>> I think the code above is causing my lockup at wire speed mentioned
>> above...
>> I guess the retry_count expires every transmit sending packets to the
>> TAP interface.
>> No all buffers are used... This is causing the umem_elem_pop_n() in 
>> the
>> beginning to fail, hence the buffers are never returned!
>>
>> Guess we might need some reclaim in the beginning, or maybe even in 
>> the
>> rx loop?
>
> right, let me re-work this part of code.
>
>>> +    return 0;
>>> +}
>>> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
>>> new file mode 100644
>>> index 000000000000..6518d8fca0b5
>>> --- /dev/null
>>> +++ b/lib/netdev-afxdp.h
>>> @@ -0,0 +1,53 @@
>>> +/*
>>> + * Copyright (c) 2018 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing,
>>> software
>>> + * distributed under the License is distributed on an "AS IS" 
>>> BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied.
>>> + * See the License for the specific language governing permissions
>>> and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef NETDEV_AFXDP_H
>>> +#define NETDEV_AFXDP_H 1
>>> +
>>> +#include <stdint.h>
>>> +#include <stdbool.h>
>>> +
>>> +/* These functions are Linux AF_XDP specific, so they should be 
>>> used
>>> directly
>>> + * only by Linux-specific code. */
>>
>> Extra enter?
>>
>>> +#define MAX_XSKQ 16
>>
>> Extra enter?
>>
>
> OK
>
>>> +struct netdev;
>>> +struct xsk_socket_info;
>>> +struct xdp_umem;
>>> +struct dp_packet_batch;
>>> +struct smap;
>>> +struct dp_packet;
>>> +
>>> +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct 
>>> dp_packet
>>> *d);
>>> +
>>> +int xsk_configure_all(struct netdev *netdev);
>>> +
>>> +void xsk_destroy_all(struct netdev *netdev);
>>> +
>>> +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
>>> +                         struct dp_packet_batch *batch);
>>> +
>>> +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
>>> +                                  struct dp_packet_batch *batch);
>>> +
>>> +int netdev_afxdp_set_config(struct netdev *netdev, const struct 
>>> smap
>>> *args,
>>> +                            char **errp);
>>> +int netdev_afxdp_get_config(const struct netdev *netdev, struct 
>>> smap
>>> *args);
>>> +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
>>> +
>>> +void free_afxdp_buf(struct dp_packet *p);
>>> +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
>>> +int netdev_afxdp_reconfigure(struct netdev *netdev);
>>> +#endif /* netdev-afxdp.h */
>>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>>> new file mode 100644
>>> index 000000000000..3dd3d902b3c4
>>> --- /dev/null
>>> +++ b/lib/netdev-linux-private.h
>>> @@ -0,0 +1,124 @@
>>> +/*
>>> + * Copyright (c) 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing,
>>> software
>>> + * distributed under the License is distributed on an "AS IS" 
>>> BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied.
>>> + * See the License for the specific language governing permissions
>>> and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef NETDEV_LINUX_PRIVATE_H
>>> +#define NETDEV_LINUX_PRIVATE_H 1
>>> +
>>> +#include <config.h>
>>> +
>>> +#include <linux/filter.h>
>>> +#include <linux/gen_stats.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +#include <stdint.h>
>>> +#include <stdbool.h>
>>> +
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +#include "timer.h"
>>
>> Why include all the above? They where just added to netdev-linux.h, 
>> so
>> if you make sure you include netdev-lunux.h before -private it should
>> work out.
>>
>>> +
>>> +#if HAVE_AF_XDP
>>> +#include "netdev-afxdp.h"
>>> +#endif
>>
>> See earlier comment
>>
>>> +
>>> +/* These functions are Linux specific, so they should be used
>>> directly only by
>>> + * Linux-specific code. */
>>> +
>>> +struct netdev;
>>> +
>>> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t
>>> flag,
>>> +                                  const char *flag_name, bool
>>> enable);
>>> +int linux_get_ifindex(const char *netdev_name);
>>> +
>>
>> These functions are now both specified in netdev-linux.h and
>> netdev-linux-private.h
>>
>>> +#define LINUX_FLOW_OFFLOAD_API                          \
>>> +   .flow_flush = netdev_tc_flow_flush,                  \
>>> +   .flow_dump_create = netdev_tc_flow_dump_create,      \
>>> +   .flow_dump_destroy = netdev_tc_flow_dump_destroy,    \
>>> +   .flow_dump_next = netdev_tc_flow_dump_next,          \
>>> +   .flow_put = netdev_tc_flow_put,                      \
>>> +   .flow_get = netdev_tc_flow_get,                      \
>>> +   .flow_del = netdev_tc_flow_del,                      \
>>> +   .init_flow_api = netdev_tc_init_flow_api
>>> +
>>
>> Same here, this define is in both include files.
>>
>>> +struct netdev_linux {
>>> +    struct netdev up;
>>> +
>>> +    /* Protects all members below. */
>>> +    struct ovs_mutex mutex;
>>> +
>>> +    unsigned int cache_valid;
>>> +
>>> +    bool miimon;                    /* Link status of last poll. */
>>> +    long long int miimon_interval;  /* Miimon Poll rate. Disabled 
>>> if
>>> <= 0. */
>>> +    struct timer miimon_timer;
>>> +
>>> +    int netnsid;                    /* Network namespace ID. */
>>> +    /* The following are figured out "on demand" only.  They are 
>>> only
>>> valid
>>> +     * when the corresponding VALID_* bit in 'cache_valid' is set. 
>>> */
>>> +    int ifindex;
>>> +    struct eth_addr etheraddr;
>>> +    int mtu;
>>> +    unsigned int ifi_flags;
>>> +    long long int carrier_resets;
>>> +    uint32_t kbits_rate;        /* Policing data. */
>>> +    uint32_t kbits_burst;
>>> +    int vport_stats_error;      /* Cached error code from
>>> vport_get_stats().
>>> +                                   0 or an errno value. */
>>> +    int netdev_mtu_error;       /* Cached error code from 
>>> SIOCGIFMTU
>>> +                                 * or SIOCSIFMTU.
>>> +                                 */
>>> +    int ether_addr_error;       /* Cached error code from set/get
>>> etheraddr. */
>>> +    int netdev_policing_error;  /* Cached error code from set
>>> policing. */
>>> +    int get_features_error;     /* Cached error code from
>>> ETHTOOL_GSET. */
>>> +    int get_ifindex_error;      /* Cached error code from
>>> SIOCGIFINDEX. */
>>> +
>>> +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. 
>>> */
>>> +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. 
>>> */
>>> +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. 
>>> */
>>> +
>>> +    struct ethtool_drvinfo drvinfo;  /* Cached from 
>>> ETHTOOL_GDRVINFO.
>>> */
>>> +    struct tc *tc;
>>> +
>>> +    /* For devices of class netdev_tap_class only. */
>>> +    int tap_fd;
>>> +    bool present;               /* If the device is present in the
>>> namespace */
>>> +    uint64_t tx_dropped;        /* tap device can drop if the iface
>>> is down */
>>> +
>>> +    /* LAG information. */
>>> +    bool is_lag_master;         /* True if the netdev is a LAG
>>> master. */
>>> +
>>> +    /* AF_XDP information */
>>> +#ifdef HAVE_AF_XDP
>>> +    struct xsk_socket_info *xsk[MAX_XSKQ];
>>> +    int requested_n_rxq;
>>> +    int xdpmode, requested_xdpmode; /* detect mode changed */
>>> +    int xdp_flags, xdp_bind_flags;
>>> +#endif
>>> +};
>>> +
>>> +static struct netdev_linux *
>>> +netdev_linux_cast(const struct netdev *netdev)
>>> +{
>>
>> In the original definition there was an assert() here, was it removed 
>> by
>> accident?
>> netdev_linux_rxq_xsk
>>> +    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>> +}
>>> +
>>> +#endif /* netdev-linux-private.h */
>>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
>>> index f75d73fd39f8..1f190406d145 100644
>>> --- a/lib/netdev-linux.c
>>> +++ b/lib/netdev-linux.c
>>> @@ -17,6 +17,7 @@
>>>  #include <config.h>
>>>
>>>  #include "netdev-linux.h"
>>> +#include "netdev-linux-private.h"
>>>
>>>  #include <errno.h>
>>>  #include <fcntl.h>
>>> @@ -54,6 +55,7 @@
>>>  #include "fatal-signal.h"
>>>  #include "hash.h"
>>>  #include "openvswitch/hmap.h"
>>> +#include "netdev-afxdp.h"
>>>  #include "netdev-provider.h"
>>>  #include "netdev-tc-offloads.h"
>>>  #include "netdev-vport.h"
>>> @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
>>>  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, 
>>> int
>>> mtu);
>>>  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t
>>> burst_bytes);
>>>
>>> -struct netdev_linux {
>>> -    struct netdev up;
>>> -
>>> -    /* Protects all members below. */
>>> -    struct ovs_mutex mutex;
>>> -
>>> -    unsigned int cache_valid;
>>> -
>>> -    bool miimon;                    /* Link status of last poll. */
>>> -    long long int miimon_interval;  /* Miimon Poll rate. Disabled 
>>> if
>>> <= 0. */
>>> -    struct timer miimon_timer;
>>> -
>>> -    int netnsid;                    /* Network namespace ID. */
>>> -    /* The following are figured out "on demand" only.  They are 
>>> only
>>> valid
>>> -     * when the corresponding VALID_* bit in 'cache_valid' is set. 
>>> */
>>> -    int ifindex;
>>> -    struct eth_addr etheraddr;
>>> -    int mtu;
>>> -    unsigned int ifi_flags;
>>> -    long long int carrier_resets;
>>> -    uint32_t kbits_rate;        /* Policing data. */
>>> -    uint32_t kbits_burst;
>>> -    int vport_stats_error;      /* Cached error code from
>>> vport_get_stats().
>>> -                                   0 or an errno value. */
>>> -    int netdev_mtu_error;       /* Cached error code from 
>>> SIOCGIFMTU
>>> or SIOCSIFMTU. */
>>> -    int ether_addr_error;       /* Cached error code from set/get
>>> etheraddr. */
>>> -    int netdev_policing_error;  /* Cached error code from set
>>> policing. */
>>> -    int get_features_error;     /* Cached error code from
>>> ETHTOOL_GSET. */
>>> -    int get_ifindex_error;      /* Cached error code from
>>> SIOCGIFINDEX. */
>>> -
>>> -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. 
>>> */
>>> -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. 
>>> */
>>> -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. 
>>> */
>>> -
>>> -    struct ethtool_drvinfo drvinfo;  /* Cached from 
>>> ETHTOOL_GDRVINFO.
>>> */
>>> -    struct tc *tc;
>>> -
>>> -    /* For devices of class netdev_tap_class only. */
>>> -    int tap_fd;
>>> -    bool present;               /* If the device is present in the
>>> namespace */
>>> -    uint64_t tx_dropped;        /* tap device can drop if the iface
>>> is down */
>>> -
>>> -    /* LAG information. */
>>> -    bool is_lag_master;         /* True if the netdev is a LAG
>>> master. */
>>> -};
>>>
>>>  struct netdev_rxq_linux {
>>>      struct netdev_rxq up;
>>> @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct 
>>> netdev_class
>>> *netdev_class)
>>>      return netdev_class->run == netdev_linux_run;
>>>  }
>>>
>>> +#if HAVE_AF_XDP
>>>  static bool
>>> -is_tap_netdev(const struct netdev *netdev)
>>> +is_afxdp_netdev(const struct netdev *netdev)
>>>  {
>>> -    return netdev_get_class(netdev) == &netdev_tap_class;
>>> +    return netdev_get_class(netdev) == &netdev_afxdp_class;
>>>  }
>>> -
>>> -static struct netdev_linux *
>>> -netdev_linux_cast(const struct netdev *netdev)
>>> +#else
>>> +static bool
>>> +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
>>>  {
>>> -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
>>> -
>>> -    return CONTAINER_OF(netdev, struct netdev_linux, up);
>>> +    return false;
>>> +}
>>> +#endif
>>> +static bool
>>> +is_tap_netdev(const struct netdev *netdev)
>>> +{
>>> +    return netdev_get_class(netdev) == &netdev_tap_class;
>>>  }
>>>
>>>  static struct netdev_rxq_linux *
>>> @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_)
>>>          atomic_count_dec(&miimon_cnt);
>>>      }
>>>
>>> +#if HAVE_AF_XDP
>>> +    if (is_afxdp_netdev(netdev_)) {
>>> +        xsk_destroy_all(netdev_);
>>> +    }
>>> +#endif
>>
>> Think you can remove the HAVE_AF_XDP here, as you do not use it below
>> either.
>
> Yes
>
>>
>>>      ovs_mutex_destroy(&netdev->mutex);
>>>  }
>>>
>>> @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq
>>> *rxq_)
>>>      rx->is_tap = is_tap_netdev(netdev_);
>>>      if (rx->is_tap) {
>>>          rx->fd = netdev->tap_fd;
>>> -    } else {
>>> +    } else if (!is_afxdp_netdev(netdev_)) {
>>>          struct sockaddr_ll sll;
>>>          int ifindex, val;
>>>          /* Result of tcpdump -dd inbound */
>>> @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq 
>>> *rxq_,
>>> struct dp_packet_batch *batch,
>>>  {
>>>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>>>      struct netdev *netdev = rx->up.netdev;
>>> -    struct dp_packet *buffer;
>>> +    struct dp_packet *buffer = NULL;
>>>      ssize_t retval;
>>>      int mtu;
>>>
>>> +#if HAVE_AF_XDP
>>
>> Think this #if HAVE_AF_XDP can be removed as the compiler should
>> optimize out the if (false).
>>
>>> +    if (is_afxdp_netdev(netdev)) {
>>> +        struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +        int qid = rxq_->queue_id;
>>> +
>>> +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
>>> +    }
>>> +#endif
>>>      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
>>>          mtu = ETH_PAYLOAD_MAX;
>>>      }
>>> @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_,
>>> struct dp_packet_batch *batch,
>>>      /* Assume Ethernet port. No need to set packet_type. */
>>>      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
>>>                                             DP_NETDEV_HEADROOM);
>>> +
>>>      retval = (rx->is_tap
>>>                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
>>>                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
>>> @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int
>>> qid OVS_UNUSED,
>>>      int error = 0;
>>>      int sock = 0;
>>>
>>> -    if (!is_tap_netdev(netdev_)) {
>>> +    if (!is_tap_netdev(netdev_) &&
>>> +        !is_afxdp_netdev(netdev_)) {
>>>          if
>>> (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
>>>              error = EOPNOTSUPP;
>>>              goto free_batch;
>>> @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int
>>> qid OVS_UNUSED,
>>>          }
>>>
>>>          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
>>> +#if HAVE_AF_XDP
>>
>> Same here remove the #if HAVE_AF_XDP
>>
>>> +    } else if (is_afxdp_netdev(netdev_)) {
>>> +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
>>> +        struct dp_packet_afxdp *xpacket;
>>> +        struct umem_pool *first_mpool;
>>> +        struct dp_packet *packet;
>>> +
>>> +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], 
>>> batch);
>>> +
>>> +        /* all packets must come frome the same umem pool
>>> +         * and has DPBUF_AFXDP type, otherwise free on-by-one
>>> +         */
>>> +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
>>> +            if (packet->source != DPBUF_AFXDP) {
>>> +                goto free_batch;
>>> +            }
>>> +
>>> +            xpacket = dp_packet_cast_afxdp(packet);
>>> +            if (i == 0) {
>>> +                first_mpool = xpacket->mpool;
>>> +                continue;
>>> +            }
>>> +            if (xpacket->mpool != first_mpool) {
>>> +                goto free_batch;
>>> +            }
>>> +        }
>>
>> Why do not we not move all the packet type checks to
>> free_afxdp_buf_batch()?
>
> Here I plan to move them into the afxdp-specific send function.
> So it won't be part of netdev_linux_send(), hopefully it's more clean.
>>
>>> +        /* free in batch */
>>> +        free_afxdp_buf_batch(batch);
>>> +        return error;
>>> +#endif
>>>      } else {
>>>          error = netdev_linux_tap_batch_send(netdev_, batch);
>>>      }
>>> @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = 
>>> {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      LINUX_FLOW_OFFLOAD_API,
>>>      .type = "system",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct,
>>>      .get_stats = netdev_linux_get_stats,
>>>      .get_features = netdev_linux_get_features,
>>> @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = 
>>> {
>>>  const struct netdev_class netdev_tap_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      .type = "tap",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct_tap,
>>>      .get_stats = netdev_tap_get_stats,
>>>      .get_features = netdev_linux_get_features,
>>> @@ -3343,10 +3352,26 @@ const struct netdev_class
>>> netdev_internal_class = {
>>>      NETDEV_LINUX_CLASS_COMMON,
>>>      LINUX_FLOW_OFFLOAD_API,
>>>      .type = "internal",
>>> +    .is_pmd = false,
>>>      .construct = netdev_linux_construct,
>>>      .get_stats = netdev_internal_get_stats,
>>>      .get_status = netdev_internal_get_status,
>>>  };
>>> +
>>> +#ifdef HAVE_AF_XDP
>>> +const struct netdev_class netdev_afxdp_class = {
>>> +    NETDEV_LINUX_CLASS_COMMON,
>>> +    .type = "afxdp",
>>> +    .is_pmd = true,
>>> +    .construct = netdev_linux_construct,
>>> +    .get_stats = netdev_linux_get_stats,
>>> +    .get_status = netdev_linux_get_status,
>>> +    .set_config = netdev_afxdp_set_config,
>>> +    .get_config = netdev_afxdp_get_config,
>>> +    .reconfigure = netdev_afxdp_reconfigure,
>>> +    .get_numa_id = netdev_afxdp_get_numa_id,
>>> +};
>>> +#endif
>>>
>>>
>>>  #define CODEL_N_QUEUES 0x0000
>>> diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
>>> index 17ca9120168a..b812e64cb078 100644
>>> --- a/lib/netdev-linux.h
>>> +++ b/lib/netdev-linux.h
>>> @@ -19,6 +19,20 @@
>>>
>>>  #include <stdint.h>
>>>  #include <stdbool.h>
>>> +#include <linux/filter.h>
>>> +#include <linux/gen_stats.h>
>>> +#include <linux/if_ether.h>
>>> +#include <linux/if_tun.h>
>>> +#include <linux/types.h>
>>> +#include <linux/ethtool.h>
>>> +#include <linux/mii.h>
>>> +
>>> +#include "netdev-provider.h"
>>> +#include "netdev-tc-offloads.h"
>>> +#include "netdev-vport.h"
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +#include "timer.h"
>>
>> Is there a reason why you move all these includes here? If there is 
>> you
>> might as well remove the duplicates from .c files that include
>> netdev-linux.h, for example, netdev-linux.c
>
> <snip>
> Will work on this part later.
>
>>> +static inline void
>>> +ovs_spin_unlock(ovs_spinlock_t *sl)
>>> +{
>>> +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
>>> +}
>>> +
>>> +static inline int OVS_UNUSED
>>> +ovs_spin_trylock(ovs_spinlock_t *sl)
>>> +{
>>> +    int exp = 0;
>>> +    return atomic_compare_exchange_strong_explicit(&sl->locked, 
>>> &exp,
>>> 1,
>>> +                memory_order_acquire,
>>> +                memory_order_relaxed);
>>> +}
>>
>> Move spinlock function out to a common file
>>
> OK, I plan to add lib/spinlock.h and move to it.
>
>>> +
>>> +inline int
>>> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
>>
>> This is a stack overflow
>>
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    ptr = &umemp->array[umemp->index];
>>> +    memcpy(ptr, addrs, n * sizeof(void *));
>>> +    umemp->index += n;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    int ret;
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    ret = __umem_elem_push_n(umemp, n, addrs);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +inline void
>>> +__umem_elem_push(struct umem_pool *umemp, void *addr)
>>> +{
>>> +    umemp->array[umemp->index++] = addr;
>>> +}
>>> +
>>> +void
>>> +umem_elem_push(struct umem_pool *umemp, void *addr)
>>> +{
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
>>> +        /* stack is overflow, this should not happen */
>>> +        OVS_NOT_REACHED();
>>> +    }
>>
>> Should this not be moved after the spinlock, i.e. to __umem_elem_push
> OK
>
>>
>>> +
>>> +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    __umem_elem_push(umemp, addr);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +}
>>> +
>>> +inline int
>>> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
>>> +        return -ENOMEM;
>>> +    }
>>> +
>>> +    umemp->index -= n;
>>> +    ptr = &umemp->array[umemp->index];
>>> +    memcpy(addrs, ptr, n * sizeof(void *));
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int
>>> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
>>> +{
>>> +    int ret;
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    ret = __umem_elem_pop_n(umemp, n, addrs);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +inline void *
>>> +__umem_elem_pop(struct umem_pool *umemp)
>>> +{
>>
>> There is no check here to see if there are actual any elements left,
>> like there is for pop_n,
>> so we could corrupt memory/umem_pool
>>
> Yes, I will add a check.
>
>>> +    return umemp->array[--umemp->index];
>>> +}
>>> +
>>> +void *
>>> +umem_elem_pop(struct umem_pool *umemp)
>>> +{
>>> +    void *ptr;
>>> +
>>> +    ovs_spin_lock(&umemp->mutex);
>>> +    ptr = __umem_elem_pop(umemp);
>>> +    ovs_spin_unlock(&umemp->mutex);
>>> +
>>> +    return ptr;
>>> +}
>>> +
>>> +void **
>>> +__umem_pool_alloc(unsigned int size)
>>> +{
>>> +    void *bufs;
>>> +
>>> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
>>> +                              size * sizeof(void *)) == 0);
>>
>> We should not assert, just return NULL here.
>>
>>> +    memset(bufs, 0, size * sizeof(void *));
>>> +    return (void **)bufs;
>>> +}
>>> +
>>> +unsigned int
>>> +umem_elem_count(struct umem_pool *mpool)
>>> +{
>>> +    return mpool->index;
>>> +}
>>> +
>>> +int
>>> +umem_pool_init(struct umem_pool *umemp, unsigned int size)
>>> +{
>>> +    umemp->array = __umem_pool_alloc(size);
>>> +    if (!umemp->array) {
>>> +        OVS_NOT_REACHED();
>>
>> If NULL is returned return ENOMEM
>>
>>> +    }
>>> +
>>> +    umemp->size = size;
>>> +    umemp->index = 0;
>>> +    ovs_spinlock_init(&umemp->mutex);
>>> +    return 0;
>>> +}
>>> +
>>> +void
>>> +umem_pool_cleanup(struct umem_pool *umemp)
>>> +{
>>> +    free(umemp->array);
>>         umemp->array = NULL;
>>> +}
>>> +
>>> +/* AF_XDP metadata init/destroy */
>>> +int
>>> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
>>> +{
>>> +    void *bufs;
>>> +
>>> +    /* TODO: check HAVE_POSIX_MEMALIGN  */
>>
>> Guess the above needs to be done
>>
>>> +    ovs_assert(posix_memalign(&bufs, getpagesize(),
>>> +                              size * sizeof(struct 
>>> dp_packet_afxdp))
>>> == 0);
>>
>> We should not assert, just return false
>>
>>> +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
>>> +
>>> +    xp->array = bufs;
>>> +    xp->size = size;
>>> +    return 0;
>>> +}
>>> +
>>> +void
>>> +xpacket_pool_cleanup(struct xpacket_pool *xp)
>>> +{
>>> +    free(xp->array);
>>         xp->array = NULL;
>>> +}
>>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
>>> new file mode 100644
>>> index 000000000000..aabaa8e5df24
>>> --- /dev/null
>>> +++ b/lib/xdpsock.h
>>> @@ -0,0 +1,123 @@
>>> +/*
>>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>>> + *
>>> + * Licensed under the Apache License, Version 2.0 (the "License");
>>> + * you may not use this file except in compliance with the License.
>>> + * You may obtain a copy of the License at:
>>> + *
>>> + *     http://www.apache.org/licenses/LICENSE-2.0
>>> + *
>>> + * Unless required by applicable law or agreed to in writing,
>>> software
>>> + * distributed under the License is distributed on an "AS IS" 
>>> BASIS,
>>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied.
>>> + * See the License for the specific language governing permissions
>>> and
>>> + * limitations under the License.
>>> + */
>>> +
>>> +#ifndef XDPSOCK_H
>>> +#define XDPSOCK_H 1
>>> +
>>> +#include <bpf/libbpf.h>
>>> +#include <bpf/xsk.h>
>>> +#include <errno.h>
>>> +#include <getopt.h>
>>> +#include <libgen.h>
>>> +#include <linux/bpf.h>
>>> +#include <linux/if_link.h>
>>> +#include <linux/if_xdp.h>
>>> +#include <linux/if_ether.h>
>>> +#include <locale.h>
>>> +#include <net/if.h>
>>> +#include <poll.h>
>>> +#include <pthread.h>
>>> +#include <signal.h>
>>> +#include <stdbool.h>
>>> +#include <stdio.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/resource.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/types.h>
>>> +#include <sys/mman.h>
>>> +#include <time.h>
>>> +#include <unistd.h>
>>> +
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +
>>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
>>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
>>> +#define BATCH_SIZE      NETDEV_MAX_BURST
>>
>> Move this item to the bottom, so you have FRAME specific define's 
>> first
>>
>>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
>>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
>>> +
>>> +#define NUM_FRAMES      4096
>>
>> Should we add a note/check to make sure this value is a power of 2?
>
> Sure, will do it.
>
>>
>>> +#define PROD_NUM_DESCS  512
>>> +#define CONS_NUM_DESCS  512
>>> +
>>> +#ifdef USE_XSK_DEFAULT
>>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
>>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
>>> +#endif
>>
>> Any reason for having this? Should we use the default values? They 
>> are
>> 4x larger than you have, did it make any difference in performance
>> results?
>> We could make it configurable like for DPDK, using the
>> n_txq_desc/n_rxq_desc option.
>
> Will add this option later.
>
>>
>>> +
>>> +typedef struct {
>>> +    atomic_int locked;
>>> +} ovs_spinlock_t;
>>> +
>>
>> Think we should move the ovs_spinlock code and includes to some 
>> global
>> place, maybe util or thread
>
> ok, will move to lib/spinlock.h
>
>>
>>> +/* LIFO ptr_array */
>>> +struct umem_pool {
>>> +    int index;      /* point to top */
>>> +    unsigned int size;
>>> +    ovs_spinlock_t mutex;
>>> +    void **array;   /* a pointer array, point to umem buf */
>>> +};
>>> +
>>> +/* array-based dp_packet_afxdp */
>>> +struct xpacket_pool {
>>> +    unsigned int size;
>>> +    struct dp_packet_afxdp **array;
>>> +};
>>> +
>>> +struct xsk_umem_info {
>>> +    struct umem_pool mpool;
>>> +    struct xpacket_pool xpool;
>>> +    struct xsk_ring_prod fq;
>>> +    struct xsk_ring_cons cq;
>>> +    struct xsk_umem *umem;
>>> +    void *buffer;
>>> +};
>>> +
>>> +struct xsk_socket_info {
>>> +    struct xsk_ring_cons rx;
>>> +    struct xsk_ring_prod tx;
>>> +    struct xsk_umem_info *umem;
>>> +    struct xsk_socket *xsk;
>>> +    unsigned long rx_npkts;
>>> +    unsigned long tx_npkts;
>>> +    unsigned long prev_rx_npkts;
>>> +    unsigned long prev_tx_npkts;
>>> +    uint32_t outstanding_tx;
>>> +};
>>> +
>>> +struct umem_elem {
>>> +    struct umem_elem *next;
>>> +};
>>> +
>>> +void __umem_elem_push(struct umem_pool *umemp, void *addr);
>>> +void umem_elem_push(struct umem_pool *umemp, void *addr);
>>> +int __umem_elem_push_n(struct umem_pool *umemp, int n, void 
>>> **addrs);
>>> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
>>> +
>>> +void *__umem_elem_pop(struct umem_pool *umemp);
>>> +void *umem_elem_pop(struct umem_pool *umemp);
>>> +int __umem_elem_pop_n(struct umem_pool *umemp, int n, void 
>>> **addrs);
>>> +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
>>> +
>>> +void **__umem_pool_alloc(unsigned int size);
>>> +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
>>> +void umem_pool_cleanup(struct umem_pool *umemp);
>>> +unsigned int umem_elem_count(struct umem_pool *mpool);
>>> +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
>>> +void xpacket_pool_cleanup(struct xpacket_pool *xp);
>>> +
>>
>> Think all the __umem_* function are only used internally so they 
>> should
>> be come static and be removed here.
>>
> OK.
>
> Regards,
> William


More information about the dev mailing list