[ovs-dev] [PATCHv8] netdev-afxdp: add new netdev type for AF_XDP.

William Tu u9012063 at gmail.com
Sat May 18 01:16:54 UTC 2019


Hi Eelco,

Thanks for all the feedbacks. There are some issues in driver, some
in libbpf, and some in my implementation. I will work on it ASAP.

On Fri, May 17, 2019 at 3:23 AM Eelco Chaudron <echaudro at redhat.com> wrote:
>
> Hi William,
>
> First a list of issues I found during some basic testing...
>
> - When I restart or stop OVS (using the systemctl interface as found in
> RHEL) it does not clean up the BFP program causing the restart to fail:
>
>    2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1
> reconfig fails
>    2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set
> interface eno1 new configuration
>
>    I need to manually run "ip link set dev eno1 xdp off" to make it
> recover.

I think this is a bug in libbpf, see
[PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown

>
>
> - When I remove a bridge, I get an emer in the revalidator:
>
>    2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp program
>    2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111:
> assertion !fd != !wevent failed in poll_create_node()
>
>    Easy to replicate with this:
>
>      $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0
> datapath_type=netdev
>      $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1
> type="afxdp" options:xdpmode=drv
>      $ ovs-vsctl del-br ovs_pvp_br0
>
Thanks I can reproduce it. Will make sure to fix it .

>
> - High pmd usage on the statistics, even with no packets is this
> expected?
>
>    $ ovs-appctl dpif-netdev/pmd-rxq-show
>    pmd thread numa_id 0 core_id 1:
>      isolated : false
>      port: dpdk0             queue-id:  0  pmd usage:  0 %
>      port: eno1              queue-id:  0  pmd usage: 49 %
>
>    It goes up slowly and gets stuck at 49%
>
>
> - When doing the PVP testing I noticed that the physical port has odd/no
>    tx statistics:
>
>    $ ovs-ofctl dump-ports ovs_pvp_br0
>    OFPST_PORT reply (xid=0x2): 3 ports
>      port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0,
> crc=0
>               tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>      port  eno1: rx pkts=103256197, bytes=6195630508, drop=0, errs=0,
> frame=0, over=0, crc=0
>               tx pkts=0, bytes=19789272440056, drop=0, errs=0, coll=0
>      port  tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0, frame=0,
> over=0, crc=0
>               tx pkts=4058, bytes=502504, drop=0, errs=0, coll=0
>
I think the ixgbe driver has some issue.
If you run skb-mode, I think the stats are correct.
See patch [1/2] ixgbe: fix AF_XDP tx byte count

>
> - Packets larger than 1028 bytes are dropped. Guess this needs to be
> fixed, and we need to state that jumbo frames are not supported. Are you
> planning on adding this?
>
>    Currently I can find not mentioning of MTU limitation in the
> documentation, or any code to prevent it from being changed above the
> supported limit.
>
>
> - ovs-vswitchd is still crashing or stops forwarding packets when trying
> to do
>    PVP testing with Qemu that has a TAP interface doing XDP and running
> packets
>    at wire speed to the 10G interface.
>
>    When trying with lower volume packets it seems to work, so with 1%
> traffic
>    rate, it forwards packets without any problems (148,771 pps). If I go
> to
>    10% the first couple of packet pass, then it stops forwarding. If
> it's not
>    crashing I still see packets being received by eno1 flow rules, but
> no
>    packets make it to the VM.
>
>      Program terminated with signal SIGSEGV, Segmentation fault.
>      #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send (xsk=0x0,
> batch=batch at entry=0x7fc928005570) at lib/netdev-afxdp.c:654
>      654            ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count,
> (void **)elems_pop);
>      [Current thread is 1 (Thread 0x7fc95e734700 (LWP 3926))]
>      Missing separate debuginfos, use: dnf debuginfo-install
> openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
>      (gdb) bt
>      #0  0x00000000009b2505 in netdev_linux_afxdp_batch_send (xsk=0x0,
> batch=batch at entry=0x7fc928005570) at lib/netdev-afxdp.c:654
>      #1  0x00000000009a1850 in netdev_linux_send (netdev_=0x2f7f540,
> qid=<optimized out>, batch=0x7fc928005570, concurrent_txq=<optimized
> out>) at lib/netdev-linux.c:1486
>      #2  0x0000000000906051 in netdev_send (netdev=<optimized out>,
> qid=qid at entry=0, batch=batch at entry=0x7fc928005570,
> concurrent_txq=concurrent_txq at entry=true)
>          at lib/netdev.c:797
>      #3  0x00000000008d2c94 in dp_netdev_pmd_flush_output_on_port
> (pmd=pmd at entry=0x7fc95e735010, p=p at entry=0x7fc928005540) at
> lib/dpif-netdev.c:4185
>      #4  0x00000000008d2faf in dp_netdev_pmd_flush_output_packets
> (pmd=pmd at entry=0x7fc95e735010, force=force at entry=false) at
> lib/dpif-netdev.c:4225
>      #5  0x00000000008db317 in dp_netdev_pmd_flush_output_packets
> (force=false, pmd=0x7fc95e735010) at lib/dpif-netdev.c:4280
>      #6  dp_netdev_process_rxq_port (pmd=pmd at entry=0x7fc95e735010,
> rxq=0x2f36c50, port_no=1) at lib/dpif-netdev.c:4280
>      #7  0x00000000008db67d in pmd_thread_main (f_=<optimized out>) at
> lib/dpif-netdev.c:5446
>      #8  0x000000000095c96d in ovsthread_wrapper (aux_=<optimized out>)
> at lib/ovs-thread.c:352
>      #9  0x00007fc9789d62de in start_thread () from
> /lib64/libpthread.so.0
>      #10 0x00007fc97817ba63 in clone () from /lib64/libc.so.6
>
>
> - make check-afxpd is failing for me, however, make check-kernel works
> fine.
>    Did not dive into it too much, but it fails here for all test cases,
> this is the same build I use for testing.
>
>    ./system-afxdp-traffic.at:4: ovs-vsctl -- add-br br0 -- set Bridge
> br0 datapath_type=netdev
> protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15
> fail-mode=secure  --
>    --- /dev/null        2019-05-16 09:09:33.445562692 -0400
>    +++
> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/at-groups/1/stderr        2019-05-17
> 05:46:20.506814939 -0400
>    @@ -0,0 +1,2 @@
>    +ovs-vsctl: Error detected while setting up 'br0'.  See ovs-vswitchd
> log for details.
>    +ovs-vsctl: The default log directory is
> "/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01".
>    ovsdb-server.log:
>   > 2019-05-17T09:46:20.437Z|00001|vlog|INFO|opened log file
> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovsdb-server.log
>   > 2019-05-17T09:46:20.441Z|00002|ovsdb_server|INFO|ovsdb-server (Open
> vSwitch) 2.11.90
>    ovs-vswitchd.log:
>   > 2019-05-17T09:46:20.461Z|00001|vlog|INFO|opened log file
> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovs-vswitchd.log
>   > 2019-05-17T09:46:20.462Z|00002|ovs_numa|INFO|Discovered 28 CPU cores
> on NUMA node 0
>   > 2019-05-17T09:46:20.462Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes
> and 28 CPU cores
>   >
> 2019-05-17T09:46:20.462Z|00004|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
> connecting...
>   >
> 2019-05-17T09:46:20.462Z|00005|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
> connected
>   > 2019-05-17T09:46:20.465Z|00006|bridge|INFO|ovs-vswitchd (Open
> vSwitch) 2.11.90
>   > 2019-05-17T09:46:20.505Z|00007|netdev_linux|WARN|ovs-netdev:
> creating tap device failed: Device or resource busy

I think the tap device is not cleared from previous settings.
Or do
#ip link del dev ovs-netdev
then check again.

>   > 2019-05-17T09:46:20.508Z|00008|dpif|WARN|datapath ovs-netdev already
> exists but cannot be opened: No such device
>   > 2019-05-17T09:46:20.508Z|00009|ofproto_dpif|ERR|failed to open
> datapath of type netdev: No such device
>   > 2019-05-17T09:46:20.508Z|00010|ofproto|ERR|failed to open datapath
> br0: No such device
>   > 2019-05-17T09:46:20.508Z|00011|bridge|ERR|failed to create bridge
> br0: No such device
>    1. system-afxdp-traffic.at:3:  FAILED (system-afxdp-traffic.at:4)
>
>
>
>
> The following might be useful when combining DPDK and AF_XDP:
>
>    Currently, DPDK and AF_XDP polling can be combined on a single PMD
> thread, it
>    might be nice to have an option to not do this, i.e. have separate
> PMD
>    threads for each type. I know we can do this with assigning specific
> PMDs to
>    queues, but this will disable auto-balancing. This will also help
> later if
>    we would add poll() mode support for AF_XDP.
>
>
> Other review comments see inline below. I reviewed the code, not the
> unit tests or automake changes.
>
<snip>

> > +.. note::
> > +   OVS AF_XDP netdev is using the userspace datapath, the same
> > datapath
> > +   as used by OVS-DPDK.  So it requires --disable-system for
> > ovs-vswitchd
> > +   and datapath_type=netdev when adding a new bridge.
>
> As mentioned earlier offline I think --disable-system can be removed as
> the Kernel and userspace datapath can be run at the same time.
>
Yes, thanks

> > +
> > +Make sure your device driver support AF_XDP, and to use 1 PMD (on
> > core 4)
> > +on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask,
> > +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or
> > "skb"::
>
> Wondering how options:xdpmode should operate without it being specified?
> I would prefer that if the option is not specified it would try drv, and
> if it fails fallback to skb.
>
by default it is using skb mode. I prefer skb-mode as default since
it has less driver-related issues.

> We need to add these new options to the vswitch.xml file

OK
>
> > +
> > +  ethtool -L enp2s0 combined 1
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp"
> > \
> > +    options:n_rxq=1 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:4"
> > +
> > +Or, use 4 pmds/cores and 4 queues by doing::
> > +
> > +  ethtool -L enp2s0 combined 4
> > +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> > +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp"
> > \
> > +    options:n_rxq=4 options:xdpmode=drv \
> > +    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> > +
>
> Add some text that pmd-rxq-affinity is not a requirement, the system
> will auto (re)assign.
> Also, note that cores used by pmd-rxq-affinity are not shared/used by
> floating PMDs.

Good point, thanks
>
> > +To validate that the bridge has successfully instantiated, you can
> > use the::
> > +
> > +  ovs-vsctl show
> > +
> > +should show something like::
> > +
> > +  Port "ens802f0"
> > +   Interface "ens802f0"
> > +      type: afxdp
> > +      options: {n_rxq="1", xdpmode=drv}
> > +
> > +Otherwise, enable debug by::
> > +
> > +  ovs-appctl vlog/set netdev_afxdp::dbg
> > +
> > +References
> > +----------
> > +Most of the design details are described in the paper presented at
> > +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> > +section 4, and slides[2][4].
> > +"The Path to DPDK Speeds for AF XDP"[3] gives a very good
> > introduction
> > +about AF_XDP current and future work.
> > +
> > +
> > +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> > +
> > +[2]
> > http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> > +
> > +[3]
> > http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> > +
> > +[4]
> > https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> > +
> > +
> > +Performance Tuning
> > +------------------
> > +The name of the game is to keep your CPU running in userspace,
> > allowing PMD
> > +to keep polling the AF_XDP queues without any interferences from
> > kernel.
> > +
> > +#. Make sure everything is in the same NUMA node (memory used by
> > AF_XDP, pmd
> > +   running cores, device plug-in slot)
>
> How can you do this? The code is not taking care of NUMA, and memory is
> allocated with posix_memalign so no idea which NUMA node it gets
> allocated.

right... I'm hoping that users can be aware of this and run ovs-vswitchd
using taskset -p <cpu mask>
So running the process on the correct NUMA node.

>
> > +#. Isolate your CPU by doing isolcpu at grub configure.
> > +
> > +#. IRQ should not set to pmd running core.
> > +
> > +#. The Spectre and Meltdown fixes increase the overhead of system
> > calls.
> > +
>
> Maybe be more consistent, either one or two newlines before a heading?
OK

<snip>

> > +
> > +Create OpenFlow rules::
> > +
> > +  ovs-vsctl add-port br0 tap0
>
> Maybe add tap as XDP or else it will be an AF_PACKET interface polling
> in the main thread.

I think it should work (XDP on tap). Let me try.
What's your concern here about polling AF_PACKET?

>
> > +  ovs-ofctl del-flows br0
> > +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> > +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> > +
> > +
> > +Attach the veth port to br0 (linux kernel mode)::
> > +
> > +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > +    set interface afxdp-p0 options:n_rxq=1 options:xdpmode=skb
> > +
>
> Remove the xdpmode=skb above... Also, see above on the PF_PACKET
> interface in the bridge_run(),
> I would advise against using this, and you might want to remove it.

OK
>
> > +
> > +Or, use AF_XDP with skb mode::
> > +
> > +  ovs-vsctl add-port br0 afxdp-p0 -- \
> > +    set interface afxdp-p0 type="afxdp" options:n_rxq=1
> > options:xdpmode=skb
> > +
> > +Setup the OpenFlow rules::
> > +
<snip>
> > diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> > index 0976a35e758b..7d086dc5e860 100644
> > --- a/lib/dp-packet.c
> > +++ b/lib/dp-packet.c
> > @@ -22,6 +22,9 @@
> >  #include "netdev-dpdk.h"
> >  #include "openvswitch/dynamic-string.h"
> >  #include "util.h"
> > +#ifdef HAVE_AF_XDP
> > +#include "netdev-afxdp.h"
> > +#endif
>
> Why the protection above? You do not do this in netdev-linux.c.
> Maybe you should move the #ifdef HAVE_AF_XDP inside the include file?
>

OK will fix it

> >  static void
> >  dp_packet_init__(struct dp_packet *b, size_t allocated, enum
> > dp_packet_source source)
> > @@ -59,6 +62,27 @@ dp_packet_use(struct dp_packet *b, void *base,
> > size_t allocated)
> >      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
> >  }
> >

<snip>

> > --- /dev/null
> > +++ b/lib/netdev-afxdp.c
> > @@ -0,0 +1,727 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#if !defined(__i386__) && !defined(__x86_64__)
> > +#error AF_XDP supported only for Linux on x86 or x86_64
>
> Any reason why we do not support PPC and ARM?
>
> > +#endif
> > +
> > +#include <config.h>
> > +
> > +#include "netdev-linux-private.h"
> > +#include "netdev-linux.h"
>
> Swap the two above, see comment in netdev-linux-private.h
>
> > +#include "netdev-afxdp.h"
> > +
> > +#include <arpa/inet.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <inttypes.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <linux/rtnetlink.h>
> > +#include <linux/sockios.h>
> > +#include <linux/if_xdp.h>
> > +#include <net/if.h>
> > +#include <net/if_arp.h>
> > +#include <net/route.h>
> > +#include <netinet/in.h>
> > +#include <netpacket/packet.h>
> > +#include <poll.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <sys/utsname.h>
> > +#include <unistd.h>
> > +
>
> Some of these includes are included by netdev-linux(-private).h already
> so why not remove them?

OK will remove them.
>
> > +#include "coverage.h"
> > +#include "dp-packet.h"
> > +#include "dpif-netlink.h"
> > +#include "dpif-netdev.h"
> > +#include "fatal-signal.h"
> > +#include "hash.h"
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "netlink-notifier.h"
> > +#include "netlink-socket.h"
> > +#include "netlink.h"
> > +#include "netnsid.h"
> > +#include "openflow/openflow.h"
> > +#include "openvswitch/dynamic-string.h"
> > +#include "openvswitch/hmap.h"
> > +#include "openvswitch/ofpbuf.h"
> > +#include "openvswitch/poll-loop.h"
> > +#include "openvswitch/vlog.h"
> > +#include "openvswitch/shash.h"
> > +#include "ovs-atomic.h"
> > +#include "packets.h"
> > +#include "rtnetlink.h"
> > +#include "socket-util.h"
> > +#include "sset.h"
> > +#include "tc.h"
> > +#include "timer.h"
> > +#include "unaligned.h"
> > +#include "util.h"
> > +#include "xdpsock.h"
> > +
> > +#ifndef SOL_XDP
> > +#define SOL_XDP 283
> > +#endif
> > +#ifndef AF_XDP
> > +#define AF_XDP 44
> > +#endif
> > +#ifndef PF_XDP
> > +#define PF_XDP AF_XDP
> > +#endif
>
> Do we really need to include the above? Or should we update the install
> instruction to move them over from the kernel headers?
>
I think we only need SOL_XDP.

> > +
> > +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> > +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> > +
> > +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char
> > *)base))
> > +#define UMEM2XPKT(base, i) \
> > +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base
> > + \
> > +                               i * sizeof(struct dp_packet_afxdp))
> > +

<snip>
Some comments here about umem and queue are discussed later with Ilya.
Will address them together.

> > +int
> > +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > +                              struct dp_packet_batch *batch)
> > +{
>
> See Ilya's comment on thread safety on the ring APIs.
>
> > +    struct umem_elem *elems_pop[BATCH_SIZE];
> > +    struct umem_elem *elems_push[BATCH_SIZE];
> > +    uint32_t tx_done, idx_cq = 0;
> > +    struct dp_packet *packet;
> > +    uint32_t idx = 0;
> > +    int j, ret, retry_count = 0;
> > +    const int max_retry = 4;
> > +
> > +    ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void
> > **)elems_pop);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        return EAGAIN;
> > +    }
> > +
> > +    /* Make sure we have enough TX descs */
> > +    ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > +    if (OVS_UNLIKELY(ret == 0)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
> > **)elems_pop);
> > +        return EAGAIN;
> > +    }
> > +
> > +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        struct umem_elem *elem;
> > +        uint64_t index;
> > +
> > +        elem = elems_pop[i];
> > +        /* Copy the packet to the umem we just pop from umem pool.
> > +         * We can avoid this copy if the packet and the pop umem
> > +         * are located in the same umem.
> > +         */
>
> The comment mentions the copy can be avoided, but it's not implemented
> in the code, is this correct or was something removed?
>
> > +        memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));

it's not implemented yet. now it's always making a copy
> > +
> > +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > +        xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > +            = dp_packet_size(packet);
> > +    }
> > +    xsk_ring_prod__submit(&xsk->tx, batch->count);
> > +    xsk->outstanding_tx += batch->count;
> > +
> > +    ret = kick_tx(xsk);
> > +    if (OVS_UNLIKELY(ret)) {
> > +        umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
> > **)elems_pop);
> > +        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > +                     ovs_strerror(ret));
> > +        return ret;
>
> I think we should still try to recover the CQ below, even on failure.
>
> > +    }
> > +
> > +retry:
> > +    /* Process CQ */
> > +    tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count,
> > &idx_cq);
> > +    if (tx_done > 0) {
> > +        xsk->outstanding_tx -= tx_done;
> > +        xsk->tx_npkts += tx_done;
> > +    }
> > +
> > +    /* Recycle back to umem pool */
> > +    for (j = 0; j < tx_done; j++) {
> > +        struct umem_elem *elem;
> > +        uint64_t addr;
> > +
> > +        addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > +
> > +        elem = ALIGNED_CAST(struct umem_elem *,
> > +                            (char *)xsk->umem->buffer + addr);
> > +        elems_push[j] = elem;
> > +    }
> > +
> > +    ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void
> > **)elems_push);
> > +    ovs_assert(ret == 0);
> > +
> > +    xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> > +
> > +    if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2))
> > {
> > +        /* If there are still a lot not transmitted, try harder. */
> > +        if (retry_count++ > max_retry) {
> > +            return 0;
> > +        }
> > +        goto retry;
> > +    }
> > +
>
> I think the code above is causing my lockup at wire speed mentioned
> above...
> I guess the retry_count expires every transmit sending packets to the
> TAP interface.
> No all buffers are used... This is causing the umem_elem_pop_n() in the
> beginning to fail, hence the buffers are never returned!
>
> Guess we might need some reclaim in the beginning, or maybe even in the
> rx loop?

right, let me re-work this part of code.

> > +    return 0;
> > +}
> > diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> > new file mode 100644
> > index 000000000000..6518d8fca0b5
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.h
> > @@ -0,0 +1,53 @@
> > +/*
> > + * Copyright (c) 2018 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_AFXDP_H
> > +#define NETDEV_AFXDP_H 1
> > +
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +/* These functions are Linux AF_XDP specific, so they should be used
> > directly
> > + * only by Linux-specific code. */
>
> Extra enter?
>
> > +#define MAX_XSKQ 16
>
> Extra enter?
>

OK

> > +struct netdev;
> > +struct xsk_socket_info;
> > +struct xdp_umem;
> > +struct dp_packet_batch;
> > +struct smap;
> > +struct dp_packet;
> > +
> > +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct dp_packet
> > *d);
> > +
> > +int xsk_configure_all(struct netdev *netdev);
> > +
> > +void xsk_destroy_all(struct netdev *netdev);
> > +
> > +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> > +                         struct dp_packet_batch *batch);
> > +
> > +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > +                                  struct dp_packet_batch *batch);
> > +
> > +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap
> > *args,
> > +                            char **errp);
> > +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap
> > *args);
> > +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> > +
> > +void free_afxdp_buf(struct dp_packet *p);
> > +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
> > +int netdev_afxdp_reconfigure(struct netdev *netdev);
> > +#endif /* netdev-afxdp.h */
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > new file mode 100644
> > index 000000000000..3dd3d902b3c4
> > --- /dev/null
> > +++ b/lib/netdev-linux-private.h
> > @@ -0,0 +1,124 @@
> > +/*
> > + * Copyright (c) 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_LINUX_PRIVATE_H
> > +#define NETDEV_LINUX_PRIVATE_H 1
> > +
> > +#include <config.h>
> > +
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
>
> Why include all the above? They where just added to netdev-linux.h, so
> if you make sure you include netdev-lunux.h before -private it should
> work out.
>
> > +
> > +#if HAVE_AF_XDP
> > +#include "netdev-afxdp.h"
> > +#endif
>
> See earlier comment
>
> > +
> > +/* These functions are Linux specific, so they should be used
> > directly only by
> > + * Linux-specific code. */
> > +
> > +struct netdev;
> > +
> > +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t
> > flag,
> > +                                  const char *flag_name, bool
> > enable);
> > +int linux_get_ifindex(const char *netdev_name);
> > +
>
> These functions are now both specified in netdev-linux.h and
> netdev-linux-private.h
>
> > +#define LINUX_FLOW_OFFLOAD_API                          \
> > +   .flow_flush = netdev_tc_flow_flush,                  \
> > +   .flow_dump_create = netdev_tc_flow_dump_create,      \
> > +   .flow_dump_destroy = netdev_tc_flow_dump_destroy,    \
> > +   .flow_dump_next = netdev_tc_flow_dump_next,          \
> > +   .flow_put = netdev_tc_flow_put,                      \
> > +   .flow_get = netdev_tc_flow_get,                      \
> > +   .flow_del = netdev_tc_flow_del,                      \
> > +   .init_flow_api = netdev_tc_init_flow_api
> > +
>
> Same here, this define is in both include files.
>
> > +struct netdev_linux {
> > +    struct netdev up;
> > +
> > +    /* Protects all members below. */
> > +    struct ovs_mutex mutex;
> > +
> > +    unsigned int cache_valid;
> > +
> > +    bool miimon;                    /* Link status of last poll. */
> > +    long long int miimon_interval;  /* Miimon Poll rate. Disabled if
> > <= 0. */
> > +    struct timer miimon_timer;
> > +
> > +    int netnsid;                    /* Network namespace ID. */
> > +    /* The following are figured out "on demand" only.  They are only
> > valid
> > +     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > +    int ifindex;
> > +    struct eth_addr etheraddr;
> > +    int mtu;
> > +    unsigned int ifi_flags;
> > +    long long int carrier_resets;
> > +    uint32_t kbits_rate;        /* Policing data. */
> > +    uint32_t kbits_burst;
> > +    int vport_stats_error;      /* Cached error code from
> > vport_get_stats().
> > +                                   0 or an errno value. */
> > +    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> > +                                 * or SIOCSIFMTU.
> > +                                 */
> > +    int ether_addr_error;       /* Cached error code from set/get
> > etheraddr. */
> > +    int netdev_policing_error;  /* Cached error code from set
> > policing. */
> > +    int get_features_error;     /* Cached error code from
> > ETHTOOL_GSET. */
> > +    int get_ifindex_error;      /* Cached error code from
> > SIOCGIFINDEX. */
> > +
> > +    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > +    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > +
> > +    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO.
> > */
> > +    struct tc *tc;
> > +
> > +    /* For devices of class netdev_tap_class only. */
> > +    int tap_fd;
> > +    bool present;               /* If the device is present in the
> > namespace */
> > +    uint64_t tx_dropped;        /* tap device can drop if the iface
> > is down */
> > +
> > +    /* LAG information. */
> > +    bool is_lag_master;         /* True if the netdev is a LAG
> > master. */
> > +
> > +    /* AF_XDP information */
> > +#ifdef HAVE_AF_XDP
> > +    struct xsk_socket_info *xsk[MAX_XSKQ];
> > +    int requested_n_rxq;
> > +    int xdpmode, requested_xdpmode; /* detect mode changed */
> > +    int xdp_flags, xdp_bind_flags;
> > +#endif
> > +};
> > +
> > +static struct netdev_linux *
> > +netdev_linux_cast(const struct netdev *netdev)
> > +{
>
> In the original definition there was an assert() here, was it removed by
> accident?
> netdev_linux_rxq_xsk
> > +    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > +}
> > +
> > +#endif /* netdev-linux-private.h */
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index f75d73fd39f8..1f190406d145 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -17,6 +17,7 @@
> >  #include <config.h>
> >
> >  #include "netdev-linux.h"
> > +#include "netdev-linux-private.h"
> >
> >  #include <errno.h>
> >  #include <fcntl.h>
> > @@ -54,6 +55,7 @@
> >  #include "fatal-signal.h"
> >  #include "hash.h"
> >  #include "openvswitch/hmap.h"
> > +#include "netdev-afxdp.h"
> >  #include "netdev-provider.h"
> >  #include "netdev-tc-offloads.h"
> >  #include "netdev-vport.h"
> > @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
> >  static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int
> > mtu);
> >  static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t
> > burst_bytes);
> >
> > -struct netdev_linux {
> > -    struct netdev up;
> > -
> > -    /* Protects all members below. */
> > -    struct ovs_mutex mutex;
> > -
> > -    unsigned int cache_valid;
> > -
> > -    bool miimon;                    /* Link status of last poll. */
> > -    long long int miimon_interval;  /* Miimon Poll rate. Disabled if
> > <= 0. */
> > -    struct timer miimon_timer;
> > -
> > -    int netnsid;                    /* Network namespace ID. */
> > -    /* The following are figured out "on demand" only.  They are only
> > valid
> > -     * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > -    int ifindex;
> > -    struct eth_addr etheraddr;
> > -    int mtu;
> > -    unsigned int ifi_flags;
> > -    long long int carrier_resets;
> > -    uint32_t kbits_rate;        /* Policing data. */
> > -    uint32_t kbits_burst;
> > -    int vport_stats_error;      /* Cached error code from
> > vport_get_stats().
> > -                                   0 or an errno value. */
> > -    int netdev_mtu_error;       /* Cached error code from SIOCGIFMTU
> > or SIOCSIFMTU. */
> > -    int ether_addr_error;       /* Cached error code from set/get
> > etheraddr. */
> > -    int netdev_policing_error;  /* Cached error code from set
> > policing. */
> > -    int get_features_error;     /* Cached error code from
> > ETHTOOL_GSET. */
> > -    int get_ifindex_error;      /* Cached error code from
> > SIOCGIFINDEX. */
> > -
> > -    enum netdev_features current;    /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > -    enum netdev_features supported;  /* Cached from ETHTOOL_GSET. */
> > -
> > -    struct ethtool_drvinfo drvinfo;  /* Cached from ETHTOOL_GDRVINFO.
> > */
> > -    struct tc *tc;
> > -
> > -    /* For devices of class netdev_tap_class only. */
> > -    int tap_fd;
> > -    bool present;               /* If the device is present in the
> > namespace */
> > -    uint64_t tx_dropped;        /* tap device can drop if the iface
> > is down */
> > -
> > -    /* LAG information. */
> > -    bool is_lag_master;         /* True if the netdev is a LAG
> > master. */
> > -};
> >
> >  struct netdev_rxq_linux {
> >      struct netdev_rxq up;
> > @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct netdev_class
> > *netdev_class)
> >      return netdev_class->run == netdev_linux_run;
> >  }
> >
> > +#if HAVE_AF_XDP
> >  static bool
> > -is_tap_netdev(const struct netdev *netdev)
> > +is_afxdp_netdev(const struct netdev *netdev)
> >  {
> > -    return netdev_get_class(netdev) == &netdev_tap_class;
> > +    return netdev_get_class(netdev) == &netdev_afxdp_class;
> >  }
> > -
> > -static struct netdev_linux *
> > -netdev_linux_cast(const struct netdev *netdev)
> > +#else
> > +static bool
> > +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
> >  {
> > -    ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> > -
> > -    return CONTAINER_OF(netdev, struct netdev_linux, up);
> > +    return false;
> > +}
> > +#endif
> > +static bool
> > +is_tap_netdev(const struct netdev *netdev)
> > +{
> > +    return netdev_get_class(netdev) == &netdev_tap_class;
> >  }
> >
> >  static struct netdev_rxq_linux *
> > @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_)
> >          atomic_count_dec(&miimon_cnt);
> >      }
> >
> > +#if HAVE_AF_XDP
> > +    if (is_afxdp_netdev(netdev_)) {
> > +        xsk_destroy_all(netdev_);
> > +    }
> > +#endif
>
> Think you can remove the HAVE_AF_XDP here, as you do not use it below
> either.

Yes

>
> >      ovs_mutex_destroy(&netdev->mutex);
> >  }
> >
> > @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq
> > *rxq_)
> >      rx->is_tap = is_tap_netdev(netdev_);
> >      if (rx->is_tap) {
> >          rx->fd = netdev->tap_fd;
> > -    } else {
> > +    } else if (!is_afxdp_netdev(netdev_)) {
> >          struct sockaddr_ll sll;
> >          int ifindex, val;
> >          /* Result of tcpdump -dd inbound */
> > @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_,
> > struct dp_packet_batch *batch,
> >  {
> >      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> >      struct netdev *netdev = rx->up.netdev;
> > -    struct dp_packet *buffer;
> > +    struct dp_packet *buffer = NULL;
> >      ssize_t retval;
> >      int mtu;
> >
> > +#if HAVE_AF_XDP
>
> Think this #if HAVE_AF_XDP can be removed as the compiler should
> optimize out the if (false).
>
> > +    if (is_afxdp_netdev(netdev)) {
> > +        struct netdev_linux *dev = netdev_linux_cast(netdev);
> > +        int qid = rxq_->queue_id;
> > +
> > +        return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
> > +    }
> > +#endif
> >      if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
> >          mtu = ETH_PAYLOAD_MAX;
> >      }
> > @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_,
> > struct dp_packet_batch *batch,
> >      /* Assume Ethernet port. No need to set packet_type. */
> >      buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> >                                             DP_NETDEV_HEADROOM);
> > +
> >      retval = (rx->is_tap
> >                ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
> >                : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> > @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int
> > qid OVS_UNUSED,
> >      int error = 0;
> >      int sock = 0;
> >
> > -    if (!is_tap_netdev(netdev_)) {
> > +    if (!is_tap_netdev(netdev_) &&
> > +        !is_afxdp_netdev(netdev_)) {
> >          if
> > (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
> >              error = EOPNOTSUPP;
> >              goto free_batch;
> > @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int
> > qid OVS_UNUSED,
> >          }
> >
> >          error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> > +#if HAVE_AF_XDP
>
> Same here remove the #if HAVE_AF_XDP
>
> > +    } else if (is_afxdp_netdev(netdev_)) {
> > +        struct netdev_linux *dev = netdev_linux_cast(netdev_);
> > +        struct dp_packet_afxdp *xpacket;
> > +        struct umem_pool *first_mpool;
> > +        struct dp_packet *packet;
> > +
> > +        error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
> > +
> > +        /* all packets must come frome the same umem pool
> > +         * and has DPBUF_AFXDP type, otherwise free on-by-one
> > +         */
> > +        DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +            if (packet->source != DPBUF_AFXDP) {
> > +                goto free_batch;
> > +            }
> > +
> > +            xpacket = dp_packet_cast_afxdp(packet);
> > +            if (i == 0) {
> > +                first_mpool = xpacket->mpool;
> > +                continue;
> > +            }
> > +            if (xpacket->mpool != first_mpool) {
> > +                goto free_batch;
> > +            }
> > +        }
>
> Why do not we not move all the packet type checks to
> free_afxdp_buf_batch()?

Here I plan to move them into the afxdp-specific send function.
So it won't be part of netdev_linux_send(), hopefully it's more clean.
>
> > +        /* free in batch */
> > +        free_afxdp_buf_batch(batch);
> > +        return error;
> > +#endif
> >      } else {
> >          error = netdev_linux_tap_batch_send(netdev_, batch);
> >      }
> > @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "system",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> >      .get_stats = netdev_linux_get_stats,
> >      .get_features = netdev_linux_get_features,
> > @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = {
> >  const struct netdev_class netdev_tap_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      .type = "tap",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct_tap,
> >      .get_stats = netdev_tap_get_stats,
> >      .get_features = netdev_linux_get_features,
> > @@ -3343,10 +3352,26 @@ const struct netdev_class
> > netdev_internal_class = {
> >      NETDEV_LINUX_CLASS_COMMON,
> >      LINUX_FLOW_OFFLOAD_API,
> >      .type = "internal",
> > +    .is_pmd = false,
> >      .construct = netdev_linux_construct,
> >      .get_stats = netdev_internal_get_stats,
> >      .get_status = netdev_internal_get_status,
> >  };
> > +
> > +#ifdef HAVE_AF_XDP
> > +const struct netdev_class netdev_afxdp_class = {
> > +    NETDEV_LINUX_CLASS_COMMON,
> > +    .type = "afxdp",
> > +    .is_pmd = true,
> > +    .construct = netdev_linux_construct,
> > +    .get_stats = netdev_linux_get_stats,
> > +    .get_status = netdev_linux_get_status,
> > +    .set_config = netdev_afxdp_set_config,
> > +    .get_config = netdev_afxdp_get_config,
> > +    .reconfigure = netdev_afxdp_reconfigure,
> > +    .get_numa_id = netdev_afxdp_get_numa_id,
> > +};
> > +#endif
> >
> >
> >  #define CODEL_N_QUEUES 0x0000
> > diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
> > index 17ca9120168a..b812e64cb078 100644
> > --- a/lib/netdev-linux.h
> > +++ b/lib/netdev-linux.h
> > @@ -19,6 +19,20 @@
> >
> >  #include <stdint.h>
> >  #include <stdbool.h>
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
>
> Is there a reason why you move all these includes here? If there is you
> might as well remove the duplicates from .c files that include
> netdev-linux.h, for example, netdev-linux.c

<snip>
Will work on this part later.

> > +static inline void
> > +ovs_spin_unlock(ovs_spinlock_t *sl)
> > +{
> > +    atomic_store_explicit(&sl->locked, 0, memory_order_release);
> > +}
> > +
> > +static inline int OVS_UNUSED
> > +ovs_spin_trylock(ovs_spinlock_t *sl)
> > +{
> > +    int exp = 0;
> > +    return atomic_compare_exchange_strong_explicit(&sl->locked, &exp,
> > 1,
> > +                memory_order_acquire,
> > +                memory_order_relaxed);
> > +}
>
> Move spinlock function out to a common file
>
OK, I plan to add lib/spinlock.h and move to it.

> > +
> > +inline int
> > +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
>
> This is a stack overflow
>
> > +        return -ENOMEM;
> > +    }
> > +
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(ptr, addrs, n * sizeof(void *));
> > +    umemp->index += n;
> > +
> > +    return 0;
> > +}
> > +
> > +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ret = __umem_elem_push_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +inline void
> > +__umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +    umemp->array[umemp->index++] = addr;
> > +}
> > +
> > +void
> > +umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +
> > +    if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
> > +        /* stack is overflow, this should not happen */
> > +        OVS_NOT_REACHED();
> > +    }
>
> Should this not be moved after the spinlock, i.e. to __umem_elem_push
OK

>
> > +
> > +    ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    __umem_elem_push(umemp, addr);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +}
> > +
> > +inline int
> > +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    void *ptr;
> > +
> > +    if (OVS_UNLIKELY(umemp->index - n < 0)) {
> > +        return -ENOMEM;
> > +    }
> > +
> > +    umemp->index -= n;
> > +    ptr = &umemp->array[umemp->index];
> > +    memcpy(addrs, ptr, n * sizeof(void *));
> > +
> > +    return 0;
> > +}
> > +
> > +int
> > +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > +    int ret;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ret = __umem_elem_pop_n(umemp, n, addrs);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ret;
> > +}
> > +
> > +inline void *
> > +__umem_elem_pop(struct umem_pool *umemp)
> > +{
>
> There is no check here to see if there are actual any elements left,
> like there is for pop_n,
> so we could corrupt memory/umem_pool
>
Yes, I will add a check.

> > +    return umemp->array[--umemp->index];
> > +}
> > +
> > +void *
> > +umem_elem_pop(struct umem_pool *umemp)
> > +{
> > +    void *ptr;
> > +
> > +    ovs_spin_lock(&umemp->mutex);
> > +    ptr = __umem_elem_pop(umemp);
> > +    ovs_spin_unlock(&umemp->mutex);
> > +
> > +    return ptr;
> > +}
> > +
> > +void **
> > +__umem_pool_alloc(unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> > +                              size * sizeof(void *)) == 0);
>
> We should not assert, just return NULL here.
>
> > +    memset(bufs, 0, size * sizeof(void *));
> > +    return (void **)bufs;
> > +}
> > +
> > +unsigned int
> > +umem_elem_count(struct umem_pool *mpool)
> > +{
> > +    return mpool->index;
> > +}
> > +
> > +int
> > +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> > +{
> > +    umemp->array = __umem_pool_alloc(size);
> > +    if (!umemp->array) {
> > +        OVS_NOT_REACHED();
>
> If NULL is returned return ENOMEM
>
> > +    }
> > +
> > +    umemp->size = size;
> > +    umemp->index = 0;
> > +    ovs_spinlock_init(&umemp->mutex);
> > +    return 0;
> > +}
> > +
> > +void
> > +umem_pool_cleanup(struct umem_pool *umemp)
> > +{
> > +    free(umemp->array);
>         umemp->array = NULL;
> > +}
> > +
> > +/* AF_XDP metadata init/destroy */
> > +int
> > +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> > +{
> > +    void *bufs;
> > +
> > +    /* TODO: check HAVE_POSIX_MEMALIGN  */
>
> Guess the above needs to be done
>
> > +    ovs_assert(posix_memalign(&bufs, getpagesize(),
> > +                              size * sizeof(struct dp_packet_afxdp))
> > == 0);
>
> We should not assert, just return false
>
> > +    memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> > +
> > +    xp->array = bufs;
> > +    xp->size = size;
> > +    return 0;
> > +}
> > +
> > +void
> > +xpacket_pool_cleanup(struct xpacket_pool *xp)
> > +{
> > +    free(xp->array);
>         xp->array = NULL;
> > +}
> > diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> > new file mode 100644
> > index 000000000000..aabaa8e5df24
> > --- /dev/null
> > +++ b/lib/xdpsock.h
> > @@ -0,0 +1,123 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef XDPSOCK_H
> > +#define XDPSOCK_H 1
> > +
> > +#include <bpf/libbpf.h>
> > +#include <bpf/xsk.h>
> > +#include <errno.h>
> > +#include <getopt.h>
> > +#include <libgen.h>
> > +#include <linux/bpf.h>
> > +#include <linux/if_link.h>
> > +#include <linux/if_xdp.h>
> > +#include <linux/if_ether.h>
> > +#include <locale.h>
> > +#include <net/if.h>
> > +#include <poll.h>
> > +#include <pthread.h>
> > +#include <signal.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/resource.h>
> > +#include <sys/socket.h>
> > +#include <sys/types.h>
> > +#include <sys/mman.h>
> > +#include <time.h>
> > +#include <unistd.h>
> > +
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +
> > +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
> > +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
> > +#define BATCH_SIZE      NETDEV_MAX_BURST
>
> Move this item to the bottom, so you have FRAME specific define's first
>
> > +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
> > +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
> > +
> > +#define NUM_FRAMES      4096
>
> Should we add a note/check to make sure this value is a power of 2?

Sure, will do it.

>
> > +#define PROD_NUM_DESCS  512
> > +#define CONS_NUM_DESCS  512
> > +
> > +#ifdef USE_XSK_DEFAULT
> > +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> > +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> > +#endif
>
> Any reason for having this? Should we use the default values? They are
> 4x larger than you have, did it make any difference in performance
> results?
> We could make it configurable like for DPDK, using the
> n_txq_desc/n_rxq_desc option.

Will add this option later.

>
> > +
> > +typedef struct {
> > +    atomic_int locked;
> > +} ovs_spinlock_t;
> > +
>
> Think we should move the ovs_spinlock code and includes to some global
> place, maybe util or thread

ok, will move to lib/spinlock.h

>
> > +/* LIFO ptr_array */
> > +struct umem_pool {
> > +    int index;      /* point to top */
> > +    unsigned int size;
> > +    ovs_spinlock_t mutex;
> > +    void **array;   /* a pointer array, point to umem buf */
> > +};
> > +
> > +/* array-based dp_packet_afxdp */
> > +struct xpacket_pool {
> > +    unsigned int size;
> > +    struct dp_packet_afxdp **array;
> > +};
> > +
> > +struct xsk_umem_info {
> > +    struct umem_pool mpool;
> > +    struct xpacket_pool xpool;
> > +    struct xsk_ring_prod fq;
> > +    struct xsk_ring_cons cq;
> > +    struct xsk_umem *umem;
> > +    void *buffer;
> > +};
> > +
> > +struct xsk_socket_info {
> > +    struct xsk_ring_cons rx;
> > +    struct xsk_ring_prod tx;
> > +    struct xsk_umem_info *umem;
> > +    struct xsk_socket *xsk;
> > +    unsigned long rx_npkts;
> > +    unsigned long tx_npkts;
> > +    unsigned long prev_rx_npkts;
> > +    unsigned long prev_tx_npkts;
> > +    uint32_t outstanding_tx;
> > +};
> > +
> > +struct umem_elem {
> > +    struct umem_elem *next;
> > +};
> > +
> > +void __umem_elem_push(struct umem_pool *umemp, void *addr);
> > +void umem_elem_push(struct umem_pool *umemp, void *addr);
> > +int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> > +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +void *__umem_elem_pop(struct umem_pool *umemp);
> > +void *umem_elem_pop(struct umem_pool *umemp);
> > +int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +void **__umem_pool_alloc(unsigned int size);
> > +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> > +void umem_pool_cleanup(struct umem_pool *umemp);
> > +unsigned int umem_elem_count(struct umem_pool *mpool);
> > +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> > +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> > +
>
> Think all the __umem_* function are only used internally so they should
> be come static and be removed here.
>
OK.

Regards,
William


More information about the dev mailing list