[ovs-dev] [PATCHv13] netdev-afxdp: add new netdev type for AF_XDP.

Ilya Maximets i.maximets at samsung.com
Thu Jun 27 17:07:23 UTC 2019


Just a few comments inline.

Best regards, Ilya Maximets.

On 19.06.2019 22:51, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It aims to have comparable
> performance to DPDK but cooperate better with the existing kernel networking
> stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> For more details about AF_XDP, please see the Linux kernel's
> Documentation/networking/af_xdp.rst.  Note that by default, this feature is
> not compiled in.
> 
> Signed-off-by: William Tu <u9012063 at gmail.com>
> ---
> v1->v2:
> - add a list to maintain unused umem elements
> - remove copy from rx umem to ovs internal buffer
> - use hugetlb to reduce misses (not much difference)
> - use pmd mode netdev in OVS (huge performance improve)
> - remove malloc dp_packet, instead put dp_packet in umem
> 
> v2->v3:
> - rebase on the OVS master, 7ab4b0653784
>   ("configure: Check for more specific function to pull in pthread library.")
> - remove the dependency on libbpf and dpif-bpf.
>   instead, use the built-in XDP_ATTACH feature.
> - data structure optimizations for better performance, see[1]
> - more test cases support
> v3: https://mail.openvswitch.org/pipermail/ovs-dev/2018-November/354179.html
> 
> v3->v4:
> - Use AF_XDP API provided by libbpf
> - Remove the dependency on XDP_ATTACH kernel patch set
> - Add documentation, bpf.rst
> 
> v4->v5:
> - rebase to master
> - remove rfc, squash all into a single patch
> - add --enable-afxdp, so by default, AF_XDP is not compiled
> - add options: xdpmode=drv,skb
> - add multiple queue and multiple PMD support, with options: n_rxq
> - improve documentation, rename bpf.rst to af_xdp.rst
> 
> v5->v6
> - rebase to master, commit 0cdd5b13de91b98
> - address errors from sparse and clang
> - pass travis-ci test
> - address feedback from Ben
> - fix issues reported by 0-day robot
> - improved documentation
> 
> v6-v7
> - rebase to master, commit abf11558c1515bf3b1
> - address feedback from Ilya, Ben, and Eelco, see:
>   https://www.mail-archive.com/ovs-dev@openvswitch.org/msg32357.html
> - add XDP mode change, implement get/set_config, reconfigure
> - Fix reconfiguration/crash issue caused by libbpf, see patch:
>   [PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
> - perf optimization for batching umem_push/pop
> - perf optimization for batching kick_tx
> - test build with dpdk
> - fix/refactor atomic operation
> - make AF_XDP x86 specific, otherwise fail at build time
> - lots of code refactoring
> - add PVP setup in documentation
> 
> v7-v8:
> - Address feedback from Ilya at:
>   https://patchwork.ozlabs.org/patch/1095019/
> - add netdev-linux-private.h
> - fix afxdp reconfigure issue
> - sort include headers
> - remove unnecessary OVS_UNUSED
> - coding style fixes
> - error case handling and memory leak
> 
> v8-v9:
> - rebase to master 180bbbed3a3867d52
> - Address review feedback from Ben, Ilya and Eelco, at:
>   https://patchwork.ozlabs.org/patch/1097740/
> - == From Ilya ==
> - Optimize the reconfiguration logic
> - Implement .rxq_recv and .send for afxdp
> - Remove system-afxdp-traffic.at, reuse existing code
> - Use Ilya's rdtsc code
> - remove --disable-system
> - == From Eelco ==
> - Fix bug when remove br0, util(revalidator49)|EMER|lib/poll-loop.c:111:
>   assertion !fd != !wevent failed
> - Fix bug and use default value from libbpf, ex: XSK_RING_PROD__DEFAULT...
> - Clear xdp program when receive signal, ctrl+c
> - Add options to vswitch.xml, set xdpmode default to skb-mode
> - No support for ARM and PPC, now x86_64 only
> - remove redundant header includes and function/macro definitions
> - remove some ifdef HAVE_AF_XDP
> - == From others/both about afxdp rx and tx ==
> - Several umem push/pop error handling improvement/fixes
> - add lock to address concurrent_txq case
> - improve error handling
> - add stats
> - Things that are not done yet
> - MTU limitation
> - n_txq_desc/n_rxq_desc option.
> 
> v9-v10
> - remove x86_64 limitation, suggested by Ben and Eelco
> - add xmalloc_pagealign, free_pagealign
> - minor refactor
> 
> v10-v11
> - address feedback from Ilya at
>   https://patchwork.ozlabs.org/patch/1106495/
> - fix typos, and some refactoring
> - refactor existing code and introduce xmalloc pagealign
> - fix a couple of error handling cases
> - allocate per-txq lock
> - dynamically allocate xsk array
> - fix cycle_counter_update() for non-x86/non-linux case
> 
> v11-v12
> - mainly address a couple of crashes reported by Eelco
>   https://patchwork.ozlabs.org/patch/1110729/
> - fix cleanup xdp program problem when ovs-vswitchd restarts
> - following cases should remove xdp program
>   - kill `pidof ovs-vswitchd`
>   - ovs-appctl -t ovs-vswitchd exit --cleanup
>   - note: ovs-ctl restart does not have "--cleanup" so still an issue
> - work around issues of xsk_ring_cons__peek at libbpf, reported at
>   https://marc.info/?l=xdp-newbies&m=156055471727857&w=2
> - variable name refactoring
> - there is some performance degradation, but let's make sure
>   everything works first
> 
> v12-v13
> - rebase to master
> - add coverage counter afxdp_cq_empty, afxdp_fq_full
> - minor refactoring
> ---
>  Documentation/automake.mk             |   1 +
>  Documentation/index.rst               |   1 +
>  Documentation/intro/install/afxdp.rst | 425 ++++++++++++++++
>  Documentation/intro/install/index.rst |   1 +
>  acinclude.m4                          |  35 ++
>  configure.ac                          |   1 +
>  lib/automake.mk                       |  14 +
>  lib/dp-packet.c                       |  28 ++
>  lib/dp-packet.h                       |  18 +-
>  lib/dpif-netdev-perf.h                |  26 +
>  lib/netdev-afxdp.c                    | 891 ++++++++++++++++++++++++++++++++++
>  lib/netdev-afxdp.h                    |  74 +++
>  lib/netdev-linux-private.h            | 138 ++++++
>  lib/netdev-linux.c                    | 121 ++---
>  lib/netdev-provider.h                 |   3 +
>  lib/netdev.c                          |  11 +
>  lib/spinlock.h                        |  70 +++
>  lib/util.c                            |  92 +++-
>  lib/util.h                            |   5 +
>  lib/xdpsock.c                         | 170 +++++++
>  lib/xdpsock.h                         | 101 ++++
>  tests/automake.mk                     |  16 +
>  tests/system-afxdp-macros.at          |  20 +
>  tests/system-afxdp-testsuite.at       |  26 +
>  vswitchd/vswitch.xml                  |  30 ++
>  25 files changed, 2210 insertions(+), 108 deletions(-)
>  create mode 100644 Documentation/intro/install/afxdp.rst
>  create mode 100644 lib/netdev-afxdp.c
>  create mode 100644 lib/netdev-afxdp.h
>  create mode 100644 lib/netdev-linux-private.h
>  create mode 100644 lib/spinlock.h
>  create mode 100644 lib/xdpsock.c
>  create mode 100644 lib/xdpsock.h
>  create mode 100644 tests/system-afxdp-macros.at
>  create mode 100644 tests/system-afxdp-testsuite.at
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 082438e09a33..11cc59efc881 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -10,6 +10,7 @@ DOC_SOURCE = \
>  	Documentation/intro/why-ovs.rst \
>  	Documentation/intro/install/index.rst \
>  	Documentation/intro/install/bash-completion.rst \
> +	Documentation/intro/install/afxdp.rst \
>  	Documentation/intro/install/debian.rst \
>  	Documentation/intro/install/documentation.rst \
>  	Documentation/intro/install/distributions.rst \
> diff --git a/Documentation/index.rst b/Documentation/index.rst
> index 46261235c732..aa9e7c49f179 100644
> --- a/Documentation/index.rst
> +++ b/Documentation/index.rst
> @@ -59,6 +59,7 @@ vSwitch? Start here.
>    :doc:`intro/install/windows` |
>    :doc:`intro/install/xenserver` |
>    :doc:`intro/install/dpdk` |
> +  :doc:`intro/install/afxdp` |
>    :doc:`Installation FAQs <faq/releases>`
>  
>  - **Tutorials:** :doc:`tutorials/faucet` |
> diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
> new file mode 100644
> index 000000000000..291df8d45020
> --- /dev/null
> +++ b/Documentation/intro/install/afxdp.rst
> @@ -0,0 +1,425 @@
> +..
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +
> +========================
> +Open vSwitch with AF_XDP
> +========================
> +
> +This document describes how to build and install Open vSwitch using
> +AF_XDP netdev.
> +
> +.. warning::
> +  The AF_XDP support of Open vSwitch is considered 'experimental',
> +  and it is not compiled in by default.
> +
> +
> +Introduction
> +------------
> +AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
> +built upon the eBPF and XDP technology.  It aims to have comparable
> +performance to DPDK but cooperate better with the existing kernel networking
> +stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
> +attached to the netdev, bypassing a couple of the Linux kernel's subsystems.
> +As a result, an AF_XDP socket shows much better performance than AF_PACKET.
> +For more details about AF_XDP, please see the Linux kernel's
> +Documentation/networking/af_xdp.rst.
> +
> +
> +AF_XDP Netdev
> +-------------
> +OVS has a couple of netdev types, e.g., system, tap, or
> +dpdk.  The AF_XDP feature adds a new netdev type called
> +"afxdp", and implements its configuration, packet reception,
> +and transmit functions.  Since the AF_XDP socket, called xsk,
> +operates in userspace, once ovs-vswitchd receives packets
> +from xsk, the afxdp netdev re-uses the existing userspace
> +dpif-netdev datapath.  As a result, most of the packet processing
> +happens in userspace instead of the Linux kernel.
> +
> +::
> +
> +              |   +-------------------+
> +              |   |    ovs-vswitchd   |<-->ovsdb-server
> +              |   +-------------------+
> +              |   |      ofproto      |<-->OpenFlow controllers
> +              |   +--------+-+--------+
> +              |   | netdev | |ofproto-|
> +    userspace |   +--------+ |  dpif  |
> +              |   | afxdp  | +--------+
> +              |   | netdev | |  dpif  |
> +              |   +---||---+ +--------+
> +              |       ||     |  dpif- |
> +              |       ||     | netdev |
> +              |_      ||     +--------+
> +                      ||
> +               _  +---||-----+--------+
> +              |   | AF_XDP prog +     |
> +       kernel |   |   xsk_map         |
> +              |_  +--------||---------+
> +                           ||
> +                        physical
> +                           NIC
> +
> +
> +Build requirements
> +------------------
> +
> +In addition to the requirements described in :doc:`general`, building Open
> +vSwitch with AF_XDP will require the following:
> +
> +- libbpf from kernel source tree (kernel 5.0.0 or later)
> +
> +- Linux kernel XDP support, with the following options (required)
> +
> +  * CONFIG_BPF=y
> +
> +  * CONFIG_BPF_SYSCALL=y
> +
> +  * CONFIG_XDP_SOCKETS=y
> +
> +
> +- The following optional Kconfig options are also recommended, but not
> +  required:
> +
> +  * CONFIG_BPF_JIT=y (Performance)
> +
> +  * CONFIG_HAVE_BPF_JIT=y (Performance)
> +
> +  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
> +
> +- Once your AF_XDP-enabled kernel is ready, if possible, run
> +  **./xdpsock -r -N -z -i <your device>** under linux/samples/bpf.
> +  This is an OVS-independent benchmark tool for AF_XDP.
> +  It makes sure your basic kernel requirements are met for AF_XDP.
> +
> +
> +Installing
> +----------
> +For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
> +First, clone a recent version of Linux bpf-next tree::
> +
> +  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
> +
> +Second, go into the Linux source directory and build libbpf in the tools
> +directory::
> +
> +  cd bpf-next/
> +  cd tools/lib/bpf/
> +  make && make install
> +  make install_headers
> +
> +.. note::
> +   Make sure xsk.h and bpf.h are installed in the system's include path,
> +   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
> +
> +Make sure the libbpf.so is installed correctly::
> +
> +  ldconfig
> +  ldconfig -p | grep libbpf
> +
> +Third, ensure the standard OVS requirements are installed and
> +bootstrap/configure the package::
> +
> +  ./boot.sh && ./configure --enable-afxdp
> +
> +Finally, build and install OVS::
> +
> +  make && make install
> +
> +To kick start end-to-end autotesting::
> +
> +  uname -a # make sure you have a 5.0+ kernel
> +  make check-afxdp TESTSUITEFLAGS='1'
> +
> +If a test case fails, check the log at::
> +
> +  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
> +
> +
> +Setup AF_XDP netdev
> +-------------------
> +Before running OVS with AF_XDP, make sure libbpf and libelf are
> +set up correctly::
> +
> +  ldd vswitchd/ovs-vswitchd
> +
> +Open vSwitch should be started using userspace datapath as described
> +in :doc:`general`::
> +
> +  ovs-vswitchd ...
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +Make sure your device driver supports AF_XDP.  To use 1 PMD (on core 4)
> +on a 1-queue (queue 0) device, configure these options: **pmd-cpu-mask,
> +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Or, use 4 pmds/cores and 4 queues by doing::
> +
> +  ethtool -L enp2s0 combined 4
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=4 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:1,1:2,2:4,3:5"
> +
> +.. note::
> +   pmd-rxq-affinity is optional. If not specified, system will auto-assign.
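
A quick way to sanity-check the mask/affinity pairs in this section: bit N of
pmd-cpu-mask selects core N, so 0x10 is core 4 and 0x36 is cores 1, 2, 4 and 5.
As far as I understand, an rxq pinned to a core outside the mask does not get
a PMD thread on that core.  Below is a minimal sketch of that check; the helper
names are hypothetical and not part of OVS or this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if bit 'core' is set in 'mask' (bit N selects core N). */
static inline bool
pmd_mask_has_core(uint64_t mask, unsigned int core)
{
    assert(core < 64);
    return (mask >> core) & 1;
}

/* Returns true if every core listed in 'cores' is selected by 'mask'. */
static inline bool
pmd_affinity_fits_mask(uint64_t mask, const unsigned int *cores, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!pmd_mask_has_core(mask, cores[i])) {
            return false;
        }
    }
    return true;
}
```

With mask 0x36, the core list {1, 2, 4, 5} fits, while a list containing
core 3 does not, because bit 3 is clear.
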
> +
> +To validate that the bridge has been successfully instantiated, use::
> +
> +  ovs-vsctl show
> +
> +The output should show something like::
> +
> +  Port "ens802f0"
> +   Interface "ens802f0"
> +      type: afxdp
> +      options: {n_rxq="1", xdpmode=drv}
> +
> +Otherwise, enable debugging by::
> +
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +
> +References
> +----------
> +Most of the design details are described in the paper presented at
> +Linux Plumbers Conference 2018, "Bringing the Power of eBPF to Open
> +vSwitch"[1], section 4, and slides[2][4].
> +"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
> +to AF_XDP current and future work.
> +
> +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> +
> +[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> +
> +[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> +
> +[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> +
> +
> +Performance Tuning
> +------------------
> +The name of the game is to keep your CPU running in userspace, allowing PMD
> +to keep polling the AF_XDP queues without any interference from the kernel.
> +
> +#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
> +   running cores, device plug-in slot)
> +
> +#. Isolate your CPUs using the isolcpus kernel parameter in the grub config.
> +
> +#. IRQs should not be assigned to the cores running pmd threads.
> +
> +#. The Spectre and Meltdown fixes increase the overhead of system calls.
> +
> +
> +Debugging performance issue
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +While running the traffic, use the Linux perf tool to see where your CPU
> +spends its cycles::
> +
> +  cd bpf-next/tools/perf
> +  make
> +  ./perf record -p `pidof ovs-vswitchd` sleep 10
> +  ./perf report
> +
> +Measure your system call rate by doing::
> +
> +  pstree -p `pidof ovs-vswitchd`
> +  strace -c -p <your pmd's PID>
> +
> +Or, use OVS pmd tool::
> +
> +  ovs-appctl dpif-netdev/pmd-stats-show
> +
> +
> +Example Script
> +--------------
> +
> +Below is a script using namespaces and veth peer::
> +
> +  #!/bin/bash
> +  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
> +    --disable-system --detach
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 \
> +    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
> +    fail-mode=secure datapath_type=netdev
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
> +
> +  ip netns add at_ns0
> +  ovs-appctl vlog/set netdev_afxdp::dbg
> +
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
> +
> +  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.1/24" dev p0
> +  ip link set dev p0 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns add at_ns1
> +  ip link add p1 type veth peer name afxdp-p1
> +  ip link set p1 netns at_ns1
> +  ip link set dev afxdp-p1 up
> +
> +  ovs-vsctl add-port br0 afxdp-p1 -- \
> +    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
> +  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
> +  ip addr add "10.1.1.2/24" dev p1
> +  ip link set dev p1 up
> +  NS_EXEC_HEREDOC
> +
> +  ip netns exec at_ns0 ping -i .2 10.1.1.2
> +
> +
> +Limitations/Known Issues
> +------------------------
> +#. Device's numa ID is always 0, need a way to find numa id from a netdev.
> +#. No QoS support because AF_XDP netdev bypasses the Linux TC layer.  A
> +   possible work-around is to use the OpenFlow meter action.
> +#. Adding an AF_XDP device to a bridge, removing it, and adding it again fails.
> +#. Most of the tests are done using i40e single port. Multiple ports and
> +   the ixgbe driver also need to be tested.
> +#. No latency test result (TODO items)
> +
> +
> +PVP using tap device
> +--------------------
> +Assume you have enp2s0 as physical nic, and a tap device connected to VM.
> +First, start OVS, then add physical port::
> +
> +  ethtool -L enp2s0 combined 1
> +  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> +  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
> +    options:n_rxq=1 options:xdpmode=drv \
> +    other_config:pmd-rxq-affinity="0:4"
> +
> +Start a VM with virtio and tap device::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +    -m 4096 \
> +    -cpu host,+x2apic -enable-kvm \
> +    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
> +      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
> +    -netdev type=tap,id=net0,vhost=on,queues=8 \
> +    -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +    -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Create OpenFlow rules::
> +
> +  ovs-vsctl add-port br0 tap0
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> +  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +
> +PVP using vhostuser device
> +--------------------------
> +First, build OVS with DPDK and AFXDP::
> +
> +  ./configure  --enable-afxdp --with-dpdk=<dpdk path>
> +  make -j4 && make install
> +
> +Create a vhost-user port from OVS::
> +
> +  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
> +  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
> +    other_config:pmd-cpu-mask=0xfff
> +  ovs-vsctl add-port br0 vhost-user-1 \
> +    -- set Interface vhost-user-1 type=dpdkvhostuser
> +
> +Start VM using vhost-user mode::
> +
> +  qemu-system-x86_64 -hda ubuntu1810.qcow \
> +   -m 4096 \
> +   -cpu host,+x2apic -enable-kvm \
> +   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
> +   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
> +   -device virtio-net-pci,mac=00:00:00:00:00:01,\
> +      netdev=mynet1,mq=on,vectors=10 \
> +   -object memory-backend-file,id=mem,size=4096M,\
> +      mem-path=/dev/hugepages,share=on \
> +   -numa node,memdev=mem -mem-prealloc -smp 2
> +
> +Set up the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
> +  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
> +
> +Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
> +
> +  ./xdp_rxq_info --dev ens3 --action XDP_DROP
> +  ./xdp_rxq_info --dev ens3 --action XDP_TX
> +
> +
> +PCP container using veth
> +------------------------
> +Create namespace and veth peer devices::
> +
> +  ip netns add at_ns0
> +  ip link add p0 type veth peer name afxdp-p0
> +  ip link set p0 netns at_ns0
> +  ip link set dev afxdp-p0 up
> +  ip netns exec at_ns0 ip link set dev p0 up
> +
> +Attach the veth port to br0 (linux kernel mode)::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 options:n_rxq=1
> +
> +Or, use AF_XDP with skb mode::
> +
> +  ovs-vsctl add-port br0 afxdp-p0 -- \
> +    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
> +
> +Setup the OpenFlow rules::
> +
> +  ovs-ofctl del-flows br0
> +  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
> +  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
> +
> +In the namespace, drop or bounce back the packets::
> +
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
> +  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
> +
> +
> +Bug Reporting
> +-------------
> +
> +Please report problems to dev at openvswitch.org.
> diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
> index 3193c736cf17..c27a9c9d16ff 100644
> --- a/Documentation/intro/install/index.rst
> +++ b/Documentation/intro/install/index.rst
> @@ -45,6 +45,7 @@ Installation from Source
>     xenserver
>     userspace
>     dpdk
> +   afxdp
>  
>  Installation from Packages
>  --------------------------
> diff --git a/acinclude.m4 b/acinclude.m4
> index 321a741985db..bb03b504a2a8 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -238,6 +238,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
>    ])
>  ])
>  
> +dnl OVS_CHECK_LINUX_AF_XDP
> +dnl
> +dnl Check both Linux kernel AF_XDP and libbpf support
> +AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
> +  AC_ARG_ENABLE([afxdp],
> +                [AC_HELP_STRING([--enable-afxdp], [Enable AF_XDP support])],
> +                [], [enable_afxdp=no])
> +  AC_MSG_CHECKING([whether AF_XDP is enabled])
> +  if test "$enable_afxdp" != yes; then
> +    AC_MSG_RESULT([no])
> +    AF_XDP_ENABLE=false
> +  else
> +    AC_MSG_RESULT([yes])
> +    AF_XDP_ENABLE=true
> +
> +    AC_CHECK_HEADER([bpf/libbpf.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([linux/if_xdp.h], [],
> +      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([bpf/xsk.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
> +
> +    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
> +      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
> +
> +    AC_DEFINE([HAVE_AF_XDP], [1],
> +              [Define to 1 if AF_XDP support is available and enabled.])
> +    LIBBPF_LDADD=" -lbpf -lelf"
> +    AC_SUBST([LIBBPF_LDADD])
> +  fi
> +  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
> +])
> +
>  dnl OVS_CHECK_DPDK
>  dnl
>  dnl Configure DPDK source tree
> diff --git a/configure.ac b/configure.ac
> index a9f0a06dc140..36ad246203db 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -98,6 +98,7 @@ OVS_CHECK_SPHINX
>  OVS_CHECK_DOT
>  OVS_CHECK_IF_DL
>  OVS_CHECK_STRTOK_R
> +OVS_CHECK_LINUX_AF_XDP
>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
>    [], [], [[#include <sys/stat.h>]])
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 1b89cac8c3a2..9b75e47ba396 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -14,6 +14,10 @@ if WIN32
>  lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
> +endif
> +
>  lib_libopenvswitch_la_LDFLAGS = \
>          $(OVS_LTINFO) \
>          -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
> @@ -394,6 +398,7 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/if-notifier.h \
>  	lib/netdev-linux.c \
>  	lib/netdev-linux.h \
> +	lib/netdev-linux-private.h \
>  	lib/netdev-offload-tc.c \
>  	lib/netlink-conntrack.c \
>  	lib/netlink-conntrack.h \
> @@ -410,6 +415,15 @@ lib_libopenvswitch_la_SOURCES += \
>  	lib/tc.h
>  endif
>  
> +if HAVE_AF_XDP
> +lib_libopenvswitch_la_SOURCES += \
> +	lib/xdpsock.c \
> +	lib/xdpsock.h \
> +	lib/netdev-afxdp.c \
> +	lib/netdev-afxdp.h \
> +	lib/spinlock.h
> +endif
> +
>  if DPDK_NETDEV
>  lib_libopenvswitch_la_SOURCES += \
>  	lib/dpdk.c \
> diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> index 0976a35e758b..e6a7947076b4 100644
> --- a/lib/dp-packet.c
> +++ b/lib/dp-packet.c
> @@ -19,6 +19,7 @@
>  #include <string.h>
>  
>  #include "dp-packet.h"
> +#include "netdev-afxdp.h"
>  #include "netdev-dpdk.h"
>  #include "openvswitch/dynamic-string.h"
>  #include "util.h"
> @@ -59,6 +60,27 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
>      dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
>  }
>  
> +#if HAVE_AF_XDP
> +/* Initializes 'b' as an empty dp_packet that contains the 'allocated'
> + * bytes of memory starting at 'base', which lies inside the AF_XDP umem.
> + */
> +void
> +dp_packet_use_afxdp(struct dp_packet *b, void *base, size_t allocated)
> +{
> +    dp_packet_set_base(b, base);
> +    dp_packet_set_data(b, base);
> +    dp_packet_set_size(b, 0);
> +
> +    dp_packet_set_allocated(b, allocated);
> +    b->source = DPBUF_AFXDP;
> +    dp_packet_reset_offsets(b);
> +    pkt_metadata_init(&b->md, 0);
> +    dp_packet_reset_cutlen(b);
> +    dp_packet_reset_offload(b);
> +    b->packet_type = htonl(PT_ETH);
> +}
> +#endif
> +
>  /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
>   * memory starting at 'base'.  'base' should point to a buffer on the stack.
>   * (Nothing actually relies on 'base' being allocated on the stack.  It could
> @@ -122,6 +144,8 @@ dp_packet_uninit(struct dp_packet *b)
>               * created as a dp_packet */
>              free_dpdk_buf((struct dp_packet*) b);
>  #endif
> +        } else if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
>          }
>      }
>  }
> @@ -248,6 +272,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
>      case DPBUF_STACK:
>          OVS_NOT_REACHED();
>  
> +    case DPBUF_AFXDP:
> +        OVS_NOT_REACHED();
> +
>      case DPBUF_STUB:
>          b->source = DPBUF_MALLOC;
>          new_base = xmalloc(new_allocated);
> @@ -433,6 +460,7 @@ dp_packet_steal_data(struct dp_packet *b)
>  {
>      void *p;
>      ovs_assert(b->source != DPBUF_DPDK);
> +    ovs_assert(b->source != DPBUF_AFXDP);
>  
>      if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
>          p = dp_packet_data(b);
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index a5e9ade1244a..e3438226e360 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -25,6 +25,7 @@
>  #include <rte_mbuf.h>
>  #endif
>  
> +#include "netdev-afxdp.h"
>  #include "netdev-dpdk.h"
>  #include "openvswitch/list.h"
>  #include "packets.h"
> @@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
>      DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
>                                  * ref to dp_packet_init_dpdk() in dp-packet.c.
>                                  */
> +    DPBUF_AFXDP,               /* buffer data from XDP frame */
>  };
>  
>  #define DP_PACKET_CONTEXT_SIZE 64
> @@ -89,6 +91,13 @@ struct dp_packet {
>      };
>  };
>  
> +#if HAVE_AF_XDP
> +struct dp_packet_afxdp {
> +    struct umem_pool *mpool;
> +    struct dp_packet packet;
> +};
> +#endif
> +
>  static inline void *dp_packet_data(const struct dp_packet *);
>  static inline void dp_packet_set_data(struct dp_packet *, void *);
>  static inline void *dp_packet_base(const struct dp_packet *);
> @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *);
>  void dp_packet_use(struct dp_packet *, void *, size_t);
>  void dp_packet_use_stub(struct dp_packet *, void *, size_t);
>  void dp_packet_use_const(struct dp_packet *, const void *, size_t);
> -
> +#if HAVE_AF_XDP
> +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t);
> +#endif
>  void dp_packet_init_dpdk(struct dp_packet *);
>  
>  void dp_packet_init(struct dp_packet *, size_t);
> @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b)
>              return;
>          }
>  
> +        if (b->source == DPBUF_AFXDP) {
> +            free_afxdp_buf(b);
> +            return;
> +        }
> +
>          dp_packet_uninit(b);
>          free(b);
>      }
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 859c05613ddf..6b6dfda7db1c 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -21,6 +21,7 @@
>  #include <stddef.h>
>  #include <stdint.h>
>  #include <string.h>
> +#include <time.h>
>  #include <math.h>
>  
>  #ifdef DPDK_NETDEV
> @@ -186,6 +187,24 @@ struct pmd_perf_stats {
>      char *log_reason;
>  };
>  
> +#ifdef __linux__
> +static inline uint64_t
> +rdtsc_syscall(struct pmd_perf_stats *s)
> +{
> +    struct timespec val;
> +    uint64_t v;
> +
> +    if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) {
> +        return s->last_tsc;
> +    }
> +
> +    v  = (uint64_t) val.tv_sec * 1000000000LL;
> +    v += (uint64_t) val.tv_nsec;
> +
> +    return s->last_tsc = v;
> +}
> +#endif
> +
>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
>   * These functions are intended to be invoked in the context of pmd threads. */
>  
> @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s)
>  {
>  #ifdef DPDK_NETDEV
>      return s->last_tsc = rte_get_tsc_cycles();
> +#elif !defined(_MSC_VER) && defined(__x86_64__)
> +    uint32_t h, l;
> +    asm volatile("rdtsc" : "=a" (l), "=d" (h));
> +
> +    return s->last_tsc = ((uint64_t) h << 32) | l;
> +#elif defined(__linux__)
> +    return rdtsc_syscall(s);
>  #else
>      return s->last_tsc = 0;
>  #endif
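
One thing to keep in mind with the fallback above: on non-x86 Linux the
counter is in nanoseconds from CLOCK_MONOTONIC_RAW rather than in TSC cycles,
so "cycles" reported there are really nanoseconds.  The conversion can be
sketched as a pure helper; timespec_to_ns() is a hypothetical name, not
something in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Flattens a timespec into nanoseconds, mirroring the arithmetic in
 * rdtsc_syscall().  Assumes non-negative fields, which is what a
 * monotonic clock reading is expected to contain. */
static inline uint64_t
timespec_to_ns(const struct timespec *ts)
{
    assert(ts->tv_sec >= 0 && ts->tv_nsec >= 0);
    return (uint64_t) ts->tv_sec * UINT64_C(1000000000)
           + (uint64_t) ts->tv_nsec;
}
```

Keeping the arithmetic in unsigned 64-bit avoids any dependence on the width
of time_t on the target.
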
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> new file mode 100644
> index 000000000000..33d8612153d5
> --- /dev/null
> +++ b/lib/netdev-afxdp.c
> @@ -0,0 +1,891 @@
> +/*
> + * Copyright (c) 2018, 2019 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "netdev-linux-private.h"
> +#include "netdev-linux.h"
> +#include "netdev-afxdp.h"
> +
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if_xdp.h>
> +#include <net/if.h>
> +#include <stdlib.h>
> +#include <sys/resource.h>
> +#include <sys/socket.h>
> +#include <sys/types.h>
> +#include <unistd.h>
> +
> +#include "coverage.h"
> +#include "dp-packet.h"
> +#include "dpif-netdev.h"
> +#include "openvswitch/dynamic-string.h"
> +#include "openvswitch/vlog.h"
> +#include "packets.h"
> +#include "socket-util.h"
> +#include "spinlock.h"
> +#include "util.h"
> +#include "xdpsock.h"
> +
> +#ifndef SOL_XDP
> +#define SOL_XDP 283
> +#endif
> +
> +COVERAGE_DEFINE(afxdp_cq_empty);
> +COVERAGE_DEFINE(afxdp_fq_full);
> +
> +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> +
> +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base))
> +#define UMEM2XPKT(base, i) \
> +                  ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \
> +                               i * sizeof(struct dp_packet_afxdp))
> +
> +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id,
> +                                             int mode);
> +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode);
> +static void xsk_destroy(struct xsk_socket_info *xsk);
> +static int xsk_configure_all(struct netdev *netdev);
> +static void xsk_destroy_all(struct netdev *netdev);
> +
> +static struct xsk_umem_info *
> +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode)
> +{
> +    struct xsk_umem_config uconfig OVS_UNUSED;
> +    struct xsk_umem_info *umem;
> +    int ret;
> +    int i;
> +
> +    umem = xcalloc(1, sizeof *umem);
> +    ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
> +                           NULL);
> +    if (ret) {
> +        VLOG_ERR("xsk_umem__create failed (%s) mode: %s",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV");
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    umem->buffer = buffer;
> +
> +    /* set-up umem pool */
> +    if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("umem_pool_init failed");
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct umem_elem *elem;
> +
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + i * FRAME_SIZE);
> +        umem_elem_push(&umem->mpool, elem);
> +    }
> +
> +    /* set-up metadata */
> +    if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) {
> +        VLOG_ERR("xpacket_pool_init failed");
> +        umem_pool_cleanup(&umem->mpool);
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free(umem);
> +        return NULL;
> +    }
> +
> +    VLOG_DBG("%s xpacket pool from %p to %p", __func__,
> +              umem->xpool.array,
> +              (char *)umem->xpool.array +
> +              NUM_FRAMES * sizeof(struct dp_packet_afxdp));
> +
> +    for (i = NUM_FRAMES - 1; i >= 0; i--) {
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        xpacket = UMEM2XPKT(umem->xpool.array, i);
> +        xpacket->mpool = &umem->mpool;
> +
> +        packet = &xpacket->packet;
> +        packet->source = DPBUF_AFXDP;
> +    }
> +
> +    return umem;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex,
> +                     uint32_t queue_id, int xdpmode)
> +{
> +    struct xsk_socket_config cfg;
> +    struct xsk_socket_info *xsk;
> +    char devname[IF_NAMESIZE];
> +    uint32_t idx = 0, prog_id;
> +    int ret;
> +    int i;
> +
> +    xsk = xcalloc(1, sizeof(*xsk));
> +    xsk->umem = umem;
> +    cfg.rx_size = CONS_NUM_DESCS;
> +    cfg.tx_size = PROD_NUM_DESCS;
> +    cfg.libbpf_flags = 0;
> +
> +    if (xdpmode == XDP_ZEROCOPY) {
> +        cfg.bind_flags = XDP_ZEROCOPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +    } else {
> +        cfg.bind_flags = XDP_COPY;
> +        cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +    }
> +
> +    if (if_indextoname(ifindex, devname) == NULL) {
> +        VLOG_ERR("ifindex %d to devname failed (%s)",
> +                 ifindex, ovs_strerror(errno));
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem,
> +                             &xsk->rx, &xsk->tx, &cfg);
> +    if (ret) {
> +        VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d",
> +                 ovs_strerror(errno),
> +                 xdpmode == XDP_COPY ? "SKB": "DRV",
> +                 queue_id);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    /* Make sure the built-in AF_XDP program is loaded */
> +    ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags);
> +    if (ret) {
> +        VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno));
> +        xsk_socket__delete(xsk->xsk);
> +        free(xsk);
> +        return NULL;
> +    }
> +
> +    while (!xsk_ring_prod__reserve(&xsk->umem->fq,
> +                                   PROD_NUM_DESCS, &idx)) {
> +        VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue");
> +    }
> +
> +    for (i = 0;
> +         i < PROD_NUM_DESCS * FRAME_SIZE;
> +         i += FRAME_SIZE) {
> +        struct umem_elem *elem;
> +        uint64_t addr;
> +
> +        elem = umem_elem_pop(&xsk->umem->mpool);
> +        addr = UMEM2DESC(elem, xsk->umem->buffer);
> +
> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr;
> +    }
> +
> +    xsk_ring_prod__submit(&xsk->umem->fq,
> +                          PROD_NUM_DESCS);
> +    return xsk;
> +}
> +
> +static struct xsk_socket_info *
> +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode)
> +{
> +    struct xsk_socket_info *xsk;
> +    struct xsk_umem_info *umem;
> +    void *bufs;
> +
> +    /* umem memory region */
> +    bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE);
> +    memset(bufs, 0, NUM_FRAMES * FRAME_SIZE);
> +
> +    /* create AF_XDP socket */
> +    umem = xsk_configure_umem(bufs,
> +                              NUM_FRAMES * FRAME_SIZE,
> +                              xdpmode);
> +    if (!umem) {
> +        free_pagealign(bufs);
> +        return NULL;
> +    }
> +
> +    xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode);
> +    if (!xsk) {
> +        /* clean up umem and xpacket pool */
> +        if (xsk_umem__delete(umem->umem)) {
> +            VLOG_ERR("xsk_umem__delete failed");
> +        }
> +        free_pagealign(bufs);
> +        umem_pool_cleanup(&umem->mpool);
> +        xpacket_pool_cleanup(&umem->xpool);
> +        free(umem);
> +    }
> +    return xsk;
> +}
> +
> +static int
> +xsk_configure_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk_info;
> +    int i, ifindex, n_rxq;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    n_rxq = netdev_n_rxq(netdev);
> +    dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *));
> +
> +    /* configure each queue */
> +    for (i = 0; i < n_rxq; i++) {
> +        VLOG_INFO("%s configure queue %d mode %s", __func__, i,
> +                  dev->xdpmode == XDP_COPY ? "SKB" : "DRV");
> +        xsk_info = xsk_configure(ifindex, i, dev->xdpmode);
> +        if (!xsk_info) {
> +            VLOG_ERR("failed to create AF_XDP socket on queue %d", i);
> +            dev->xsks[i] = NULL;
> +            goto err;
> +        }
> +        dev->xsks[i] = xsk_info;
> +        xsk_info->rx_dropped = 0;
> +        xsk_info->tx_dropped = 0;
> +    }
> +
> +    return 0;
> +
> +err:
> +    xsk_destroy_all(netdev);
> +    return EINVAL;
> +}
> +
> +static void
> +xsk_destroy(struct xsk_socket_info *xsk_info)
> +{
> +    struct xsk_umem *umem;
> +
> +    xsk_socket__delete(xsk_info->xsk);
> +    xsk_info->xsk = NULL;
> +
> +    umem = xsk_info->umem->umem;
> +    if (xsk_umem__delete(umem)) {
> +        VLOG_ERR("xsk_umem__delete failed");
> +    }
> +
> +    /* free the packet buffer */
> +    free_pagealign(xsk_info->umem->buffer);
> +
> +    /* cleanup umem pool */
> +    umem_pool_cleanup(&xsk_info->umem->mpool);
> +
> +    /* cleanup metadata pool */
> +    xpacket_pool_cleanup(&xsk_info->umem->xpool);
> +
> +    free(xsk_info->umem);
> +    free(xsk_info);
> +}
> +
> +static void
> +xsk_destroy_all(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int i, ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    for (i = 0; i < netdev_n_rxq(netdev); i++) {
> +        if (dev->xsks && dev->xsks[i]) {
> +            VLOG_INFO("destroy xsk[%d]", i);
> +            xsk_destroy(dev->xsks[i]);
> +            dev->xsks[i] = NULL;
> +        }
> +    }
> +
> +    VLOG_INFO("remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +
> +    free(dev->xsks);
> +}
> +
> +static inline void OVS_UNUSED
> +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) {
> +    struct xdp_statistics stat;
> +    socklen_t optlen;
> +
> +    optlen = sizeof stat;
> +    ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS,
> +               &stat, &optlen) == 0);
> +
> +    VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu",
> +                stat.rx_dropped,
> +                stat.rx_invalid_descs,
> +                stat.tx_invalid_descs);
> +}
> +
> +int
> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
> +                        char **errp OVS_UNUSED)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    const char *str_xdpmode;
> +    int xdpmode, new_n_rxq;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> +    if (new_n_rxq > MAX_XSKQ) {
> +        ovs_mutex_unlock(&dev->mutex);
> +        VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
> +                 netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
> +        return EINVAL;
> +    }
> +
> +    str_xdpmode = smap_get_def(args, "xdpmode", "skb");
> +    if (!strcasecmp(str_xdpmode, "drv")) {
> +        xdpmode = XDP_ZEROCOPY;
> +    } else if (!strcasecmp(str_xdpmode, "skb")) {
> +        xdpmode = XDP_COPY;
> +    } else {
> +        VLOG_ERR("%s: Incorrect xdpmode (%s).",
> +                 netdev_get_name(netdev), str_xdpmode);
> +        ovs_mutex_unlock(&dev->mutex);
> +        return EINVAL;
> +    }
> +
> +    if (dev->requested_n_rxq != new_n_rxq
> +        || dev->requested_xdpmode != xdpmode) {
> +        dev->requested_n_rxq = new_n_rxq;
> +        dev->requested_xdpmode = xdpmode;
> +        netdev_request_reconfigure(netdev);
> +    }
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +int
> +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    smap_add_format(args, "n_rxq", "%d", netdev->n_rxq);
> +    smap_add_format(args, "xdpmode", "%s",
> +        dev->xdp_bind_flags == XDP_ZEROCOPY ? "drv" : "skb");
> +    ovs_mutex_unlock(&dev->mutex);
> +    return 0;
> +}
> +
> +static void
> +netdev_afxdp_alloc_txq(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int n_txqs = netdev_n_rxq(netdev);
> +    int i;
> +
> +    dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spinlock));
> +
> +    for (i = 0; i < n_txqs; i++) {
> +        ovs_spinlock_init(&dev->tx_locks[i]);
> +    }
> +}
> +
> +int
> +netdev_afxdp_reconfigure(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +    int err = 0;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +
> +    if (netdev->n_rxq == dev->requested_n_rxq
> +        && dev->xdpmode == dev->requested_xdpmode) {
> +        goto out;
> +    }
> +
> +    xsk_destroy_all(netdev);
> +    free(dev->tx_locks);
> +
> +    netdev->n_rxq = dev->requested_n_rxq;
> +    netdev_afxdp_alloc_txq(netdev);
> +
> +    if (dev->requested_xdpmode == XDP_ZEROCOPY) {
> +        VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev));
> +        /* From SKB mode to DRV mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +        dev->xdp_bind_flags = XDP_ZEROCOPY;
> +        dev->xdpmode = XDP_ZEROCOPY;
> +
> +        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +            VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s",
> +                      ovs_strerror(errno));
> +        }
> +    } else {
> +        VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev));
> +        /* From DRV mode to SKB mode */
> +        dev->xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +        dev->xdp_bind_flags = XDP_COPY;
> +        dev->xdpmode = XDP_COPY;
> +        /* TODO: set rlimit back to previous value
> +         * when no device is in DRV mode.
> +         */
> +    }
> +
> +    err = xsk_configure_all(netdev);
> +    if (err) {
> +        VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev));
> +    }
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    return err;
> +}
> +
> +int
> +netdev_afxdp_get_numa_id(const struct netdev *netdev)
> +{
> +    /* FIXME: Get netdev's PCIe device ID, then find
> +     * its NUMA node id.
> +     */
> +    VLOG_INFO("FIXME: Device %s always use numa id 0",
> +              netdev_get_name(netdev));
> +    return 0;
> +}
> +
> +static void
> +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode)
> +{
> +    uint32_t prog_id = 0;
> +    uint32_t flags;
> +
> +    /* remove_xdp_program() */
> +    if (xdpmode == XDP_COPY) {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
> +        VLOG_INFO("%s copy mode", __func__);
> +    } else {
> +        flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
> +        VLOG_INFO("%s drv mode", __func__);
> +    }
> +
> +    if (bpf_get_link_xdp_id(ifindex, &prog_id, flags)) {
> +        VLOG_WARN("get xdp program id fails");
> +    }
> +    bpf_set_link_xdp_fd(ifindex, -1, XDP_FLAGS_UPDATE_IF_NOEXIST);
> +}
> +
> +void
> +signal_remove_xdp(struct netdev *netdev)
> +{
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    int ifindex;
> +
> +    ifindex = linux_get_ifindex(netdev_get_name(netdev));
> +
> +    VLOG_WARN("force remove xdp program");
> +    xsk_remove_xdp_program(ifindex, dev->xdpmode);
> +}
> +
> +static struct dp_packet_afxdp *
> +dp_packet_cast_afxdp(const struct dp_packet *d)
> +{
> +    ovs_assert(d->source == DPBUF_AFXDP);
> +    return CONTAINER_OF(d, struct dp_packet_afxdp, packet);
> +}
> +
> +static inline void
> +prepare_fill_queue(struct xsk_socket_info *xsk_info)
> +{
> +    struct umem_elem *elems[BATCH_SIZE];
> +    struct xsk_umem_info *umem;
> +    unsigned int idx_fq;
> +    int nb_free;
> +    int i, ret;
> +
> +    umem = xsk_info->umem;
> +
> +    nb_free = PROD_NUM_DESCS / 2;
> +    if (xsk_prod_nb_free(&umem->fq, nb_free) < nb_free) {
> +        return;
> +    }


Why are you using 'PROD_NUM_DESCS / 2' here?
IIUC, this keeps the fill queue at least half-loaded. Wouldn't it be
better to use BATCH_SIZE instead?
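
To illustrate the difference between the two thresholds, here is a toy
model of the free-slot check. This is only a sketch: 'struct toy_ring',
the helper names and the values 128 and 32 are my assumptions, not the
libbpf ring or the real constants from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a producer ring; NOT the libbpf structure. */
struct toy_ring {
    uint32_t prod;   /* Entries produced so far. */
    uint32_t cons;   /* Entries consumed so far. */
    uint32_t size;   /* Ring capacity. */
};

/* Assumed values, for illustration only. */
#define TOY_PROD_NUM_DESCS 128u
#define TOY_BATCH_SIZE 32u

/* Number of slots that can still be filled. */
static uint32_t
toy_ring_free(const struct toy_ring *r)
{
    return r->size - (r->prod - r->cons);
}

/* Refill only when at least 'threshold' slots are free. */
static bool
toy_should_refill(const struct toy_ring *r, uint32_t threshold)
{
    return toy_ring_free(r) >= threshold;
}
```

With 38 free slots, a BATCH_SIZE threshold would already refill one
batch, while the 'PROD_NUM_DESCS / 2' threshold waits until 64 slots
are free.
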


> +
> +    ret = umem_elem_pop_n(&umem->mpool, BATCH_SIZE, (void **)elems);
> +    if (OVS_UNLIKELY(ret)) {
> +        return;
> +    }
> +
> +    if (!xsk_ring_prod__reserve(&umem->fq, BATCH_SIZE, &idx_fq)) {
> +        umem_elem_push_n(&umem->mpool, BATCH_SIZE, (void **)elems);
> +        COVERAGE_INC(afxdp_fq_full);
> +        return;
> +    }
> +
> +    for (i = 0; i < BATCH_SIZE; i++) {
> +        uint64_t index;
> +        struct umem_elem *elem;
> +
> +        elem = elems[i];
> +        index = (uint64_t)((char *)elem - (char *)umem->buffer);
> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
> +        *xsk_ring_prod__fill_addr(&umem->fq, idx_fq) = index;
> +
> +        idx_fq++;
> +    }
> +    xsk_ring_prod__submit(&umem->fq, BATCH_SIZE);
> +}
> +
> +int
> +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> +                      int *qfill)
> +{
> +    struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> +    struct netdev *netdev = rx->up.netdev;
> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
> +    struct xsk_socket_info *xsk_info;
> +    struct xsk_umem_info *umem;
> +    uint32_t idx_rx = 0;
> +    int qid = rxq_->queue_id;
> +    unsigned int rcvd, i;
> +
> +    xsk_info = dev->xsks[qid];
> +    if (!xsk_info || !xsk_info->xsk) {
> +        return 0;

This needs to return EAGAIN instead of 0: no packets were received.

> +    }
> +
> +    prepare_fill_queue(xsk_info);
> +
> +    umem = xsk_info->umem;
> +    rx->fd = xsk_socket__fd(xsk_info->xsk);
> +
> +    rcvd = xsk_ring_cons__peek(&xsk_info->rx, BATCH_SIZE, &idx_rx);
> +    if (!rcvd) {
> +        return 0;

This needs to return EAGAIN instead of 0: no packets were received.
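
As I understand the rxq_recv contract, 0 means "batch filled" and
EAGAIN means "nothing ready right now". A minimal sketch of that
convention follows; 'toy_rxq_recv' and its parameters are hypothetical
and only model the return values, not the real receive path.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_BATCH_SIZE 32   /* Assumed batch limit, for illustration. */

/* Returns 0 and sets '*n_received' if packets were received,
 * EAGAIN if nothing was ready. */
static int
toy_rxq_recv(size_t n_ready, size_t *n_received)
{
    if (!n_ready) {
        *n_received = 0;
        return EAGAIN;
    }

    *n_received = n_ready < TOY_BATCH_SIZE ? n_ready : TOY_BATCH_SIZE;
    return 0;
}
```
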

> +    }
> +
> +    /* Setup a dp_packet batch from descriptors in RX queue */
> +    for (i = 0; i < rcvd; i++) {
> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->addr;
> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk_info->rx, idx_rx)->len;
> +        char *pkt = xsk_umem__get_data(umem->buffer, addr);
> +        uint64_t index;
> +
> +        struct dp_packet_afxdp *xpacket;
> +        struct dp_packet *packet;
> +
> +        index = addr >> FRAME_SHIFT;
> +        xpacket = UMEM2XPKT(umem->xpool.array, index);
> +        packet = &xpacket->packet;
> +
> +        /* Initialize the struct dp_packet */
> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
> +        dp_packet_set_size(packet, len);
> +
> +        /* Add packet into batch, increase batch->count */
> +        dp_packet_batch_add(batch, packet);
> +
> +        idx_rx++;
> +    }
> +    /* Release the RX queue */
> +    xsk_ring_cons__release(&xsk_info->rx, rcvd);
> +
> +    if (qfill) {
> +        /* TODO: return the number of remaining packets in the queue. */
> +        *qfill = 0;
> +    }
> +
> +#ifdef AFXDP_DEBUG
> +    log_xsk_stat(xsk_info);
> +#endif
> +    return 0;
> +}
> +
> +static inline int
> +kick_tx(struct xsk_socket_info *xsk_info)
> +{
> +    int ret;
> +
> +    if (!xsk_info->outstanding_tx) {
> +        return 0;
> +    }
> +
> +    /* This causes system call into kernel's xsk_sendmsg, and
> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
> +     */
> +    ret = sendto(xsk_socket__fd(xsk_info->xsk), NULL, 0, MSG_DONTWAIT,
> +                                NULL, 0);
> +    if (OVS_UNLIKELY(ret < 0)) {
> +        if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
> +            return errno;
> +        }
> +    }
> +    /* no error, or EBUSY or EAGAIN */
> +    return 0;
> +}
> +
> +void
> +free_afxdp_buf(struct dp_packet *p)
> +{
> +    struct dp_packet_afxdp *xpacket;
> +    uintptr_t addr;
> +
> +    xpacket = dp_packet_cast_afxdp(p);
> +    if (xpacket->mpool) {
> +        void *base = dp_packet_base(p);
> +
> +        addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +        umem_elem_push(xpacket->mpool, (void *)addr);
> +    }
> +}
> +
> +static void
> +free_afxdp_buf_batch(struct dp_packet_batch *batch)
> +{
> +    struct dp_packet_afxdp *xpacket = NULL;
> +    struct dp_packet *packet;
> +    void *elems[BATCH_SIZE];
> +    uintptr_t addr;
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (xpacket->mpool) {


The check above seems useless. Also, if any packet is skipped, its
slot in 'elems' stays uninitialized and we'll push a garbage pointer
to the mpool.

If you're worried about the value, you can just assert:

            ovs_assert(xpacket->mpool);
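
Roughly, the loop could then fill every slot unconditionally. The
following is a standalone sketch with made-up types ('struct toy_pkt',
'toy_collect_frames') and an assumed 2048-byte frame size; plain
assert() stands in for ovs_assert().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FRAME_SIZE 2048u   /* Assumed frame size. */
#define TOY_FRAME_MASK ((uintptr_t) TOY_FRAME_SIZE - 1)

/* Hypothetical stand-in for struct dp_packet_afxdp. */
struct toy_pkt {
    void *mpool;   /* Pool the frame belongs to. */
    void *base;    /* Start of packet data inside the frame. */
};

/* Stores the frame start of every packet in 'elems'.  No slot is
 * skipped, so 'elems[0..n-1]' is fully initialized on return. */
static size_t
toy_collect_frames(const struct toy_pkt *pkts, size_t n, void **elems)
{
    for (size_t i = 0; i < n; i++) {
        assert(pkts[i].mpool);
        elems[i] = (void *) ((uintptr_t) pkts[i].base & ~TOY_FRAME_MASK);
    }
    return n;
}
```
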

> +            void *base = dp_packet_base(packet);
> +
> +            addr = (uintptr_t)base & (~FRAME_SHIFT_MASK);
> +            elems[i] = (void *)addr;
> +        }
> +    }
> +    umem_elem_push_n(xpacket->mpool, batch->count, elems);
> +    dp_packet_batch_init(batch);
> +}
> +
> +static inline bool
> +check_free_batch(struct dp_packet_batch *batch)
> +{
> +    struct umem_pool *first_mpool = NULL;
> +    struct dp_packet_afxdp *xpacket;
> +    struct dp_packet *packet;
> +
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> +        if (packet->source != DPBUF_AFXDP) {
> +            return false;
> +        }
> +        xpacket = dp_packet_cast_afxdp(packet);
> +        if (i == 0) {
> +            first_mpool = xpacket->mpool;
> +            continue;
> +        }
> +        if (xpacket->mpool != first_mpool) {
> +            return false;
> +        }
> +    }
> +    /* All packets are DPBUF_AFXDP and from the same mpool */
> +    return true;
> +}
> +
> +static inline void
> +afxdp_complete_tx(struct xsk_socket_info *xsk_info)
> +{
> +    struct umem_elem *elems_push[BATCH_SIZE];
> +    struct xsk_umem_info *umem;
> +    uint32_t idx_cq = 0;
> +    int tx_to_free = 0;
> +    int tx_done, j;
> +
> +    umem = xsk_info->umem;
> +    tx_done = xsk_ring_cons__peek(&umem->cq, BATCH_SIZE, &idx_cq);
> +
> +    /* Recycle back to umem pool */
> +    for (j = 0; j < tx_done; j++) {
> +        struct umem_elem *elem;
> +        uint64_t *addr;
> +
> +        addr = (uint64_t *)xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
> +        if (*addr == 0) {

'addr' is an offset from 'umem->buffer', so zero is a valid value (the
first frame). Maybe it's better to use UINT64_MAX as the marker instead?
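
A small sketch of the idea, using a plain array instead of the real
completion ring. 'toy_reclaim' and 'TOY_ADDR_PUSHED' are hypothetical
names; the point is only that offset 0 is reclaimed like any other
frame once the marker is UINT64_MAX.

```c
#include <stddef.h>
#include <stdint.h>

/* No valid umem offset can have this value. */
#define TOY_ADDR_PUSHED UINT64_MAX

/* Copies not-yet-reclaimed offsets from 'cq' to 'out', marks them as
 * pushed, and returns how many were reclaimed. */
static size_t
toy_reclaim(uint64_t *cq, size_t n, uint64_t *out)
{
    size_t n_free = 0;

    for (size_t i = 0; i < n; i++) {
        if (cq[i] == TOY_ADDR_PUSHED) {
            continue;   /* Already pushed back to the pool. */
        }
        out[n_free++] = cq[i];
        cq[i] = TOY_ADDR_PUSHED;
    }
    return n_free;
}
```
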

> +            /* The elem has been pushed already */
> +            continue;
> +        }
> +        elem = ALIGNED_CAST(struct umem_elem *,
> +                            (char *)umem->buffer + *addr);
> +        elems_push[tx_to_free] = elem;
> +        *addr = 0; /* Mark as pushed */
> +        tx_to_free++;
> +    }
> +
> +    umem_elem_push_n(&umem->mpool, tx_to_free, (void **)elems_push);
> +
> +    if (tx_done > 0) {
> +        xsk_ring_cons__release(&umem->cq, tx_done);
> +        xsk_info->outstanding_tx -= tx_done;

We should probably subtract 'tx_to_free' instead and do this
outside of the 'if'.
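
Something along these lines. This is an untested sketch with a
hypothetical 'struct toy_xsk'; the 'released' counter only stands in
for the xsk_ring_cons__release() call.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct xsk_socket_info. */
struct toy_xsk {
    uint32_t outstanding_tx;   /* Frames still owned by the kernel. */
    uint32_t released;         /* Completion entries released so far. */
};

static void
toy_account_tx(struct toy_xsk *xsk, uint32_t tx_done, uint32_t tx_to_free)
{
    if (tx_done > 0) {
        xsk->released += tx_done;
    }
    /* Only frames actually returned to the pool are no longer
     * outstanding; this is a no-op when nothing was reclaimed. */
    xsk->outstanding_tx -= tx_to_free;
}
```
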

> +    } else {
> +        COVERAGE_INC(afxdp_cq_empty);
> +    }
> +}

