[ovs-dev] [PATCHv8] netdev-afxdp: add new netdev type for AF_XDP.
William Tu
u9012063 at gmail.com
Sat May 18 01:16:54 UTC 2019
Hi Eelco,
Thanks for all the feedbacks. There are some issues in driver, some
in libbpf, and some in my implementation. I will work on it ASAP.
On Fri, May 17, 2019 at 3:23 AM Eelco Chaudron <echaudro at redhat.com> wrote:
>
> Hi William,
>
> First a list of issues I found during some basic testing...
>
> - When I restart or stop OVS (using the systemctl interface as found in
> RHEL) it does not clean up the BFP program causing the restart to fail:
>
> 2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1
> reconfig fails
> 2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set
> interface eno1 new configuration
>
> I need to manually run "ip link set dev eno1 xdp off" to make it
> recover.
I think this is a bug in libbpf, see
[PATCH bpf 0/2] libbpf: fixes for AF_XDP teardown
>
>
> - When I remove a bridge, I get an emer in the revalidator:
>
> 2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp program
> 2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111:
> assertion !fd != !wevent failed in poll_create_node()
>
> Easy to replicate with this:
>
> $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0
> datapath_type=netdev
> $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1
> type="afxdp" options:xdpmode=drv
> $ ovs-vsctl del-br ovs_pvp_br0
>
Thanks I can reproduce it. Will make sure to fix it .
>
> - High pmd usage on the statistics, even with no packets is this
> expected?
>
> $ ovs-appctl dpif-netdev/pmd-rxq-show
> pmd thread numa_id 0 core_id 1:
> isolated : false
> port: dpdk0 queue-id: 0 pmd usage: 0 %
> port: eno1 queue-id: 0 pmd usage: 49 %
>
> It goes up slowly and gets stuck at 49%
>
>
> - When doing the PVP testing I noticed that the physical port has odd/no
> tx statistics:
>
> $ ovs-ofctl dump-ports ovs_pvp_br0
> OFPST_PORT reply (xid=0x2): 3 ports
> port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0,
> crc=0
> tx pkts=0, bytes=0, drop=0, errs=0, coll=0
> port eno1: rx pkts=103256197, bytes=6195630508, drop=0, errs=0,
> frame=0, over=0, crc=0
> tx pkts=0, bytes=19789272440056, drop=0, errs=0, coll=0
> port tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0, frame=0,
> over=0, crc=0
> tx pkts=4058, bytes=502504, drop=0, errs=0, coll=0
>
I think the ixgbe driver has some issue.
If you run skb-mode, I think the stats are correct.
See patch [1/2] ixgbe: fix AF_XDP tx byte count
>
> - Packets larger than 1028 bytes are dropped. Guess this needs to be
> fixed, and we need to state that jumbo frames are not supported. Are you
> planning on adding this?
>
> Currently I can find not mentioning of MTU limitation in the
> documentation, or any code to prevent it from being changed above the
> supported limit.
>
>
> - ovs-vswitchd is still crashing or stops forwarding packets when trying
> to do
> PVP testing with Qemu that has a TAP interface doing XDP and running
> packets
> at wire speed to the 10G interface.
>
> When trying with lower volume packets it seems to work, so with 1%
> traffic
> rate, it forwards packets without any problems (148,771 pps). If I go
> to
> 10% the first couple of packet pass, then it stops forwarding. If
> it's not
> crashing I still see packets being received by eno1 flow rules, but
> no
> packets make it to the VM.
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x00000000009b2505 in netdev_linux_afxdp_batch_send (xsk=0x0,
> batch=batch at entry=0x7fc928005570) at lib/netdev-afxdp.c:654
> 654 ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count,
> (void **)elems_pop);
> [Current thread is 1 (Thread 0x7fc95e734700 (LWP 3926))]
> Missing separate debuginfos, use: dnf debuginfo-install
> openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64
> (gdb) bt
> #0 0x00000000009b2505 in netdev_linux_afxdp_batch_send (xsk=0x0,
> batch=batch at entry=0x7fc928005570) at lib/netdev-afxdp.c:654
> #1 0x00000000009a1850 in netdev_linux_send (netdev_=0x2f7f540,
> qid=<optimized out>, batch=0x7fc928005570, concurrent_txq=<optimized
> out>) at lib/netdev-linux.c:1486
> #2 0x0000000000906051 in netdev_send (netdev=<optimized out>,
> qid=qid at entry=0, batch=batch at entry=0x7fc928005570,
> concurrent_txq=concurrent_txq at entry=true)
> at lib/netdev.c:797
> #3 0x00000000008d2c94 in dp_netdev_pmd_flush_output_on_port
> (pmd=pmd at entry=0x7fc95e735010, p=p at entry=0x7fc928005540) at
> lib/dpif-netdev.c:4185
> #4 0x00000000008d2faf in dp_netdev_pmd_flush_output_packets
> (pmd=pmd at entry=0x7fc95e735010, force=force at entry=false) at
> lib/dpif-netdev.c:4225
> #5 0x00000000008db317 in dp_netdev_pmd_flush_output_packets
> (force=false, pmd=0x7fc95e735010) at lib/dpif-netdev.c:4280
> #6 dp_netdev_process_rxq_port (pmd=pmd at entry=0x7fc95e735010,
> rxq=0x2f36c50, port_no=1) at lib/dpif-netdev.c:4280
> #7 0x00000000008db67d in pmd_thread_main (f_=<optimized out>) at
> lib/dpif-netdev.c:5446
> #8 0x000000000095c96d in ovsthread_wrapper (aux_=<optimized out>)
> at lib/ovs-thread.c:352
> #9 0x00007fc9789d62de in start_thread () from
> /lib64/libpthread.so.0
> #10 0x00007fc97817ba63 in clone () from /lib64/libc.so.6
>
>
> - make check-afxpd is failing for me, however, make check-kernel works
> fine.
> Did not dive into it too much, but it fails here for all test cases,
> this is the same build I use for testing.
>
> ./system-afxdp-traffic.at:4: ovs-vsctl -- add-br br0 -- set Bridge
> br0 datapath_type=netdev
> protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15
> fail-mode=secure --
> --- /dev/null 2019-05-16 09:09:33.445562692 -0400
> +++
> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/at-groups/1/stderr 2019-05-17
> 05:46:20.506814939 -0400
> @@ -0,0 +1,2 @@
> +ovs-vsctl: Error detected while setting up 'br0'. See ovs-vswitchd
> log for details.
> +ovs-vsctl: The default log directory is
> "/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01".
> ovsdb-server.log:
> > 2019-05-17T09:46:20.437Z|00001|vlog|INFO|opened log file
> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovsdb-server.log
> > 2019-05-17T09:46:20.441Z|00002|ovsdb_server|INFO|ovsdb-server (Open
> vSwitch) 2.11.90
> ovs-vswitchd.log:
> > 2019-05-17T09:46:20.461Z|00001|vlog|INFO|opened log file
> /root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/ovs-vswitchd.log
> > 2019-05-17T09:46:20.462Z|00002|ovs_numa|INFO|Discovered 28 CPU cores
> on NUMA node 0
> > 2019-05-17T09:46:20.462Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes
> and 28 CPU cores
> >
> 2019-05-17T09:46:20.462Z|00004|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
> connecting...
> >
> 2019-05-17T09:46:20.462Z|00005|reconnect|INFO|unix:/root/home/OVS_master_DPDK_v18.11/ovs_github/tests/system-afxdp-testsuite.dir/01/db.sock:
> connected
> > 2019-05-17T09:46:20.465Z|00006|bridge|INFO|ovs-vswitchd (Open
> vSwitch) 2.11.90
> > 2019-05-17T09:46:20.505Z|00007|netdev_linux|WARN|ovs-netdev:
> creating tap device failed: Device or resource busy
I think the tap device is not cleared from previous settings.
Or do
#ip link del dev ovs-netdev
then check again.
> > 2019-05-17T09:46:20.508Z|00008|dpif|WARN|datapath ovs-netdev already
> exists but cannot be opened: No such device
> > 2019-05-17T09:46:20.508Z|00009|ofproto_dpif|ERR|failed to open
> datapath of type netdev: No such device
> > 2019-05-17T09:46:20.508Z|00010|ofproto|ERR|failed to open datapath
> br0: No such device
> > 2019-05-17T09:46:20.508Z|00011|bridge|ERR|failed to create bridge
> br0: No such device
> 1. system-afxdp-traffic.at:3: FAILED (system-afxdp-traffic.at:4)
>
>
>
>
> The following might be useful when combining DPDK and AF_XDP:
>
> Currently, DPDK and AF_XDP polling can be combined on a single PMD
> thread, it
> might be nice to have an option to not do this, i.e. have separate
> PMD
> threads for each type. I know we can do this with assigning specific
> PMDs to
> queues, but this will disable auto-balancing. This will also help
> later if
> we would add poll() mode support for AF_XDP.
>
>
> Other review comments see inline below. I reviewed the code, not the
> unit tests or automake changes.
>
<snip>
> > +.. note::
> > + OVS AF_XDP netdev is using the userspace datapath, the same
> > datapath
> > + as used by OVS-DPDK. So it requires --disable-system for
> > ovs-vswitchd
> > + and datapath_type=netdev when adding a new bridge.
>
> As mentioned earlier offline I think --disable-system can be removed as
> the Kernel and userspace datapath can be run at the same time.
>
Yes, thanks
> > +
> > +Make sure your device driver support AF_XDP, and to use 1 PMD (on
> > core 4)
> > +on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask,
> > +pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or
> > "skb"::
>
> Wondering how options:xdpmode should operate without it being specified?
> I would prefer that if the option is not specified it would try drv, and
> if it fails fallback to skb.
>
by default it is using skb mode. I prefer skb-mode as default since
it has less driver-related issues.
> We need to add these new options to the vswitch.xml file
OK
>
> > +
> > + ethtool -L enp2s0 combined 1
> > + ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
> > + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp"
> > \
> > + options:n_rxq=1 options:xdpmode=drv \
> > + other_config:pmd-rxq-affinity="0:4"
> > +
> > +Or, use 4 pmds/cores and 4 queues by doing::
> > +
> > + ethtool -L enp2s0 combined 4
> > + ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x36
> > + ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp"
> > \
> > + options:n_rxq=4 options:xdpmode=drv \
> > + other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
> > +
>
> Add some text that pmd-rxq-affinity is not a requirement, the system
> will auto (re)assign.
> Also, note that cores used by pmd-rxq-affinity are not shared/used by
> floating PMDs.
Good point, thanks
>
> > +To validate that the bridge has successfully instantiated, you can
> > use the::
> > +
> > + ovs-vsctl show
> > +
> > +should show something like::
> > +
> > + Port "ens802f0"
> > + Interface "ens802f0"
> > + type: afxdp
> > + options: {n_rxq="1", xdpmode=drv}
> > +
> > +Otherwise, enable debug by::
> > +
> > + ovs-appctl vlog/set netdev_afxdp::dbg
> > +
> > +References
> > +----------
> > +Most of the design details are described in the paper presented at
> > +Linux Plumber 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
> > +section 4, and slides[2][4].
> > +"The Path to DPDK Speeds for AF XDP"[3] gives a very good
> > introduction
> > +about AF_XDP current and future work.
> > +
> > +
> > +[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
> > +
> > +[2]
> > http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
> > +
> > +[3]
> > http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
> > +
> > +[4]
> > https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
> > +
> > +
> > +Performance Tuning
> > +------------------
> > +The name of the game is to keep your CPU running in userspace,
> > allowing PMD
> > +to keep polling the AF_XDP queues without any interferences from
> > kernel.
> > +
> > +#. Make sure everything is in the same NUMA node (memory used by
> > AF_XDP, pmd
> > + running cores, device plug-in slot)
>
> How can you do this? The code is not taking care of NUMA, and memory is
> allocated with posix_memalign so no idea which NUMA node it gets
> allocated.
right... I'm hoping that users can be aware of this and run ovs-vswitchd
using taskset -p <cpu mask>
So running the process on the correct NUMA node.
>
> > +#. Isolate your CPU by doing isolcpu at grub configure.
> > +
> > +#. IRQ should not set to pmd running core.
> > +
> > +#. The Spectre and Meltdown fixes increase the overhead of system
> > calls.
> > +
>
> Maybe be more consistent, either one or two newlines before a heading?
OK
<snip>
> > +
> > +Create OpenFlow rules::
> > +
> > + ovs-vsctl add-port br0 tap0
>
> Maybe add tap as XDP or else it will be an AF_PACKET interface polling
> in the main thread.
I think it should work (XDP on tap). Let me try.
What's your concern here about polling AF_PACKET?
>
> > + ovs-ofctl del-flows br0
> > + ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
> > + ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
> > +
> > +
> > +Attach the veth port to br0 (linux kernel mode)::
> > +
> > + ovs-vsctl add-port br0 afxdp-p0 -- \
> > + set interface afxdp-p0 options:n_rxq=1 options:xdpmode=skb
> > +
>
> Remove the xdpmode=skb above... Also, see above on the PF_PACKET
> interface in the bridge_run(),
> I would advise against using this, and you might want to remove it.
OK
>
> > +
> > +Or, use AF_XDP with skb mode::
> > +
> > + ovs-vsctl add-port br0 afxdp-p0 -- \
> > + set interface afxdp-p0 type="afxdp" options:n_rxq=1
> > options:xdpmode=skb
> > +
> > +Setup the OpenFlow rules::
> > +
<snip>
> > diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> > index 0976a35e758b..7d086dc5e860 100644
> > --- a/lib/dp-packet.c
> > +++ b/lib/dp-packet.c
> > @@ -22,6 +22,9 @@
> > #include "netdev-dpdk.h"
> > #include "openvswitch/dynamic-string.h"
> > #include "util.h"
> > +#ifdef HAVE_AF_XDP
> > +#include "netdev-afxdp.h"
> > +#endif
>
> Why the protection above? You do not do this in netdev-linux.c.
> Maybe you should move the #ifdef HAVE_AF_XDP inside the include file?
>
OK will fix it
> > static void
> > dp_packet_init__(struct dp_packet *b, size_t allocated, enum
> > dp_packet_source source)
> > @@ -59,6 +62,27 @@ dp_packet_use(struct dp_packet *b, void *base,
> > size_t allocated)
> > dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
> > }
> >
<snip>
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.c
> > @@ -0,0 +1,727 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + * http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#if !defined(__i386__) && !defined(__x86_64__)
> > +#error AF_XDP supported only for Linux on x86 or x86_64
>
> Any reason why we do not support PPC and ARM?
>
> > +#endif
> > +
> > +#include <config.h>
> > +
> > +#include "netdev-linux-private.h"
> > +#include "netdev-linux.h"
>
> Swap the two above, see comment in netdev-linux-private.h
>
> > +#include "netdev-afxdp.h"
> > +
> > +#include <arpa/inet.h>
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <inttypes.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <linux/rtnetlink.h>
> > +#include <linux/sockios.h>
> > +#include <linux/if_xdp.h>
> > +#include <net/if.h>
> > +#include <net/if_arp.h>
> > +#include <net/route.h>
> > +#include <netinet/in.h>
> > +#include <netpacket/packet.h>
> > +#include <poll.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <sys/utsname.h>
> > +#include <unistd.h>
> > +
>
> Some of these includes are included by netdev-linux(-private).h already
> so why not remove them?
OK will remove them.
>
> > +#include "coverage.h"
> > +#include "dp-packet.h"
> > +#include "dpif-netlink.h"
> > +#include "dpif-netdev.h"
> > +#include "fatal-signal.h"
> > +#include "hash.h"
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "netlink-notifier.h"
> > +#include "netlink-socket.h"
> > +#include "netlink.h"
> > +#include "netnsid.h"
> > +#include "openflow/openflow.h"
> > +#include "openvswitch/dynamic-string.h"
> > +#include "openvswitch/hmap.h"
> > +#include "openvswitch/ofpbuf.h"
> > +#include "openvswitch/poll-loop.h"
> > +#include "openvswitch/vlog.h"
> > +#include "openvswitch/shash.h"
> > +#include "ovs-atomic.h"
> > +#include "packets.h"
> > +#include "rtnetlink.h"
> > +#include "socket-util.h"
> > +#include "sset.h"
> > +#include "tc.h"
> > +#include "timer.h"
> > +#include "unaligned.h"
> > +#include "util.h"
> > +#include "xdpsock.h"
> > +
> > +#ifndef SOL_XDP
> > +#define SOL_XDP 283
> > +#endif
> > +#ifndef AF_XDP
> > +#define AF_XDP 44
> > +#endif
> > +#ifndef PF_XDP
> > +#define PF_XDP AF_XDP
> > +#endif
>
> Do we really need to include the above? Or should we update the install
> instruction to move them over from the kernel headers?
>
I think we only need SOL_XDP.
> > +
> > +VLOG_DEFINE_THIS_MODULE(netdev_afxdp);
> > +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> > +
> > +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char
> > *)base))
> > +#define UMEM2XPKT(base, i) \
> > + ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base
> > + \
> > + i * sizeof(struct dp_packet_afxdp))
> > +
<snip>
Some comments here about umem and queue are discussed later with Ilya.
Will address them together.
> > +int
> > +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > + struct dp_packet_batch *batch)
> > +{
>
> See Ilya's comment on thread safety on the ring APIs.
>
> > + struct umem_elem *elems_pop[BATCH_SIZE];
> > + struct umem_elem *elems_push[BATCH_SIZE];
> > + uint32_t tx_done, idx_cq = 0;
> > + struct dp_packet *packet;
> > + uint32_t idx = 0;
> > + int j, ret, retry_count = 0;
> > + const int max_retry = 4;
> > +
> > + ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void
> > **)elems_pop);
> > + if (OVS_UNLIKELY(ret)) {
> > + return EAGAIN;
> > + }
> > +
> > + /* Make sure we have enough TX descs */
> > + ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx);
> > + if (OVS_UNLIKELY(ret == 0)) {
> > + umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
> > **)elems_pop);
> > + return EAGAIN;
> > + }
> > +
> > + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > + struct umem_elem *elem;
> > + uint64_t index;
> > +
> > + elem = elems_pop[i];
> > + /* Copy the packet to the umem we just pop from umem pool.
> > + * We can avoid this copy if the packet and the pop umem
> > + * are located in the same umem.
> > + */
>
> The comment mentions the copy can be avoided, but it's not implemented
> in the code, is this correct or was something removed?
>
> > + memcpy(elem, dp_packet_data(packet), dp_packet_size(packet));
it's not implemented yet. now it's always making a copy
> > +
> > + index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
> > + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index;
> > + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len
> > + = dp_packet_size(packet);
> > + }
> > + xsk_ring_prod__submit(&xsk->tx, batch->count);
> > + xsk->outstanding_tx += batch->count;
> > +
> > + ret = kick_tx(xsk);
> > + if (OVS_UNLIKELY(ret)) {
> > + umem_elem_push_n(&xsk->umem->mpool, batch->count, (void
> > **)elems_pop);
> > + VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
> > + ovs_strerror(ret));
> > + return ret;
>
> I think we should still try to recover the CQ below, even on failure.
>
> > + }
> > +
> > +retry:
> > + /* Process CQ */
> > + tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count,
> > &idx_cq);
> > + if (tx_done > 0) {
> > + xsk->outstanding_tx -= tx_done;
> > + xsk->tx_npkts += tx_done;
> > + }
> > +
> > + /* Recycle back to umem pool */
> > + for (j = 0; j < tx_done; j++) {
> > + struct umem_elem *elem;
> > + uint64_t addr;
> > +
> > + addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++);
> > +
> > + elem = ALIGNED_CAST(struct umem_elem *,
> > + (char *)xsk->umem->buffer + addr);
> > + elems_push[j] = elem;
> > + }
> > +
> > + ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void
> > **)elems_push);
> > + ovs_assert(ret == 0);
> > +
> > + xsk_ring_cons__release(&xsk->umem->cq, tx_done);
> > +
> > + if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2))
> > {
> > + /* If there are still a lot not transmitted, try harder. */
> > + if (retry_count++ > max_retry) {
> > + return 0;
> > + }
> > + goto retry;
> > + }
> > +
>
> I think the code above is causing my lockup at wire speed mentioned
> above...
> I guess the retry_count expires every transmit sending packets to the
> TAP interface.
> No all buffers are used... This is causing the umem_elem_pop_n() in the
> beginning to fail, hence the buffers are never returned!
>
> Guess we might need some reclaim in the beginning, or maybe even in the
> rx loop?
right, let me re-work this part of code.
> > + return 0;
> > +}
> > diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h
> > new file mode 100644
> > index 000000000000..6518d8fca0b5
> > --- /dev/null
> > +++ b/lib/netdev-afxdp.h
> > @@ -0,0 +1,53 @@
> > +/*
> > + * Copyright (c) 2018 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + * http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_AFXDP_H
> > +#define NETDEV_AFXDP_H 1
> > +
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +/* These functions are Linux AF_XDP specific, so they should be used
> > directly
> > + * only by Linux-specific code. */
>
> Extra enter?
>
> > +#define MAX_XSKQ 16
>
> Extra enter?
>
OK
> > +struct netdev;
> > +struct xsk_socket_info;
> > +struct xdp_umem;
> > +struct dp_packet_batch;
> > +struct smap;
> > +struct dp_packet;
> > +
> > +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct dp_packet
> > *d);
> > +
> > +int xsk_configure_all(struct netdev *netdev);
> > +
> > +void xsk_destroy_all(struct netdev *netdev);
> > +
> > +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk,
> > + struct dp_packet_batch *batch);
> > +
> > +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk,
> > + struct dp_packet_batch *batch);
> > +
> > +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap
> > *args,
> > + char **errp);
> > +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap
> > *args);
> > +int netdev_afxdp_get_numa_id(const struct netdev *netdev);
> > +
> > +void free_afxdp_buf(struct dp_packet *p);
> > +void free_afxdp_buf_batch(struct dp_packet_batch *batch);
> > +int netdev_afxdp_reconfigure(struct netdev *netdev);
> > +#endif /* netdev-afxdp.h */
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > new file mode 100644
> > index 000000000000..3dd3d902b3c4
> > --- /dev/null
> > +++ b/lib/netdev-linux-private.h
> > @@ -0,0 +1,124 @@
> > +/*
> > + * Copyright (c) 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + * http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef NETDEV_LINUX_PRIVATE_H
> > +#define NETDEV_LINUX_PRIVATE_H 1
> > +
> > +#include <config.h>
> > +
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
>
> Why include all the above? They where just added to netdev-linux.h, so
> if you make sure you include netdev-lunux.h before -private it should
> work out.
>
> > +
> > +#if HAVE_AF_XDP
> > +#include "netdev-afxdp.h"
> > +#endif
>
> See earlier comment
>
> > +
> > +/* These functions are Linux specific, so they should be used
> > directly only by
> > + * Linux-specific code. */
> > +
> > +struct netdev;
> > +
> > +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t
> > flag,
> > + const char *flag_name, bool
> > enable);
> > +int linux_get_ifindex(const char *netdev_name);
> > +
>
> These functions are now both specified in netdev-linux.h and
> netdev-linux-private.h
>
> > +#define LINUX_FLOW_OFFLOAD_API \
> > + .flow_flush = netdev_tc_flow_flush, \
> > + .flow_dump_create = netdev_tc_flow_dump_create, \
> > + .flow_dump_destroy = netdev_tc_flow_dump_destroy, \
> > + .flow_dump_next = netdev_tc_flow_dump_next, \
> > + .flow_put = netdev_tc_flow_put, \
> > + .flow_get = netdev_tc_flow_get, \
> > + .flow_del = netdev_tc_flow_del, \
> > + .init_flow_api = netdev_tc_init_flow_api
> > +
>
> Same here, this define is in both include files.
>
> > +struct netdev_linux {
> > + struct netdev up;
> > +
> > + /* Protects all members below. */
> > + struct ovs_mutex mutex;
> > +
> > + unsigned int cache_valid;
> > +
> > + bool miimon; /* Link status of last poll. */
> > + long long int miimon_interval; /* Miimon Poll rate. Disabled if
> > <= 0. */
> > + struct timer miimon_timer;
> > +
> > + int netnsid; /* Network namespace ID. */
> > + /* The following are figured out "on demand" only. They are only
> > valid
> > + * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > + int ifindex;
> > + struct eth_addr etheraddr;
> > + int mtu;
> > + unsigned int ifi_flags;
> > + long long int carrier_resets;
> > + uint32_t kbits_rate; /* Policing data. */
> > + uint32_t kbits_burst;
> > + int vport_stats_error; /* Cached error code from
> > vport_get_stats().
> > + 0 or an errno value. */
> > + int netdev_mtu_error; /* Cached error code from SIOCGIFMTU
> > + * or SIOCSIFMTU.
> > + */
> > + int ether_addr_error; /* Cached error code from set/get
> > etheraddr. */
> > + int netdev_policing_error; /* Cached error code from set
> > policing. */
> > + int get_features_error; /* Cached error code from
> > ETHTOOL_GSET. */
> > + int get_ifindex_error; /* Cached error code from
> > SIOCGIFINDEX. */
> > +
> > + enum netdev_features current; /* Cached from ETHTOOL_GSET. */
> > + enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > + enum netdev_features supported; /* Cached from ETHTOOL_GSET. */
> > +
> > + struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO.
> > */
> > + struct tc *tc;
> > +
> > + /* For devices of class netdev_tap_class only. */
> > + int tap_fd;
> > + bool present; /* If the device is present in the
> > namespace */
> > + uint64_t tx_dropped; /* tap device can drop if the iface
> > is down */
> > +
> > + /* LAG information. */
> > + bool is_lag_master; /* True if the netdev is a LAG
> > master. */
> > +
> > + /* AF_XDP information */
> > +#ifdef HAVE_AF_XDP
> > + struct xsk_socket_info *xsk[MAX_XSKQ];
> > + int requested_n_rxq;
> > + int xdpmode, requested_xdpmode; /* detect mode changed */
> > + int xdp_flags, xdp_bind_flags;
> > +#endif
> > +};
> > +
> > +static struct netdev_linux *
> > +netdev_linux_cast(const struct netdev *netdev)
> > +{
>
> In the original definition there was an assert() here, was it removed by
> accident?
> netdev_linux_rxq_xsk
> > + return CONTAINER_OF(netdev, struct netdev_linux, up);
> > +}
> > +
> > +#endif /* netdev-linux-private.h */
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index f75d73fd39f8..1f190406d145 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -17,6 +17,7 @@
> > #include <config.h>
> >
> > #include "netdev-linux.h"
> > +#include "netdev-linux-private.h"
> >
> > #include <errno.h>
> > #include <fcntl.h>
> > @@ -54,6 +55,7 @@
> > #include "fatal-signal.h"
> > #include "hash.h"
> > #include "openvswitch/hmap.h"
> > +#include "netdev-afxdp.h"
> > #include "netdev-provider.h"
> > #include "netdev-tc-offloads.h"
> > #include "netdev-vport.h"
> > @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu);
> > static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int
> > mtu);
> > static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t
> > burst_bytes);
> >
> > -struct netdev_linux {
> > - struct netdev up;
> > -
> > - /* Protects all members below. */
> > - struct ovs_mutex mutex;
> > -
> > - unsigned int cache_valid;
> > -
> > - bool miimon; /* Link status of last poll. */
> > - long long int miimon_interval; /* Miimon Poll rate. Disabled if
> > <= 0. */
> > - struct timer miimon_timer;
> > -
> > - int netnsid; /* Network namespace ID. */
> > - /* The following are figured out "on demand" only. They are only
> > valid
> > - * when the corresponding VALID_* bit in 'cache_valid' is set. */
> > - int ifindex;
> > - struct eth_addr etheraddr;
> > - int mtu;
> > - unsigned int ifi_flags;
> > - long long int carrier_resets;
> > - uint32_t kbits_rate; /* Policing data. */
> > - uint32_t kbits_burst;
> > - int vport_stats_error; /* Cached error code from
> > vport_get_stats().
> > - 0 or an errno value. */
> > - int netdev_mtu_error; /* Cached error code from SIOCGIFMTU
> > or SIOCSIFMTU. */
> > - int ether_addr_error; /* Cached error code from set/get
> > etheraddr. */
> > - int netdev_policing_error; /* Cached error code from set
> > policing. */
> > - int get_features_error; /* Cached error code from
> > ETHTOOL_GSET. */
> > - int get_ifindex_error; /* Cached error code from
> > SIOCGIFINDEX. */
> > -
> > - enum netdev_features current; /* Cached from ETHTOOL_GSET. */
> > - enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
> > - enum netdev_features supported; /* Cached from ETHTOOL_GSET. */
> > -
> > - struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO.
> > */
> > - struct tc *tc;
> > -
> > - /* For devices of class netdev_tap_class only. */
> > - int tap_fd;
> > - bool present; /* If the device is present in the
> > namespace */
> > - uint64_t tx_dropped; /* tap device can drop if the iface
> > is down */
> > -
> > - /* LAG information. */
> > - bool is_lag_master; /* True if the netdev is a LAG
> > master. */
> > -};
> >
> > struct netdev_rxq_linux {
> > struct netdev_rxq up;
> > @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct netdev_class
> > *netdev_class)
> > return netdev_class->run == netdev_linux_run;
> > }
> >
> > +#if HAVE_AF_XDP
> > static bool
> > -is_tap_netdev(const struct netdev *netdev)
> > +is_afxdp_netdev(const struct netdev *netdev)
> > {
> > - return netdev_get_class(netdev) == &netdev_tap_class;
> > + return netdev_get_class(netdev) == &netdev_afxdp_class;
> > }
> > -
> > -static struct netdev_linux *
> > -netdev_linux_cast(const struct netdev *netdev)
> > +#else
> > +static bool
> > +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED)
> > {
> > - ovs_assert(is_netdev_linux_class(netdev_get_class(netdev)));
> > -
> > - return CONTAINER_OF(netdev, struct netdev_linux, up);
> > + return false;
> > +}
> > +#endif
> > +static bool
> > +is_tap_netdev(const struct netdev *netdev)
> > +{
> > + return netdev_get_class(netdev) == &netdev_tap_class;
> > }
> >
> > static struct netdev_rxq_linux *
> > @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_)
> > atomic_count_dec(&miimon_cnt);
> > }
> >
> > +#if HAVE_AF_XDP
> > + if (is_afxdp_netdev(netdev_)) {
> > + xsk_destroy_all(netdev_);
> > + }
> > +#endif
>
> Think you can remove the HAVE_AF_XDP here, as you do not use it below
> either.
Yes
>
> > ovs_mutex_destroy(&netdev->mutex);
> > }
> >
> > @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq
> > *rxq_)
> > rx->is_tap = is_tap_netdev(netdev_);
> > if (rx->is_tap) {
> > rx->fd = netdev->tap_fd;
> > - } else {
> > + } else if (!is_afxdp_netdev(netdev_)) {
> > struct sockaddr_ll sll;
> > int ifindex, val;
> > /* Result of tcpdump -dd inbound */
> > @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_,
> > struct dp_packet_batch *batch,
> > {
> > struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> > struct netdev *netdev = rx->up.netdev;
> > - struct dp_packet *buffer;
> > + struct dp_packet *buffer = NULL;
> > ssize_t retval;
> > int mtu;
> >
> > +#if HAVE_AF_XDP
>
> Think this #if HAVE_AF_XDP can be removed as the compiler should
> optimize out the if (false).
>
> > + if (is_afxdp_netdev(netdev)) {
> > + struct netdev_linux *dev = netdev_linux_cast(netdev);
> > + int qid = rxq_->queue_id;
> > +
> > + return netdev_linux_rxq_xsk(dev->xsk[qid], batch);
> > + }
> > +#endif
> > if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) {
> > mtu = ETH_PAYLOAD_MAX;
> > }
> > @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_,
> > struct dp_packet_batch *batch,
> > /* Assume Ethernet port. No need to set packet_type. */
> > buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> > DP_NETDEV_HEADROOM);
> > +
> > retval = (rx->is_tap
> > ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
> > : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> > @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int
> > qid OVS_UNUSED,
> > int error = 0;
> > int sock = 0;
> >
> > - if (!is_tap_netdev(netdev_)) {
> > + if (!is_tap_netdev(netdev_) &&
> > + !is_afxdp_netdev(netdev_)) {
> > if
> > (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
> > error = EOPNOTSUPP;
> > goto free_batch;
> > @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int
> > qid OVS_UNUSED,
> > }
> >
> > error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> > +#if HAVE_AF_XDP
>
> Same here remove the #if HAVE_AF_XDP
>
> > + } else if (is_afxdp_netdev(netdev_)) {
> > + struct netdev_linux *dev = netdev_linux_cast(netdev_);
> > + struct dp_packet_afxdp *xpacket;
> > + struct umem_pool *first_mpool;
> > + struct dp_packet *packet;
> > +
> > + error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch);
> > +
> > + /* all packets must come frome the same umem pool
> > + * and has DPBUF_AFXDP type, otherwise free on-by-one
> > + */
> > + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > + if (packet->source != DPBUF_AFXDP) {
> > + goto free_batch;
> > + }
> > +
> > + xpacket = dp_packet_cast_afxdp(packet);
> > + if (i == 0) {
> > + first_mpool = xpacket->mpool;
> > + continue;
> > + }
> > + if (xpacket->mpool != first_mpool) {
> > + goto free_batch;
> > + }
> > + }
>
> Why do not we not move all the packet type checks to
> free_afxdp_buf_batch()?
Here I plan to move them into the afxdp-specific send function.
So it won't be part of netdev_linux_send(), hopefully it's more clean.
>
> > + /* free in batch */
> > + free_afxdp_buf_batch(batch);
> > + return error;
> > +#endif
> > } else {
> > error = netdev_linux_tap_batch_send(netdev_, batch);
> > }
> > @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = {
> > NETDEV_LINUX_CLASS_COMMON,
> > LINUX_FLOW_OFFLOAD_API,
> > .type = "system",
> > + .is_pmd = false,
> > .construct = netdev_linux_construct,
> > .get_stats = netdev_linux_get_stats,
> > .get_features = netdev_linux_get_features,
> > @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = {
> > const struct netdev_class netdev_tap_class = {
> > NETDEV_LINUX_CLASS_COMMON,
> > .type = "tap",
> > + .is_pmd = false,
> > .construct = netdev_linux_construct_tap,
> > .get_stats = netdev_tap_get_stats,
> > .get_features = netdev_linux_get_features,
> > @@ -3343,10 +3352,26 @@ const struct netdev_class
> > netdev_internal_class = {
> > NETDEV_LINUX_CLASS_COMMON,
> > LINUX_FLOW_OFFLOAD_API,
> > .type = "internal",
> > + .is_pmd = false,
> > .construct = netdev_linux_construct,
> > .get_stats = netdev_internal_get_stats,
> > .get_status = netdev_internal_get_status,
> > };
> > +
> > +#ifdef HAVE_AF_XDP
> > +const struct netdev_class netdev_afxdp_class = {
> > + NETDEV_LINUX_CLASS_COMMON,
> > + .type = "afxdp",
> > + .is_pmd = true,
> > + .construct = netdev_linux_construct,
> > + .get_stats = netdev_linux_get_stats,
> > + .get_status = netdev_linux_get_status,
> > + .set_config = netdev_afxdp_set_config,
> > + .get_config = netdev_afxdp_get_config,
> > + .reconfigure = netdev_afxdp_reconfigure,
> > + .get_numa_id = netdev_afxdp_get_numa_id,
> > +};
> > +#endif
> >
> >
> > #define CODEL_N_QUEUES 0x0000
> > diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
> > index 17ca9120168a..b812e64cb078 100644
> > --- a/lib/netdev-linux.h
> > +++ b/lib/netdev-linux.h
> > @@ -19,6 +19,20 @@
> >
> > #include <stdint.h>
> > #include <stdbool.h>
> > +#include <linux/filter.h>
> > +#include <linux/gen_stats.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_tun.h>
> > +#include <linux/types.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/mii.h>
> > +
> > +#include "netdev-provider.h"
> > +#include "netdev-tc-offloads.h"
> > +#include "netdev-vport.h"
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +#include "timer.h"
>
> Is there a reason why you move all these includes here? If there is you
> might as well remove the duplicates from .c files that include
> netdev-linux.h, for example, netdev-linux.c
<snip>
Will work on this part later.
> > +static inline void
> > +ovs_spin_unlock(ovs_spinlock_t *sl)
> > +{
> > + atomic_store_explicit(&sl->locked, 0, memory_order_release);
> > +}
> > +
> > +static inline int OVS_UNUSED
> > +ovs_spin_trylock(ovs_spinlock_t *sl)
> > +{
> > + int exp = 0;
> > + return atomic_compare_exchange_strong_explicit(&sl->locked, &exp,
> > 1,
> > + memory_order_acquire,
> > + memory_order_relaxed);
> > +}
>
> Move spinlock function out to a common file
>
OK, I plan to add lib/spinlock.h and move to it.
> > +
> > +inline int
> > +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > + void *ptr;
> > +
> > + if (OVS_UNLIKELY(umemp->index + n > umemp->size)) {
>
> This is a stack overflow
>
> > + return -ENOMEM;
> > + }
> > +
> > + ptr = &umemp->array[umemp->index];
> > + memcpy(ptr, addrs, n * sizeof(void *));
> > + umemp->index += n;
> > +
> > + return 0;
> > +}
> > +
> > +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > + int ret;
> > +
> > + ovs_spin_lock(&umemp->mutex);
> > + ret = __umem_elem_push_n(umemp, n, addrs);
> > + ovs_spin_unlock(&umemp->mutex);
> > +
> > + return ret;
> > +}
> > +
> > +inline void
> > +__umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > + umemp->array[umemp->index++] = addr;
> > +}
> > +
> > +void
> > +umem_elem_push(struct umem_pool *umemp, void *addr)
> > +{
> > +
> > + if (OVS_UNLIKELY(umemp->index >= umemp->size)) {
> > + /* stack is overflow, this should not happen */
> > + OVS_NOT_REACHED();
> > + }
>
> Should this not be moved after the spinlock, i.e. to __umem_elem_push
OK
>
> > +
> > + ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0);
> > +
> > + ovs_spin_lock(&umemp->mutex);
> > + __umem_elem_push(umemp, addr);
> > + ovs_spin_unlock(&umemp->mutex);
> > +}
> > +
> > +inline int
> > +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > + void *ptr;
> > +
> > + if (OVS_UNLIKELY(umemp->index - n < 0)) {
> > + return -ENOMEM;
> > + }
> > +
> > + umemp->index -= n;
> > + ptr = &umemp->array[umemp->index];
> > + memcpy(addrs, ptr, n * sizeof(void *));
> > +
> > + return 0;
> > +}
> > +
> > +int
> > +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs)
> > +{
> > + int ret;
> > +
> > + ovs_spin_lock(&umemp->mutex);
> > + ret = __umem_elem_pop_n(umemp, n, addrs);
> > + ovs_spin_unlock(&umemp->mutex);
> > +
> > + return ret;
> > +}
> > +
> > +inline void *
> > +__umem_elem_pop(struct umem_pool *umemp)
> > +{
>
> There is no check here to see if there are actual any elements left,
> like there is for pop_n,
> so we could corrupt memory/umem_pool
>
Yes, I will add a check.
> > + return umemp->array[--umemp->index];
> > +}
> > +
> > +void *
> > +umem_elem_pop(struct umem_pool *umemp)
> > +{
> > + void *ptr;
> > +
> > + ovs_spin_lock(&umemp->mutex);
> > + ptr = __umem_elem_pop(umemp);
> > + ovs_spin_unlock(&umemp->mutex);
> > +
> > + return ptr;
> > +}
> > +
> > +void **
> > +__umem_pool_alloc(unsigned int size)
> > +{
> > + void *bufs;
> > +
> > + ovs_assert(posix_memalign(&bufs, getpagesize(),
> > + size * sizeof(void *)) == 0);
>
> We should not assert, just return NULL here.
>
> > + memset(bufs, 0, size * sizeof(void *));
> > + return (void **)bufs;
> > +}
> > +
> > +unsigned int
> > +umem_elem_count(struct umem_pool *mpool)
> > +{
> > + return mpool->index;
> > +}
> > +
> > +int
> > +umem_pool_init(struct umem_pool *umemp, unsigned int size)
> > +{
> > + umemp->array = __umem_pool_alloc(size);
> > + if (!umemp->array) {
> > + OVS_NOT_REACHED();
>
> If NULL is returned return ENOMEM
>
> > + }
> > +
> > + umemp->size = size;
> > + umemp->index = 0;
> > + ovs_spinlock_init(&umemp->mutex);
> > + return 0;
> > +}
> > +
> > +void
> > +umem_pool_cleanup(struct umem_pool *umemp)
> > +{
> > + free(umemp->array);
> umemp->array = NULL;
> > +}
> > +
> > +/* AF_XDP metadata init/destroy */
> > +int
> > +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size)
> > +{
> > + void *bufs;
> > +
> > + /* TODO: check HAVE_POSIX_MEMALIGN */
>
> Guess the above needs to be done
>
> > + ovs_assert(posix_memalign(&bufs, getpagesize(),
> > + size * sizeof(struct dp_packet_afxdp))
> > == 0);
>
> We should not assert, just return false
>
> > + memset(bufs, 0, size * sizeof(struct dp_packet_afxdp));
> > +
> > + xp->array = bufs;
> > + xp->size = size;
> > + return 0;
> > +}
> > +
> > +void
> > +xpacket_pool_cleanup(struct xpacket_pool *xp)
> > +{
> > + free(xp->array);
> xp->array = NULL;
> > +}
> > diff --git a/lib/xdpsock.h b/lib/xdpsock.h
> > new file mode 100644
> > index 000000000000..aabaa8e5df24
> > --- /dev/null
> > +++ b/lib/xdpsock.h
> > @@ -0,0 +1,123 @@
> > +/*
> > + * Copyright (c) 2018, 2019 Nicira, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + * http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing,
> > software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions
> > and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef XDPSOCK_H
> > +#define XDPSOCK_H 1
> > +
> > +#include <bpf/libbpf.h>
> > +#include <bpf/xsk.h>
> > +#include <errno.h>
> > +#include <getopt.h>
> > +#include <libgen.h>
> > +#include <linux/bpf.h>
> > +#include <linux/if_link.h>
> > +#include <linux/if_xdp.h>
> > +#include <linux/if_ether.h>
> > +#include <locale.h>
> > +#include <net/if.h>
> > +#include <poll.h>
> > +#include <pthread.h>
> > +#include <signal.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/resource.h>
> > +#include <sys/socket.h>
> > +#include <sys/types.h>
> > +#include <sys/mman.h>
> > +#include <time.h>
> > +#include <unistd.h>
> > +
> > +#include "openvswitch/thread.h"
> > +#include "ovs-atomic.h"
> > +
> > +#define FRAME_HEADROOM XDP_PACKET_HEADROOM
> > +#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
> > +#define BATCH_SIZE NETDEV_MAX_BURST
>
> Move this item to the bottom, so you have FRAME specific define's first
>
> > +#define FRAME_SHIFT XSK_UMEM__DEFAULT_FRAME_SHIFT
> > +#define FRAME_SHIFT_MASK ((1 << FRAME_SHIFT) - 1)
> > +
> > +#define NUM_FRAMES 4096
>
> Should we add a note/check to make sure this value is a power of 2?
Sure, will do it.
>
> > +#define PROD_NUM_DESCS 512
> > +#define CONS_NUM_DESCS 512
> > +
> > +#ifdef USE_XSK_DEFAULT
> > +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
> > +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
> > +#endif
>
> Any reason for having this? Should we use the default values? They are
> 4x larger than you have, did it make any difference in performance
> results?
> We could make it configurable like for DPDK, using the
> n_txq_desc/n_rxq_desc option.
Will add this option later.
>
> > +
> > +typedef struct {
> > + atomic_int locked;
> > +} ovs_spinlock_t;
> > +
>
> Think we should move the ovs_spinlock code and includes to some global
> place, maybe util or thread
ok, will move to lib/spinlock.h
>
> > +/* LIFO ptr_array */
> > +struct umem_pool {
> > + int index; /* point to top */
> > + unsigned int size;
> > + ovs_spinlock_t mutex;
> > + void **array; /* a pointer array, point to umem buf */
> > +};
> > +
> > +/* array-based dp_packet_afxdp */
> > +struct xpacket_pool {
> > + unsigned int size;
> > + struct dp_packet_afxdp **array;
> > +};
> > +
> > +struct xsk_umem_info {
> > + struct umem_pool mpool;
> > + struct xpacket_pool xpool;
> > + struct xsk_ring_prod fq;
> > + struct xsk_ring_cons cq;
> > + struct xsk_umem *umem;
> > + void *buffer;
> > +};
> > +
> > +struct xsk_socket_info {
> > + struct xsk_ring_cons rx;
> > + struct xsk_ring_prod tx;
> > + struct xsk_umem_info *umem;
> > + struct xsk_socket *xsk;
> > + unsigned long rx_npkts;
> > + unsigned long tx_npkts;
> > + unsigned long prev_rx_npkts;
> > + unsigned long prev_tx_npkts;
> > + uint32_t outstanding_tx;
> > +};
> > +
> > +struct umem_elem {
> > + struct umem_elem *next;
> > +};
> > +
> > +void __umem_elem_push(struct umem_pool *umemp, void *addr);
> > +void umem_elem_push(struct umem_pool *umemp, void *addr);
> > +int __umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> > +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +void *__umem_elem_pop(struct umem_pool *umemp);
> > +void *umem_elem_pop(struct umem_pool *umemp);
> > +int __umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs);
> > +
> > +void **__umem_pool_alloc(unsigned int size);
> > +int umem_pool_init(struct umem_pool *umemp, unsigned int size);
> > +void umem_pool_cleanup(struct umem_pool *umemp);
> > +unsigned int umem_elem_count(struct umem_pool *mpool);
> > +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size);
> > +void xpacket_pool_cleanup(struct xpacket_pool *xp);
> > +
>
> Think all the __umem_* function are only used internally so they should
> be come static and be removed here.
>
OK.
Regards,
William
More information about the dev
mailing list