[ovs-dev] [PATCHv8] netdev-afxdp: add new netdev type for AF_XDP.

Fri May 17 12:39:03 UTC 2019

Hi.
Just a few comments to the issues you're listed.

Best regards, Ilya Maximets.

On 17.05.2019 13:23, Eelco Chaudron wrote:
> Hi William,
> 
> First a list of issues I found during some basic testing...
> 
> - When I restart or stop OVS (using the systemctl interface as found in RHEL) it does not clean up the BFP program causing the restart to fail:
> 
>   2019-05-10T09:12:11.384Z|00042|netdev_afxdp|ERR|AF_XDP device eno1 reconfig fails
>   2019-05-10T09:12:11.384Z|00043|dpif_netdev|ERR|Failed to set interface eno1 new configuration
> 
>   I need to manually run "ip link set dev eno1 xdp off" to make it recover.

Userspace datapath requires '--cleanup' option passed to 'ovs-appctl exit'
to clean up allocated resources. Otherwise datapath will not be destroyed,
i.e. netdev will not be destroyed --> no xdp program unloading.

> 
> - When I remove a bridge, I get an emer in the revalidator:
> 
>   2019-05-10T09:40:34.401Z|00045|netdev_afxdp|INFO|remove xdp program
>   2019-05-10T09:40:34.652Z|00001|util(revalidator49)|EMER|lib/poll-loop.c:111: assertion !fd != !wevent failed in poll_create_node()
> 

This actually should never happen. Looks like a memory corruption.

>   Easy to replicate with this:
> 
>     $ ovs-vsctl add-br ovs_pvp_br0 -- set bridge ovs_pvp_br0 datapath_type=netdev
>     $ ovs-vsctl add-port ovs_pvp_br0 eno1 -- set interface eno1 type="afxdp" options:xdpmode=drv
>     $ ovs-vsctl del-br ovs_pvp_br0
> 
> 
> - High pmd usage on the statistics, even with no packets is this expected?
> 
>   $ ovs-appctl dpif-netdev/pmd-rxq-show
>   pmd thread numa_id 0 core_id 1:
>     isolated : false
>     port: dpdk0             queue-id:  0  pmd usage:  0 %
>     port: eno1              queue-id:  0  pmd usage: 49 %
> 
>   It goes up slowly and gets stuck at 49%
> 
> 
> - When doing the PVP testing I noticed that the physical port has odd/no
>   tx statistics:
> 
>   $ ovs-ofctl dump-ports ovs_pvp_br0
>   OFPST_PORT reply (xid=0x2): 3 ports
>     port LOCAL: rx pkts=0, bytes=0, drop=0, errs=0, frame=0, over=0, crc=0
>              tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>     port  eno1: rx pkts=103256197, bytes=6195630508, drop=0, errs=0, frame=0, over=0, crc=0
>              tx pkts=0, bytes=19789272440056, drop=0, errs=0, coll=0
>     port  tapVM: rx pkts=4043, bytes=501278, drop=0, errs=0, frame=0, over=0, crc=0
>              tx pkts=4058, bytes=502504, drop=0, errs=0, coll=0
> 
> 
> - Packets larger than 1028 bytes are dropped. Guess this needs to be fixed, and we need to state that jumbo frames are not supported. Are you planning on adding this?
> 
>   Currently I can find not mentioning of MTU limitation in the documentation, or any code to prevent it from being changed above the supported limit.

Actually Jumbo frames are supported, but yes, the packet size
is limited by the page size. So, jumbo frames up to ~3.5K should
be supported without issues.
We'll need to determine the upper limit and reject requested mtu
if it's larger.

> 
> 
> - ovs-vswitchd is still crashing or stops forwarding packets when trying to do
>   PVP testing with Qemu that has a TAP interface doing XDP and running packets
>   at wire speed to the 10G interface.

Actually, there are a lot of places in current version where rings/umems could
be corrupted leading to unpredictable memory corruptions/crashes/time wasting
trying to allocate exhausted resources.

> 
>   When trying with lower volume packets it seems to work, so with 1% traffic
>   rate, it forwards packets without any problems (148,771 pps). If I go to
>   10% the first couple of packet pass, then it stops forwarding. If it's not
>   crashing I still see packets being received by eno1 flow rules, but no
>   packets make it to the VM.

<snip>

> 
> 
> 
> The following might be useful when combining DPDK and AF_XDP:
> 
>   Currently, DPDK and AF_XDP polling can be combined on a single PMD thread, it
>   might be nice to have an option to not do this, i.e. have separate PMD
>   threads for each type. I know we can do this with assigning specific PMDs to
>   queues, but this will disable auto-balancing. This will also help later if
>   we would add poll() mode support for AF_XDP.

This might make some sense, but certainly not on this stage of development.
I don't think that we should expect any production level performance or fine grained
solution that must work perfectly in all corner cases.
For now it's better to focus on making it reliable. At least to handle all the control
sequences (stop/start/restart/reconfigure) and ability to forward traffic without
hangs/crashes/memory corruptions.

<snip>

>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>> index 859c05613ddf..cc91720fad6e 100644
>> --- a/lib/dpif-netdev-perf.h
>> +++ b/lib/dpif-netdev-perf.h
>> @@ -198,6 +198,20 @@ cycles_counter_update(struct pmd_perf_stats *s)
>>  {
>>  #ifdef DPDK_NETDEV
>>      return s->last_tsc = rte_get_tsc_cycles();
>> +#elif HAVE_AF_XDP
> 
> We need to add support for at least ARM and PPC, not sure how to do this nicely.

IMHO, it's not required for the experimental feature. But yes,
we'll need to add support later. I'm thinking about fallback to
CLOCK_MONOTONIC_RAW (not that portable too) and further to just
time_usec().

> This code is already a quick cut/paste from DPDK, license?

Good question.
If you worried, I could gift a version written from scratch (I swear):

---
    uint32_t h, l;

    asm volatile("rdtsc" : "=a" (l), "=d" (h));

    return s->last_tsc = ((uint64_t) h << 32) | l;
---

> 
>> +    /* This is x86-specific instructions. */
>> +    union {
>> +        uint64_t tsc_64;
>> +        struct {
>> +            uint32_t lo_32;
>> +            uint32_t hi_32;
>> +        };
>> +    } tsc;
>> +    asm volatile("rdtsc" :
>> +             "=a" (tsc.lo_32),
>> +             "=d" (tsc.hi_32));
>> +
>> +    return s->last_tsc = tsc.tsc_64;
>>  #else
>>      return s->last_tsc = 0;
>>  #endif
>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>> new file mode 100644
>> index 000000000000..cd1b9ca8be77
>> --- /dev/null
>> +++ b/lib/netdev-afxdp.c
>> @@ -0,0 +1,727 @@
>> +/*
>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#if !defined(__i386__) && !defined(__x86_64__)
>> +#error AF_XDP supported only for Linux on x86 or x86_64
> 
> Any reason why we do not support PPC and ARM?

Simple: rdtsc.

Actually, right now we need to restrict support to only x86_64, because
above rdtsc is in 64bit form and will not work for 32bit cpu.

<snip>

>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
>> new file mode 100644
>> index 000000000000..3dd3d902b3c4
>> --- /dev/null
>> +++ b/lib/netdev-linux-private.h
>> @@ -0,0 +1,124 @@
>> +/*
>> + * Copyright (c) 2019 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef NETDEV_LINUX_PRIVATE_H
>> +#define NETDEV_LINUX_PRIVATE_H 1
>> +
>> +#include <config.h>
>> +
>> +#include <linux/filter.h>
>> +#include <linux/gen_stats.h>
>> +#include <linux/if_ether.h>
>> +#include <linux/if_tun.h>
>> +#include <linux/types.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/mii.h>
>> +#include <stdint.h>
>> +#include <stdbool.h>
>> +
>> +#include "netdev-provider.h"
>> +#include "netdev-tc-offloads.h"
>> +#include "netdev-vport.h"
>> +#include "openvswitch/thread.h"
>> +#include "ovs-atomic.h"
>> +#include "timer.h"
> 
> Why include all the above? They where just added to netdev-linux.h, so if you make sure you include netdev-lunux.h before -private it should work out.

This doesn't look right. File should include everything it uses.
If something from above headers used in this file, headers should
stay. But if the header is not used in this file, it should be
not included here. Otherwise we'll mess up all the includes.

<snip>

>> @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>  {
>>      struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
>>      struct netdev *netdev = rx->up.netdev;
>> -    struct dp_packet *buffer;
>> +    struct dp_packet *buffer = NULL;
>>      ssize_t retval;
>>      int mtu;
>>
>> +#if HAVE_AF_XDP
> 
> Think this #if HAVE_AF_XDP can be removed as the compiler should optimize out the if (false).

I guess this will cause build failure with -O0.

<snip>

>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h
>> new file mode 100644
>> index 000000000000..aabaa8e5df24
>> --- /dev/null
>> +++ b/lib/xdpsock.h
>> @@ -0,0 +1,123 @@
>> +/*
>> + * Copyright (c) 2018, 2019 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef XDPSOCK_H
>> +#define XDPSOCK_H 1
>> +
>> +#include <bpf/libbpf.h>
>> +#include <bpf/xsk.h>
>> +#include <errno.h>
>> +#include <getopt.h>
>> +#include <libgen.h>
>> +#include <linux/bpf.h>
>> +#include <linux/if_link.h>
>> +#include <linux/if_xdp.h>
>> +#include <linux/if_ether.h>
>> +#include <locale.h>
>> +#include <net/if.h>
>> +#include <poll.h>
>> +#include <pthread.h>
>> +#include <signal.h>
>> +#include <stdbool.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <sys/resource.h>
>> +#include <sys/socket.h>
>> +#include <sys/types.h>
>> +#include <sys/mman.h>
>> +#include <time.h>
>> +#include <unistd.h>
>> +
>> +#include "openvswitch/thread.h"
>> +#include "ovs-atomic.h"
>> +
>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
>> +#define BATCH_SIZE      NETDEV_MAX_BURST
> 
> Move this item to the bottom, so you have FRAME specific define's first
> 
>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
>> +#define FRAME_SHIFT_MASK    ((1 << FRAME_SHIFT) - 1)
>> +
>> +#define NUM_FRAMES      4096
> 
> Should we add a note/check to make sure this value is a power of 2?
> 
>> +#define PROD_NUM_DESCS  512
>> +#define CONS_NUM_DESCS  512
>> +
>> +#ifdef USE_XSK_DEFAULT
>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
>> +#endif
> 
> Any reason for having this? Should we use the default values? They are 4x larger than you have, did it make any difference in performance results?
> We could make it configurable like for DPDK, using the n_txq_desc/n_rxq_desc option.

I think, this is not necessary right now and could be done later.

<snip>