[ovs-dev] [PATCH] RFC: netdev-afxdp: Support for XDP metadata HW hints.

Jesper Dangaard Brouer brouer at redhat.com
Fri Mar 5 10:13:06 UTC 2021


Bjørn and Magnus, please take a look at my questions inline below.

On Thu,  4 Mar 2021 10:27:05 -0800
William Tu <u9012063 at gmail.com> wrote:

> One big problem of netdev-afxdp is that there is no metadata support
> from the hardware at all.  For example, OVS netdev-afxdp has to compute
> the RSS hash or the TCP checksum in software, resulting in high
> performance overhead.
> 
> A generic metadata type for XDP frames using BTF has been proposed[1],
> and there are sample implementations[2][3].  This patch experiments with
> enabling the XDP metadata, also called HW hints, and shows the potential
> performance improvement.  The patch uses only the rxhash value provided
> by the HW, avoiding the hash calculation at lib/dpif-netdev.c:
>     if (!dp_packet_rss_valid(execute->packet)) {
>         dp_packet_set_rss_hash(execute->packet,
>                                flow_hash_5tuple(execute->flow, 0));
>     }
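
Just to spell out the effect for other readers: the rx path (further down
in the patch) stores the HW-provided hash on the packet, which makes
dp_packet_rss_valid() return true, so the flow_hash_5tuple() fallback
quoted above is never reached.  A minimal sketch of that relationship
(function names as in lib/dp-packet.h):

       /* In the afxdp rx path: store the hash the NIC already computed. */
       dp_packet_set_rss_hash(packet, md->hash32);

       /* Later, in dpif-netdev, dp_packet_rss_valid(packet) is now true,
        * so the software 5-tuple hash above is skipped. */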
> 
> Using '$ ovs-appctl dpif-netdev/pmd-stats-show', the 'avg processing
> cycles per packet' drops from 402 to 272.  More details are below.
> 
> Reference:
> ----------
> [1] https://www.kernel.org/doc/html/latest/bpf/btf.html
> [2] https://netdevconf.info/0x14/pub/slides/54/[1]%20XDP%20meta%20data%20acceleration.pdf
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=topic/xdp_metadata4
> 
> Testbed:
> --------
> Two Xeon E5-2620 v3 2.4GHz machines connected back-to-back using
> Mellanox ConnectX-6Dx 25GbE NICs.  Before starting OVS, enable the
> metadata (MD) by:
> $ bpftool net xdp show
> xdp:
> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(0)
> enp2s0f1np1(5) md_btf_id(2) md_btf_enabled(0)
> $ bpftool net xdp set dev enp2s0f0np0 md_btf on
> $ bpftool net xdp
> xdp:
> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(1)
> 
> Limitations/TODO:
> -----------------
> 1. Supports only AF_XDP native mode, not zero-copy mode.
> 2. Currently only three fields (vlan, hash, and flow_mark), and only
>    the receive side supports XDP metadata.
> 3. The control plane (how to enable and probe the structure) is not
>    upstream yet.
> 
> OVS rxdrop without HW hints:
> ---------------------------
> Drop rate: 4.8Mpps
> 
> pmd thread numa_id 0 core_id 3:
>   packets received: 196592006
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 196592006
>   smc hits: 0
>   megaflow hits: 0
>   avg. subtable lookups per megaflow hit: 0.00
>   miss with success upcall: 0
>   miss with failed upcall: 0
>   avg. packets per output batch: 0.00
>   idle cycles: 56009063835 (41.43%)
>   processing cycles: 79164971931 (58.57%)
>   avg cycles per packet: 687.59 (135174035766/196592006)
>   avg processing cycles per packet: 402.69 (79164971931/196592006)
> 
> pmd thread numa_id 0 core_id 3:
>   Iterations:           339607649  (0.23 us/it)
>   - Used TSC cycles: 188620512777  ( 99.9 % of total cycles)
>   - idle iterations:    330697002  ( 40.3 % of used cycles)
>   - busy iterations:      8910647  ( 59.7 % of used cycles)
>   Rx packets:           285140031  (3624 Kpps, 395 cycles/pkt)
>   Datapath passes:      285140031  (1.00 passes/pkt)
>   - EMC hits:           285139999  (100.0 %)
>   - SMC hits:                   0  (  0.0 %)
>   - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
>   - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
>   - Lost upcalls:               0  (  0.0 %)
>   Tx packets:                   0
> 
> Perf report:
>   17.56%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>   14.39%  pmd-c03/id:11  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>   14.17%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_thread_main
>   10.86%  pmd-c03/id:11  [vdso]              [.] __vdso_clock_gettime
>   10.19%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_end_iteration
>    7.71%  pmd-c03/id:11  ovs-vswitchd        [.] time_timespec__
>    5.64%  pmd-c03/id:11  ovs-vswitchd        [.] time_usec
>    3.88%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_get_class
>    2.95%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_rxq_recv
>    2.78%  pmd-c03/id:11  libbpf.so.0.2.0     [.] xsk_socket__fd
>    2.74%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_start_iteration
>    2.11%  pmd-c03/id:11  libc-2.27.so        [.] __clock_gettime
>    1.32%  pmd-c03/id:11  ovs-vswitchd        [.] xsk_socket__fd at plt
> 
> OVS rxdrop with HW hints:
> -------------------------
> rxdrop rate: 4.73Mpps
> 
> pmd thread numa_id 0 core_id 7:
>   packets received: 13686880
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 13686880
>   smc hits: 0
>   megaflow hits: 0
>   avg. subtable lookups per megaflow hit: 0.00
>   miss with success upcall: 0
>   miss with failed upcall: 0
>   avg. packets per output batch: 0.00
>   idle cycles: 3182105544 (46.02%)
>   processing cycles: 3732023844 (53.98%)
>   avg cycles per packet: 505.16 (6914129388/13686880)
>   avg processing cycles per packet: 272.67 (3732023844/13686880)
> 
> pmd thread numa_id 0 core_id 7:
> 
>   Iterations:           392909539  (0.18 us/it)
>   - Used TSC cycles: 167697342678  ( 99.9 % of total cycles)
>   - idle iterations:    382539861  ( 46.0 % of used cycles)
>   - busy iterations:     10369678  ( 54.0 % of used cycles)
>   Rx packets:           331829656  (4743 Kpps, 273 cycles/pkt)
>   Datapath passes:      331829656  (1.00 passes/pkt)
>   - EMC hits:           331829656  (100.0 %)
>   - SMC hits:                   0  (  0.0 %)
>   - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
>   - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
>   - Lost upcalls:               0  (  0.0 %)
>   Tx packets:                   0
> 
> Perf record/report:
>   22.96%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>   10.43%  pmd-c07/id:8  ovs-vswitchd        [.] miniflow_extract
>    7.20%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_init__
>    7.00%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_input__
>    6.79%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>    6.62%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_thread_main
>    5.65%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
>    5.04%  pmd-c07/id:8  [vdso]              [.] __vdso_clock_gettime
>    3.60%  pmd-c07/id:8  ovs-vswitchd        [.] time_timespec__
>    3.10%  pmd-c07/id:8  ovs-vswitchd        [.] umem_elem_push
>    2.74%  pmd-c07/id:8  libc-2.27.so        [.] __memcmp_avx2_movbe
>    2.62%  pmd-c07/id:8  ovs-vswitchd        [.] time_usec
>    2.14%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_use_afxdp
>    1.58%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_rxq_recv
>    1.47%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_get_class
>    1.34%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_start_iteration
> 
> Signed-off-by: William Tu <u9012063 at gmail.com>
> ---
>  lib/netdev-afxdp.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> index 482400d8d135..49881a8cc0cb 100644
> --- a/lib/netdev-afxdp.c
> +++ b/lib/netdev-afxdp.c
> @@ -169,6 +169,17 @@ struct netdev_afxdp_tx_lock {
>      );
>  };
>  
> +/* FIXME:
> + * This should be done dynamically by querying the device's
> + * XDP metadata structure. Ex:
> + *   $ bpftool net xdp md_btf cstyle dev enp2s0f0np0
> + */
> +struct xdp_md_desc {

Adding:
       uint32_t btf_id;

Would be valuable IMHO; see the sketch just after the quoted struct below.

> +    uint32_t flow_mark;
> +    uint32_t hash32;
> +    uint16_t vlan;
> +};
> +
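
To make the suggestion concrete, a rough sketch (where the btf_id ends up
inside the meta-data area is just my assumption here; the authoritative
layout still has to come from the BTF the driver exports, e.g. via the
bpftool command in the FIXME above):

       /* Sketch: the struct from the patch plus a BTF ID that identifies
        * the layout, so userspace can verify it parses what the NIC
        * actually produced.  Placing the ID last would keep it at a fixed
        * offset from the packet data, independent of the struct size. */
       struct xdp_md_desc {
           uint32_t flow_mark;
           uint32_t hash32;
           uint16_t vlan;
           uint32_t btf_id;
       };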
>  #ifdef HAVE_XDP_NEED_WAKEUP
>  static inline void
>  xsk_rx_wakeup_if_needed(struct xsk_umem_info *umem,
> @@ -849,6 +860,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>          struct dp_packet_afxdp *xpacket;
>          const struct xdp_desc *desc;
>          struct dp_packet *packet;
> +        struct xdp_md_desc *md;
>          uint64_t addr, index;
>          uint32_t len;
>          char *pkt;
> @@ -858,6 +870,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>          len = desc->len;
>  
>          pkt = xsk_umem__get_data(umem->buffer, addr);
> +        md = (struct xdp_md_desc *) (pkt - sizeof *md);

We also want this sizeof-offset to be dynamic, and I hope that AF_XDP
could provide this info, i.e. the size of the data_meta area.  I need
some help from Magnus and Bjørn here.

Looking at the kernel-side code of AF_XDP/XSK, I don't see this being
transferred.  Is that true?  Can we add the info?
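
To make the question concrete, roughly what I imagine the rx loop could
do if the meta-data length were carried per frame (the helper below is
purely hypothetical, nothing like it exists in AF_XDP/libbpf today):

       uint32_t meta_len = xsk_desc_meta_len(desc);   /* hypothetical */

       if (meta_len >= sizeof *md) {
           md = (struct xdp_md_desc *) (pkt - meta_len);
           /* ... interpret the fields according to the btf_id found
            * in the meta-data area ... */
       }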



>          index = addr >> FRAME_SHIFT;
>          xpacket = &umem->xpool.array[index];
>          packet = &xpacket->packet;
> @@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>                              OVS_XDP_HEADROOM);
>          dp_packet_set_size(packet, len);
>  
> +        /* FIXME: This should be done by detecting whether
> +         * XDP MD is enabled or not. Ex:
> +         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
> +         */
> +        dp_packet_set_rss_hash(packet, md->hash32);

The general idea to make this code more dynamic is to let AF_XDP
provide the meta-data-len and then use the (proposed) btf_id to know
what "callback" C-code to call.

Using a btf_id makes this independent of the NIC driver.  And we can
tell the difference between two BPF structs that contain the same info
but at different offsets.  (I'll leave out the details, but I can
explain more if there is interest.)

The btf_id also helps when the NIC reconfigures the meta-data contents
while there are packets in flight.
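
Very roughly, and assuming the rx path has already pulled a btf_id out of
the meta-data area and resolved the driver's IDs once at setup time (none
of these identifiers exist today, this is only a sketch):

       if (md_btf_id == known_rxhash_layout_id) {   /* resolved at setup */
           dp_packet_set_rss_hash(packet, md->hash32);
       } else {
           /* Unknown layout, e.g. the NIC was reconfigured with packets
            * still in flight: leave the packet untouched and let
            * dpif-netdev fall back to the software 5-tuple hash. */
       }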


> +
>          /* Add packet into batch, increase batch->count. */
>          dp_packet_batch_add(batch, packet);
>  



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


