[ovs-dev] [PATCH] RFC: netdev-afxdp: Support for XDP metadata HW hints.

Jesper Dangaard Brouer brouer at redhat.com
Fri Mar 5 15:07:19 UTC 2021


On Fri, 5 Mar 2021 14:56:02 +0100
Björn Töpel <bjorn.topel at intel.com> wrote:

> On 2021-03-05 11:13, Jesper Dangaard Brouer wrote:
> > 
> > Bjørn and Magnus, please take a look at my questions inlined below.
> > 
> > On Thu,  4 Mar 2021 10:27:05 -0800
> > William Tu <u9012063 at gmail.com> wrote:
> >   
> >> One big problem of netdev-afxdp is that there is no metadata support
> >> from the hardware at all.  For example, OVS netdev-afxdp has to compute
> >> the RX hash or TCP checksum in software, resulting in high performance
> >> overhead.
> >>
> >> A generic metadata type for XDP frames using BTF has been proposed[1],
> >> and there is a sample implementation[2][3].  This patch experiments with
> >> enabling the XDP metadata, also called HW hints, and shows the potential
> >> performance improvement.  The patch uses only the rxhash value provided
> >> by the HW, avoiding the hash calculation in lib/dpif-netdev.c:
> >>      if (!dp_packet_rss_valid(execute->packet)) {
> >>          dp_packet_set_rss_hash(execute->packet,
> >>                                 flow_hash_5tuple(execute->flow, 0));
> >>      }
> >>
> >> Using '$ ovs-appctl dpif-netdev/pmd-stats-show', the 'avg processing
> >> cycles per packet' drops from 402 to 272.  More details below.
> >>
> >> Reference:
> >> ----------
> >> [1] https://www.kernel.org/doc/html/latest/bpf/btf.html
> >> [2] https://netdevconf.info/0x14/pub/slides/54/[1]%20XDP%20meta%20data%20acceleration.pdf
> >> [3] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=topic/xdp_metadata4
> >>
> >> Testbed:
> >> --------
> >> Two Xeon E5-2620 v3 2.4GHz machines connected back-to-back using Mellanox
> >> ConnectX-6Dx 25GbE NICs.  Before starting OVS, enable the metadata (MD) by:
> >> $ bpftool net xdp show
> >> xdp:
> >> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(0)
> >> enp2s0f1np1(5) md_btf_id(2) md_btf_enabled(0)
> >> $ bpftool net xdp set dev enp2s0f0np0 md_btf on
> >> $ bpftool net xdp
> >> xdp:
> >> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(1)
> >>
> >> Limitations/TODO:
> >> -----------------
> >> 1. Supports only AF_XDP native mode, not zero-copy mode.
> >> 2. Currently only three fields (vlan, hash, and flow_mark), and only the
> >>    receive side supports XDP metadata.
> >> 3. The control plane (how to enable and probe the structure) is not
> >>    upstream yet.
> >>
> >> OVS rxdrop without HW hints:
> >> ---------------------------
> >> Drop rate: 4.8Mpps
> >>
> >> pmd thread numa_id 0 core_id 3:
> >>    packets received: 196592006
> >>    packet recirculations: 0
> >>    avg. datapath passes per packet: 1.00
> >>    emc hits: 196592006
> >>    smc hits: 0
> >>    megaflow hits: 0
> >>    avg. subtable lookups per megaflow hit: 0.00
> >>    miss with success upcall: 0
> >>    miss with failed upcall: 0
> >>    avg. packets per output batch: 0.00
> >>    idle cycles: 56009063835 (41.43%)
> >>    processing cycles: 79164971931 (58.57%)
> >>    avg cycles per packet: 687.59 (135174035766/196592006)
> >>    avg processing cycles per packet: 402.69 (79164971931/196592006)
> >>
> >> pmd thread numa_id 0 core_id 3:
> >>    Iterations:           339607649  (0.23 us/it)
> >>    - Used TSC cycles: 188620512777  ( 99.9 % of total cycles)
> >>    - idle iterations:    330697002  ( 40.3 % of used cycles)
> >>    - busy iterations:      8910647  ( 59.7 % of used cycles)
> >>    Rx packets:           285140031  (3624 Kpps, 395 cycles/pkt)
> >>    Datapath passes:      285140031  (1.00 passes/pkt)
> >>    - EMC hits:           285139999  (100.0 %)
> >>    - SMC hits:                   0  (  0.0 %)
> >>    - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
> >>    - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
> >>    - Lost upcalls:               0  (  0.0 %)
> >>    Tx packets:                   0
> >>
> >> Perf report:
> >>    17.56%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
> >>    14.39%  pmd-c03/id:11  ovs-vswitchd        [.] dp_netdev_process_rxq_port
> >>    14.17%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_thread_main
> >>    10.86%  pmd-c03/id:11  [vdso]              [.] __vdso_clock_gettime
> >>    10.19%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_end_iteration
> >>     7.71%  pmd-c03/id:11  ovs-vswitchd        [.] time_timespec__
> >>     5.64%  pmd-c03/id:11  ovs-vswitchd        [.] time_usec
> >>     3.88%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_get_class
> >>     2.95%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_rxq_recv
> >>     2.78%  pmd-c03/id:11  libbpf.so.0.2.0     [.] xsk_socket__fd
> >>     2.74%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_start_iteration
> >>     2.11%  pmd-c03/id:11  libc-2.27.so        [.] __clock_gettime
> >>     1.32%  pmd-c03/id:11  ovs-vswitchd        [.] xsk_socket__fd at plt
> >>
> >> OVS rxdrop with HW hints:
> >> -------------------------
> >> rxdrop rate: 4.73Mpps
> >>
> >> pmd thread numa_id 0 core_id 7:
> >>    packets received: 13686880
> >>    packet recirculations: 0
> >>    avg. datapath passes per packet: 1.00
> >>    emc hits: 13686880
> >>    smc hits: 0
> >>    megaflow hits: 0
> >>    avg. subtable lookups per megaflow hit: 0.00
> >>    miss with success upcall: 0
> >>    miss with failed upcall: 0
> >>    avg. packets per output batch: 0.00
> >>    idle cycles: 3182105544 (46.02%)
> >>    processing cycles: 3732023844 (53.98%)
> >>    avg cycles per packet: 505.16 (6914129388/13686880)
> >>    avg processing cycles per packet: 272.67 (3732023844/13686880)
> >>
> >> pmd thread numa_id 0 core_id 7:
> >>
> >>    Iterations:           392909539  (0.18 us/it)
> >>    - Used TSC cycles: 167697342678  ( 99.9 % of total cycles)
> >>    - idle iterations:    382539861  ( 46.0 % of used cycles)
> >>    - busy iterations:     10369678  ( 54.0 % of used cycles)
> >>    Rx packets:           331829656  (4743 Kpps, 273 cycles/pkt)
> >>    Datapath passes:      331829656  (1.00 passes/pkt)
> >>    - EMC hits:           331829656  (100.0 %)
> >>    - SMC hits:                   0  (  0.0 %)
> >>    - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
> >>    - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
> >>    - Lost upcalls:               0  (  0.0 %)
> >>    Tx packets:                   0
> >>
> >> Perf record/report:
> >>    22.96%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
> >>    10.43%  pmd-c07/id:8  ovs-vswitchd        [.] miniflow_extract
> >>     7.20%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_init__
> >>     7.00%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_input__
> >>     6.79%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
> >>     6.62%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_thread_main
> >>     5.65%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
> >>     5.04%  pmd-c07/id:8  [vdso]              [.] __vdso_clock_gettime
> >>     3.60%  pmd-c07/id:8  ovs-vswitchd        [.] time_timespec__
> >>     3.10%  pmd-c07/id:8  ovs-vswitchd        [.] umem_elem_push
> >>     2.74%  pmd-c07/id:8  libc-2.27.so        [.] __memcmp_avx2_movbe
> >>     2.62%  pmd-c07/id:8  ovs-vswitchd        [.] time_usec
> >>     2.14%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_use_afxdp
> >>     1.58%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_rxq_recv
> >>     1.47%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_get_class
> >>     1.34%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_start_iteration
> >>
> >> Signed-off-by: William Tu <u9012063 at gmail.com>
> >> ---
> >>   lib/netdev-afxdp.c | 19 +++++++++++++++++++
> >>   1 file changed, 19 insertions(+)
> >>
> >> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
> >> index 482400d8d135..49881a8cc0cb 100644
> >> --- a/lib/netdev-afxdp.c
> >> +++ b/lib/netdev-afxdp.c
> >> @@ -169,6 +169,17 @@ struct netdev_afxdp_tx_lock {
> >>       );
> >>   };
> >>   
> >> +/* FIXME:
> >> + * This should be done dynamically by querying the device's
> >> + * XDP metadata structure. Ex:
> >> + *   $ bpftool net xdp md_btf cstyle dev enp2s0f0np0
> >> + */
> >> +struct xdp_md_desc {  
> > 
> > Adding:
> >         uint32_t btf_id;
> > 
> > would be valuable IMHO.
> >   
> >> +    uint32_t flow_mark;
> >> +    uint32_t hash32;
> >> +    uint16_t vlan;
> >> +};
> >> +
> >>   #ifdef HAVE_XDP_NEED_WAKEUP
> >>   static inline void
> >>   xsk_rx_wakeup_if_needed(struct xsk_umem_info *umem,
> >> @@ -849,6 +860,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >>           struct dp_packet_afxdp *xpacket;
> >>           const struct xdp_desc *desc;
> >>           struct dp_packet *packet;
> >> +        struct xdp_md_desc *md;
> >>           uint64_t addr, index;
> >>           uint32_t len;
> >>           char *pkt;
> >> @@ -858,6 +870,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >>           len = desc->len;
> >>   
> >>           pkt = xsk_umem__get_data(umem->buffer, addr);
> >> +        md = pkt - sizeof *md;  
> > 
> > We also want this sizeof-offset to be dynamic.
> > And I hope that AF_XDP could provide this info on the size of the
> > data_meta area.  I need some help from Magnus and Bjørn here.
> > 
> > Looking at the kernel-side code of AF_XDP/XSK, I don't see this being
> > transferred.  Is this true?  Can we add the info?
> >  
> 
> XDP sockets make sure that the meta data is propagated to userland.
> However, the meta data *size* is not passed via the AF_XDP descriptors.
> 
> This was by design; the metadata is a contract between the XDP program
> and the XDP socket, and in the future maybe the driver as well.
> 
> So, we're simply stating that "your AF_XDP application should know if
> there's meta data or not, and how it's structured".
> 
> I guess we could explore starting to use the options bits of struct
> xdp_desc for this, but only if we're *really* seeing a need for it.
> 
> If the XDP program would like to pass its length, it can do that via
> the metadata area as well.
> 
> struct meta {
>    int len;

I'm not interested in having the len as part of the metadata.  The
btf_id will basically/indirectly give us the length.

>    ...
>    ...
> };
> 
> 
> > 
> >   
> >>           index = addr >> FRAME_SHIFT;
> >>           xpacket = &umem->xpool.array[index];
> >>           packet = &xpacket->packet;
> >> @@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >>                               OVS_XDP_HEADROOM);
> >>           dp_packet_set_size(packet, len);
> >>   
> >> +        /* FIXME: This should be done by detecting whether
> >> +         * XDP MD is enabled or not. Ex:
> >> +         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
> >> +         */
> >> +        dp_packet_set_rss_hash(packet, md->hash32);  
> > 
> > The general idea for making this code more dynamic is to let AF_XDP
> > provide the metadata length and then use the (proposed) btf_id to know
> > which "callback" C-code to call.
> >  
> 
> Hmm, again, here I might have a bit of a different view. Personally I
> think it's better that all of the "NIC/XDP/socket" negotiation happens in
> the control/setup phase, keeping the fast path lean. Instead of:
> 
> recv(void *data)
> {
>     len = get_meta_data_len(data);
>     if (len) {
>        // check this, check that, if-else-if-else
>     }
>     ...
> }
> 
> You'd like to use the length of the metadata to express whether it's
> available or not. I'm stating that we don't need that for AF_XDP. In XDP,
> if we're accessing the field, we need to validate the access due to the
> verifier, but that's not about knowing whether it's actually there. When
> we attach a certain XDP program to a device, we can query what offloads
> it supports, and then the application knows that the metadata is
> structured in a certain way.
> 
> One application might have simple metadata:
> struct tstamp { u64 rx; };
> 
> That application will unconditionally read
> ((struct tstamp *)(pkt - sizeof(struct tstamp)))->rx;
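
Concretely, that fixed contract gets consumed along these lines (a
minimal sketch only, assuming the XDP program always writes a struct
tstamp directly in front of the packet data; the names are illustrative):

 #include <stdint.h>

 struct tstamp { uint64_t rx; };

 /* 'pkt' is the frame start returned by xsk_umem__get_data(); the
  * metadata area is assumed to end right before it. */
 static inline uint64_t
 rx_timestamp(const char *pkt)
 {
     const struct tstamp *ts = (const struct tstamp *)(pkt - sizeof *ts);

     return ts->rx;
 }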

I imagine that NIC hardware will/can provide packets with different
metadata.  Timestamps are a good example, as some drivers/HW only
support timestamping PTP packets.  Thus, I don't want to provide
metadata for tstamp if the HW didn't provide it (and I also don't want
to spend CPU cycles on clearing the u64 field with zero).  Another
example is VLAN IDs, which are sometimes provided by hardware.

Thus, on a per-packet basis the metadata can change.

> Some might be more flexible and have:
> struct metadata { char storage[64]; int btf_id; };

I really like the idea of placing the btf_id as the last entry; that
way we know its location.  I can get the btf_id at the packet pointer
minus sizeof(int).
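
A minimal sketch of that lookup (illustrative only; it assumes the last
metadata member is a 32-bit BTF id stored directly in front of the
packet data, as in the struct above):

 #include <stdint.h>
 #include <string.h>

 /* 'pkt' is the packet data pointer; whatever the total metadata size
  * is, its last member is the btf_id, ending right at 'pkt'. */
 static inline uint32_t
 md_btf_id(const char *pkt)
 {
     uint32_t id;

     memcpy(&id, pkt - sizeof id, sizeof id);
     return id;
 }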

>
> If you need an "availability check", then you can solve that.
> 
> int xdp_prog(struct xdp_md *ctx)
> {
> 	struct metadata *m = (void *)(long)ctx->data_meta;
> 	void *data = (void *)(unsigned long)ctx->data;
> 	int ret;
> 
> 	if ((void *)(m + 1) > data) {
> 		/* No metadata present yet; make room in front of the data. */
> 		ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(*m));
> 		if (ret < 0)
> 			return XDP_ABORTED;
> 		data = (void *)(unsigned long)ctx->data;
> 		m = (void *)(long)ctx->data_meta;
> 		if ((void *)(m + 1) > data)
> 			return XDP_ABORTED;
> 		m->btf_id = -1; // not valid
> 	}
> 	// ... bpf_redirect_map() to socket
> }
> 
> Would you agree? Or maybe expand a bit on why you need the length.

With the btf_id as the last entry, I basically don't need the length.

The dispatcher function will, based on the btf_id, know the size of the
struct.  The C-code will basically type-cast the struct at the right
pointer offset and get access to the metadata at the known offsets.

> 
> > Using a btf_id make this independent of the NIC driver.  

E.g. we don't need to dereference the NIC driver struct to find out
what metadata this driver has.  Instead this is given directly by the
btf_id, and two drivers can use the same btf_id.


> > And we can
> > tell the difference between two BPF structs that contain the same info
> > but at different offsets. (I'll leave out the details, but I can
> > explain more if there is interest).
> >  

BTF sort of makes the struct names identifiers and an API, and gives
*BPF* programs the ability to adjust offsets when loading the BPF
program into the kernel.

Driver-1 chooses:

 struct xdp_md_desc {  
     uint32_t flow_mark;
     uint32_t hash32;
 } __attribute__((preserve_access_index));

Driver-2 chooses:

 struct xdp_md_desc {  
     uint32_t hash32;
     uint32_t flow_mark;
 } __attribute__((preserve_access_index));

The BPF code can access the members, and the kernel will remap the
struct accesses, internally creating two different BPF programs with
adjusted offsets in the byte-code.

How do we handle this in the OVS userspace C-code?

Maybe/hopefully you have a better idea than mine, which is simply to
compile two C-programs (or one program for each known BTF id) and
dispatch based on the BTF id.
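
To make that concrete, a rough sketch of what I mean (the BTF ids are
invented and would really be discovered at setup time, e.g. via bpftool
or the control plane, and the structs assume the btf_id is appended as
the last member as suggested earlier):

 #include <stdbool.h>
 #include <stdint.h>

 /* Invented ids for the two driver layouts above. */
 enum { MD_BTF_ID_DRIVER1 = 1, MD_BTF_ID_DRIVER2 = 2 };

 /* Userspace mirrors of the two layouts, with the btf_id appended as
  * the last member so that each struct ends at the packet data. */
 struct md_driver1 { uint32_t flow_mark; uint32_t hash32; uint32_t btf_id; };
 struct md_driver2 { uint32_t hash32; uint32_t flow_mark; uint32_t btf_id; };

 /* Dispatch on the btf_id read from the metadata area; each known id
  * maps to a type-cast compiled for that layout. */
 static bool
 md_rx_hash(const char *pkt, uint32_t btf_id, uint32_t *hash)
 {
     switch (btf_id) {
     case MD_BTF_ID_DRIVER1: {
         const struct md_driver1 *md = (const void *)(pkt - sizeof *md);
         *hash = md->hash32;
         return true;
     }
     case MD_BTF_ID_DRIVER2: {
         const struct md_driver2 *md = (const void *)(pkt - sizeof *md);
         *hash = md->hash32;
         return true;
     }
     default:
         return false;    /* Unknown id: fall back to the software hash. */
     }
 }

In netdev_afxdp_rxq_recv() the caller would then only call
dp_packet_set_rss_hash() when such a lookup succeeds, and otherwise
leave the existing software-hash path untouched.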


> Details, please! :-D Maybe I'm just not following your ideas!

I hope the details above helped.
 
 
> > The btf_id also helps when the NIC reconfigures the metadata contents
> > and there are packets in flight.

This is another reason why the C-code needs to handle metadata that can
change at runtime, as I don't want to freeze config changes to the NIC.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


