[ovs-dev] [PATCH] RFC: netdev-afxdp: Support for XDP metadata HW hints.

Björn Töpel bjorn.topel at intel.com
Fri Mar 5 13:56:02 UTC 2021


On 2021-03-05 11:13, Jesper Dangaard Brouer wrote:
> 
> Bjørn and Magnus, please take a look at my questions inline below.
> 
> On Thu,  4 Mar 2021 10:27:05 -0800
> William Tu <u9012063 at gmail.com> wrote:
> 
>> One big problem with netdev-afxdp is that there is no metadata support
>> from the hardware at all.  For example, OVS netdev-afxdp has to compute the
>> RSS hash and TCP checksum in software, resulting in high performance overhead.
>>
>> A generic metadata type for XDP frames using BTF has been proposed[1] and
>> there is a sample implementation[2][3].  This patch experiments with enabling
>> the XDP metadata, also called HW hints, and shows the potential performance
>> improvement.  The patch uses only the rxhash value provided by the HW,
>> avoiding the hash calculation in lib/dpif-netdev.c:
>>      if (!dp_packet_rss_valid(execute->packet)) {
>>          dp_packet_set_rss_hash(execute->packet,
>>                                 flow_hash_5tuple(execute->flow, 0));
>>      }
>>
>> Measured with '$ ovs-appctl dpif-netdev/pmd-stats-show', the 'avg processing
>> cycles per packet' drops from 402 to 272.  More details below.
>>
>> Reference:
>> ----------
>> [1] https://www.kernel.org/doc/html/latest/bpf/btf.html
>> [2] https://netdevconf.info/0x14/pub/slides/54/[1]%20XDP%20meta%20data%20acceleration.pdf
>> [3] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=topic/xdp_metadata4
>>
>> Testbed:
>> --------
>> Two Xeon E5-2620 v3 2.4GHz connected back-to-back using Mellanox
>> ConnectX-6Dx 25GbE. Before starting OVS, enable the MD by:
>> $ bpftool net xdp show
>> xdp:
>> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(0)
>> enp2s0f1np1(5) md_btf_id(2) md_btf_enabled(0)
>> $ bpftool net xdp set dev enp2s0f0np0 md_btf on
>> $ bpftool net xdp
>> xdp:
>> enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(1)
>>
>> Limitations/TODO:
>> -----------------
>> 1. Supports only AF_XDP native mode, not zero-copy mode.
>> 2. Currently only three fields are supported (vlan, hash, and flow_mark),
>>     and only the receive side supports XDP metadata.
>> 3. The control plane (how to enable and probe the structure) is not
>>     upstream yet.
>>
>> OVS rxdrop without HW hints:
>> ---------------------------
>> Drop rate: 4.8Mpps
>>
>> pmd thread numa_id 0 core_id 3:
>>    packets received: 196592006
>>    packet recirculations: 0
>>    avg. datapath passes per packet: 1.00
>>    emc hits: 196592006
>>    smc hits: 0
>>    megaflow hits: 0
>>    avg. subtable lookups per megaflow hit: 0.00
>>    miss with success upcall: 0
>>    miss with failed upcall: 0
>>    avg. packets per output batch: 0.00
>>    idle cycles: 56009063835 (41.43%)
>>    processing cycles: 79164971931 (58.57%)
>>    avg cycles per packet: 687.59 (135174035766/196592006)
>>    avg processing cycles per packet: 402.69 (79164971931/196592006)
>>
>> pmd thread numa_id 0 core_id 3:
>>    Iterations:           339607649  (0.23 us/it)
>>    - Used TSC cycles: 188620512777  ( 99.9 % of total cycles)
>>    - idle iterations:    330697002  ( 40.3 % of used cycles)
>>    - busy iterations:      8910647  ( 59.7 % of used cycles)
>>    Rx packets:           285140031  (3624 Kpps, 395 cycles/pkt)
>>    Datapath passes:      285140031  (1.00 passes/pkt)
>>    - EMC hits:           285139999  (100.0 %)
>>    - SMC hits:                   0  (  0.0 %)
>>    - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
>>    - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
>>    - Lost upcalls:               0  (  0.0 %)
>>    Tx packets:                   0
>>
>> Perf report:
>>    17.56%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>>    14.39%  pmd-c03/id:11  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>>    14.17%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_thread_main
>>    10.86%  pmd-c03/id:11  [vdso]              [.] __vdso_clock_gettime
>>    10.19%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_end_iteration
>>     7.71%  pmd-c03/id:11  ovs-vswitchd        [.] time_timespec__
>>     5.64%  pmd-c03/id:11  ovs-vswitchd        [.] time_usec
>>     3.88%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_get_class
>>     2.95%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_rxq_recv
>>     2.78%  pmd-c03/id:11  libbpf.so.0.2.0     [.] xsk_socket__fd
>>     2.74%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_start_iteration
>>     2.11%  pmd-c03/id:11  libc-2.27.so        [.] __clock_gettime
>>     1.32%  pmd-c03/id:11  ovs-vswitchd        [.] xsk_socket__fd@plt
>>
>> OVS rxdrop with HW hints:
>> -------------------------
>> rxdrop rate: 4.73Mpps
>>
>> pmd thread numa_id 0 core_id 7:
>>    packets received: 13686880
>>    packet recirculations: 0
>>    avg. datapath passes per packet: 1.00
>>    emc hits: 13686880
>>    smc hits: 0
>>    megaflow hits: 0
>>    avg. subtable lookups per megaflow hit: 0.00
>>    miss with success upcall: 0
>>    miss with failed upcall: 0
>>    avg. packets per output batch: 0.00
>>    idle cycles: 3182105544 (46.02%)
>>    processing cycles: 3732023844 (53.98%)
>>    avg cycles per packet: 505.16 (6914129388/13686880)
>>    avg processing cycles per packet: 272.67 (3732023844/13686880)
>>
>> pmd thread numa_id 0 core_id 7:
>>
>>    Iterations:           392909539  (0.18 us/it)
>>    - Used TSC cycles: 167697342678  ( 99.9 % of total cycles)
>>    - idle iterations:    382539861  ( 46.0 % of used cycles)
>>    - busy iterations:     10369678  ( 54.0 % of used cycles)
>>    Rx packets:           331829656  (4743 Kpps, 273 cycles/pkt)
>>    Datapath passes:      331829656  (1.00 passes/pkt)
>>    - EMC hits:           331829656  (100.0 %)
>>    - SMC hits:                   0  (  0.0 %)
>>    - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
>>    - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
>>    - Lost upcalls:               0  (  0.0 %)
>>    Tx packets:                   0
>>
>> Perf record/report:
>>    22.96%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
>>    10.43%  pmd-c07/id:8  ovs-vswitchd        [.] miniflow_extract
>>     7.20%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_init__
>>     7.00%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_input__
>>     6.79%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>>     6.62%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_thread_main
>>     5.65%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
>>     5.04%  pmd-c07/id:8  [vdso]              [.] __vdso_clock_gettime
>>     3.60%  pmd-c07/id:8  ovs-vswitchd        [.] time_timespec__
>>     3.10%  pmd-c07/id:8  ovs-vswitchd        [.] umem_elem_push
>>     2.74%  pmd-c07/id:8  libc-2.27.so        [.] __memcmp_avx2_movbe
>>     2.62%  pmd-c07/id:8  ovs-vswitchd        [.] time_usec
>>     2.14%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_use_afxdp
>>     1.58%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_rxq_recv
>>     1.47%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_get_class
>>     1.34%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_start_iteration
>>
>> Signed-off-by: William Tu <u9012063 at gmail.com>
>> ---
>>   lib/netdev-afxdp.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
>>
>> diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
>> index 482400d8d135..49881a8cc0cb 100644
>> --- a/lib/netdev-afxdp.c
>> +++ b/lib/netdev-afxdp.c
>> @@ -169,6 +169,17 @@ struct netdev_afxdp_tx_lock {
>>       );
>>   };
>>   
>> +/* FIXME:
>> + * This should be done dynamically by querying the device's
>> + * XDP metadata structure. Ex:
>> + *   $ bpftool net xdp md_btf cstyle dev enp2s0f0np0
>> + */
>> +struct xdp_md_desc {
> 
> Adding:
>         uint32_t btf_id;
> 
> Would be valuable IMHO.
> 
>> +    uint32_t flow_mark;
>> +    uint32_t hash32;
>> +    uint16_t vlan;
>> +};
>> +
>>   #ifdef HAVE_XDP_NEED_WAKEUP
>>   static inline void
>>   xsk_rx_wakeup_if_needed(struct xsk_umem_info *umem,
>> @@ -849,6 +860,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>           struct dp_packet_afxdp *xpacket;
>>           const struct xdp_desc *desc;
>>           struct dp_packet *packet;
>> +        struct xdp_md_desc *md;
>>           uint64_t addr, index;
>>           uint32_t len;
>>           char *pkt;
>> @@ -858,6 +870,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>           len = desc->len;
>>   
>>           pkt = xsk_umem__get_data(umem->buffer, addr);
>> +        md = pkt - sizeof *md;
> 
> We also want this sizeof-offset to be dynamic.
> And I hope that AF_XDP could provide this info on the size of the
> data_meta area.  I need some help from Magnus and Bjørn here.
> 
> Looking at the kernel-side code of AF_XDP/XSK, I don't see this being
> transferred. Is this true? Can we add the info?
>

XDP sockets make sure that the meta data is propagated to userland.
However, the meta data *size* is not passed via the AF_XDP descriptors.

This was by design; the metadata is a contract between the XDP program
and the XDP socket, and in the future maybe the driver as well.

So, we're simply stating that "your AF_XDP application should know if
there's meta data or not, and how it's structured".

I guess we could explore starting to use the options bits of struct
xdp_desc for this, but only if we're *really* seeing a need for it.

If the XDP program would like to pass its length, it can do that via
the metadata area as well.

struct meta {
   int len;
   ...
   ...
};
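
To make that concrete, a rough sketch of the consumer side (pkt_meta() is a
made-up helper; it assumes the fixed-size struct meta above sits right
before the packet data returned by xsk_umem__get_data()):

/* Rough sketch: the XDP program wrote struct meta immediately before
 * the packet data and filled in "len", so the consumer can use it as
 * an "is metadata present and of the expected size" check. */
static inline const struct meta *
pkt_meta(const char *pkt)
{
    const struct meta *m = (const struct meta *)(pkt - sizeof *m);

    return m->len == (int) sizeof *m ? m : NULL;
}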


> 
> 
>>           index = addr >> FRAME_SHIFT;
>>           xpacket = &umem->xpool.array[index];
>>           packet = &xpacket->packet;
>> @@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>                               OVS_XDP_HEADROOM);
>>           dp_packet_set_size(packet, len);
>>   
>> +        /* FIXME: This should be done by detecting whether
>> +         * XDP MD is enabled or not. Ex:
>> +         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
>> +         */
>> +        dp_packet_set_rss_hash(packet, md->hash32);
> 
> The general idea to make this code more dynamic is to let AF_XDP
> provide the meta-data-len and then use the (proposed) btf_id to know
> what "callback" C-code to call.
>

Hmm, again, here I might have a bit of a different view. Personally I
think it's better that all of the "NIC/XDP/socket" negotiation is done in
the control/setup phase, keeping the fast path lean. Instead of:

recv(void *data)
{
    len = get_meta_data_len(data);
    if (len) {
       // check this, check that, if-else-if-else
    }
    ...
}

You'd like to use the length of the metadata to express whether it's
available or not. I'm stating that we don't need that for AF_XDP. In XDP,
if we're accessing the field, we need to validate the access because of
the verifier, but that's not the same as knowing whether it's actually
there. When we attach a certain XDP program to a device, we can query what
offloads it supports, and then the application knows that the metadata is
structured in a certain way.

Some applications have simple metadata:
struct tstamp { u64 rx; };

Such an application will unconditionally look at
((struct tstamp *)(pkt - sizeof(struct tstamp)))->rx;
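
As a sketch, the whole "lean" fast path for that case is then just (helper
name made up):

/* The layout was negotiated at setup time, so read the timestamp
 * unconditionally at a fixed negative offset from the packet data. */
static inline uint64_t
pkt_rx_tstamp(const char *pkt)
{
    const struct tstamp *ts = (const struct tstamp *)(pkt - sizeof *ts);

    return ts->rx;
}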

Some might be more flexible and have:
struct metadata { char storage[64]; int btf_id; };

If you need an "availability check", you can solve it like this:

int xdp_prog(struct xdp_md *ctx)
{
        struct metadata *m = (void *)(long)ctx->data_meta;
        void *data = (void *)(long)ctx->data;
        int ret;

        if ((void *)(m + 1) > data) {
                // No metadata area yet; create one and mark it invalid.
                ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(*m));
                if (ret < 0)
                        return XDP_ABORTED;
                data = (void *)(long)ctx->data;
                m = (void *)(long)ctx->data_meta;
                if ((void *)(m + 1) > data)
                        return XDP_ABORTED;
                m->btf_id = -1; // not valid
        }
        // ... bpf_redirect_map() to socket
}
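
And the AF_XDP socket side would then do something like this (sketch; the
process_packet_*() helpers are made up):

/* Matching userspace check: struct metadata ends right before the
 * packet data, and btf_id tells us whether (and which) hints are there. */
static void
handle_rx(const char *pkt, uint32_t len)
{
    const struct metadata *m = (const struct metadata *)(pkt - sizeof *m);

    if (m->btf_id == -1) {
        /* The XDP program marked the area as not valid. */
        process_packet_without_hints(pkt, len);
        return;
    }
    /* Dispatch on the negotiated btf_id to the matching layout. */
    process_packet_with_hints(pkt, len, m);
}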

Would you agree? Or maybe expand a bit on why you need the length.



> Using a btf_id makes this independent of the NIC driver.  And we can
> tell the difference between two BPF structs that contain the same info
> but at different offsets. (I'll leave out the details, but I can
> explain more if there is interest.)
>

Details, please! :-D Maybe I'm just not following your ideas!


Cheers!
Björn

> The btf_id also helps when the NIC reconfigures the metadata contents
> and there are packets in flight.
>
> 
>> +
>>           /* Add packet into batch, increase batch->count. */
>>           dp_packet_batch_add(batch, packet);
>>   
> 
> 
> 

