[ovs-dev] [PATCH] RFC: netdev-afxdp: Support for XDP metadata HW hints.

Björn Töpel bjorn.topel at intel.com
Fri Mar 5 15:34:51 UTC 2021


On 2021-03-05 16:07, Jesper Dangaard Brouer wrote:
> On Fri, 5 Mar 2021 14:56:02 +0100
> Björn Töpel <bjorn.topel at intel.com> wrote:
> 
>> On 2021-03-05 11:13, Jesper Dangaard Brouer wrote:


[...]

>>> We also want this sizeof-offset to be dynamic.
>>> And I hope that AF_XDP could provide this info, i.e. the size of the
>>> data_meta area.  I need some help from Magnus and Bjørn here.
>>>
>>> Looking at the kernel-side code of AF_XDP/XSK, I don't see this being
>>> transferred. Is this true? Can we add the info?
>>>   
>>
>> XDP sockets make sure that the meta data is propagated to userland.
>> However, the meta data *size* is not passed via the AF_XDP descriptors.
>>
>> This was by design; the metadata is a contract between the XDP program
>> and the XDP socket, and in the future maybe the driver as well.
>>
>> So, we're simply stating that "your AF_XDP application should know if
>> there's meta data or not, and how it's structured".
>>
>> I guess we could explore starting to use the options bits of struct
>> xdp_desc for this, but only if we're *really* seeing a need for it.
>>
>> If the XDP program would like to pass its length, it can do that via
>> the metadata area as well.
>>
>> struct meta {
>>     int len;
> 
> I'm not interested in having the len as part of the metadata.  The
> btf_id will basically/indirectly give us the length.
>

Ok!


>>     ...
>>     ...
>> };
>>
>>
>>>
>>>    
>>>>            index = addr >> FRAME_SHIFT;
>>>>            xpacket = &umem->xpool.array[index];
>>>>            packet = &xpacket->packet;
>>>> @@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>>>>                                OVS_XDP_HEADROOM);
>>>>            dp_packet_set_size(packet, len);
>>>>    
>>>> +        /* FIXME: This should be done by detecting whether
>>>> +         * XDP MD is enabled or not. Ex:
>>>> +         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
>>>> +         */
>>>> +        dp_packet_set_rss_hash(packet, md->hash32);
>>>
>>> The general idea to make this code more dynamic is to let AF_XDP
>>> provide the meta-data-len and then use the (proposed) btf_id to know
>>> what "callback" C-code to call.
>>>   
>>
>> Hmm, again, here I might have a bit of a different view. Personally, I
>> think it's better that the NIC/XDP/socket negotiation is all done in
>> the control/setup phase, keeping the fast-path lean. Instead of:
>>
>> recv(void *data)
>> {
>>      len = get_meta_data_len(data);
>>      if (len) {
>>         // check this, check that, if-else-if-else
>>      }
>>      ...
>> }
>>
>> You'd like to use the length of the metadata to express whether it's
>> available or not. I'm stating that we don't need that for AF_XDP. In
>> XDP, if we're accessing the field, we need to validate the access due
>> to the verifier, but that's not to know whether it's actually there.
>> When we attach a certain XDP program to a device, we can query what
>> offloads it supports, and then the application knows that the metadata
>> is structured in a certain way.
>>
>> Some applications have simple metadata:
>> struct tstamp { u64 rx; };
>>
>> That application will unconditionally look at
>> ((struct tstamp *)(pkt - sizeof(struct tstamp)))->rx;
> 
> I imagine that NIC hardware will/can provide packets with different
> metadata.  Your example with timestamps is a good one, as some
> drivers/HW only support timestamping PTP packets.  Thus, I don't want
> to provide tstamp metadata if the HW didn't provide it (and I also
> don't want to spend CPU-cycles on clearing the u64 field to zero).
> Another example is VLAN IDs, which are sometimes provided by hardware.
> 
> Thus, on a per packet basis the metadata can change.
>

Yes, I agree. For a certain set of packets, there can be different
metadata per packet that, as you pointed out, can be dispatched to
different handlers.
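To make the per-packet case concrete, here is a minimal sketch of what
I picture on the XDP side. The MD_ID_* value, the struct, and the
program are made up for illustration; the btf_id goes last, which also
matches your point below:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical metadata layout; the btf_id is the last member so a
 * consumer can always find it at data - sizeof(__s32). */
#define MD_ID_TSTAMP 1

struct md_tstamp {
	__u64 rx_tstamp;
	__s32 btf_id;		/* always the last member */
};

SEC("xdp")
int stamp_md(struct xdp_md *ctx)
{
	struct md_tstamp *m;
	void *data;

	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*m)))
		return XDP_ABORTED;

	data = (void *)(long)ctx->data;
	m = (void *)(long)ctx->data_meta;
	if ((void *)(m + 1) > data)
		return XDP_ABORTED;

	m->rx_tstamp = bpf_ktime_get_ns(); /* stand-in for a HW tstamp */
	m->btf_id = MD_ID_TSTAMP;

	return XDP_PASS; /* or bpf_redirect_map() to the XSK */
}

char _license[] SEC("license") = "GPL";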

>> Some might be more flexible and have:
>> struct metadata { char storage[64]; int btf_id; };
> 
> I really like the idea of placing the btf_id as the last entry; that
> way we know its location.  I can get the btf_id at the packet pointer
> minus sizeof(int).
> 
>>
>> If you need an "availability check", then you can solve that.
>>
>> int xdp_prog(struct xdp_md *ctx)
>> {
>> 	struct metadata *m = (void *)(long)ctx->data_meta;
>> 	void *data = (void *)(unsigned long)ctx->data;
>>
>> 	if ((void *)(m + 1) > data) {
>> 		int ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(*m));
>>
>> 		if (ret < 0)
>> 			return XDP_ABORTED;
>> 		data = (void *)(unsigned long)ctx->data;
>> 		m = (void *)(long)ctx->data_meta;
>> 		if ((void *)(m + 1) > data)
>> 			return XDP_ABORTED;
>> 		m->btf_id = -1; // not valid
>> 	}
>> 	// ... bpf_redirect_map() to socket
>> }
>>
>> Would you agree? Or maybe expand a bit on why you need the length.
> 
> With the btf_id as the last entry, I basically don't need the length.
>

Ok! Good! :-)

> The dispatcher function will, based on the btf_id, know the size of
> the struct.  The C-code will basically type-cast the struct at the
> pointer offset and get access to the metadata at the known offsets.
>

Makes sense, and in line with my view!
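To spell out what I picture on the consumer side, a rough sketch; the
MD_ID_* values and the structs are hypothetical and would mirror
whatever the XDP program wrote:

#include <stdint.h>

#define MD_ID_TSTAMP 1
#define MD_ID_VLAN   2

struct md_tstamp { uint64_t rx_tstamp; int32_t btf_id; };
struct md_vlan   { uint32_t vlan_id;   int32_t btf_id; };

/* pkt points at the first byte of packet data; the metadata, if any,
 * sits immediately in front of it with the btf_id as its last member. */
static void dispatch_md(void *pkt)
{
	int32_t btf_id = *(int32_t *)((char *)pkt - sizeof(int32_t));

	switch (btf_id) {
	case MD_ID_TSTAMP: {
		struct md_tstamp *m = (void *)((char *)pkt - sizeof(*m));
		/* use m->rx_tstamp */
		break;
	}
	case MD_ID_VLAN: {
		struct md_vlan *m = (void *)((char *)pkt - sizeof(*m));
		/* use m->vlan_id */
		break;
	}
	default:
		/* -1 or unknown: no valid metadata */
		break;
	}
}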


>>
>>> Using a btf_id makes this independent of the NIC driver.
> 
> E.g. we don't need to dereference the NIC driver struct to find what
> metadata this driver has.  Instead this is given directly by the
> btf_id, and two drivers can use the same btf_id.
>

Yeah!


> 
>>> And we can
>>> tell the difference between two BPF structs that contain the same info
>>> but at different offsets. (I'll leave out the details, but I can
>>> explain more if there is interest).
>>>   
> 
> BTF sort of makes the struct names identifiers and an API.  And it
> gives *BPF* programs the ability to adjust offsets when loading the
> BPF-prog into the kernel.
> 
> Driver-1 choose:
> 
>   struct xdp_md_desc {
>       uint32_t flow_mark;
>       uint32_t hash32;
>   } __attribute__((preserve_access_index));
> 
> Driver-2 choose:
> 
>   struct xdp_md_desc {
>       uint32_t hash32;
>       uint32_t flow_mark;
>   } __attribute__((preserve_access_index));
> 
> The BPF code can access the members, and the kernel will remap the
> struct accesses, internally creating two different BPF programs with
> adjusted offsets in the byte-code.
> 
> How do we handle this in the OVS userspace C-code?
>

Longer-term, I think extending the linkers to support struct member
relocations would be a good thing, i.e. simply supporting
__attribute__((preserve_access_index)) in userland. That would enable a
very powerful userland/kernel interface, like what we're discussing now.

But I wonder if that's too big an undertaking as a first step? Maybe
non-relocatable structs could be a start.

I definitely agree that the goal should be relocatable structs!
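As a sketch of that non-relocatable starting point, I picture a single
header shared between the XDP program and the OVS userspace code,
e.g. (hypothetical, reusing your xdp_md_desc example):

/* xdp_md_hints.h -- hypothetical shared header that fixes the layout
 * at compile time. Both the XDP program and OVS include it, so any
 * layout change means recompiling both sides. */
#include <stdint.h>

struct xdp_md_desc {
	uint32_t flow_mark;
	uint32_t hash32;
	int32_t  btf_id;	/* last member, per the convention above */
};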

> Maybe/hopefully you have a better idea than mine, which is simply to
> compile two C-programs (or one program for each known BTF-id) and
> dispatch based on the BTF-id.
>

Not really. :-( A solution w/o __attribute__((preserve_access_index))
would mean that the application would need per-NIC structures, or a
run-time/recompile component.

I'm excited to see that William and you are experimenting in this
space. We need more "pasta on the wall" to see what sticks^Wworks.

> 
>> Details, please! :-D Maybe I'm just not following your ideas!
> 
> I hope the details above helped.
>

Yes. Thank you for the write-up!


Björn

>   
>>> The btf_id also helps when the NIC reconfigures the meta-data
>>> contents while there are packets in-flight.
> 
> This is another reason why the C-code needs to handle metadata that
> can change at runtime, as I don't want to freeze config changes to
> the NIC.
> 

