[ovs-dev] [RFC v2 PATCH 4/4] bpf: Add reference XDP program implementation for netdev-offload-xdp

Mon May 11 14:53:52 UTC 2020

On 2020/05/09 0:05, William Tu wrote:
> On Thu, May 7, 2020 at 7:56 AM Toshiaki Makita
> <toshiaki.makita1 at gmail.com> wrote:
>>
>> On 2020/05/06 0:18, William Tu wrote:
>>> On Tue, May 05, 2020 at 11:43:58AM +0900, Toshiaki Makita wrote:
>>>> On 2020/05/04 1:22, William Tu wrote:
>>>>> On Tue, Apr 21, 2020 at 11:47:04PM +0900, Toshiaki Makita wrote:
>>>>>> This adds a reference program, flowtable_afxdp.o, which can be used to
>>>>>> offload flows to XDP through netdev-offload-xdp.
>>>>>> The program will be compiled when using --enable-bpf switch.
>>>>>>
>>>>>> Signed-off-by: Toshiaki Makita <toshiaki.makita1 at gmail.com>
>>>>>
>>>>> Hi Toshiaki,
>>>>>
>>>>> Thanks for your patch, I haven't tried to run it but I have
>>>>> questions about miniflow.
>>>>>
>>>>>> ---
>>>> ...
>>>>>> +SEC("xdp") int
>>>>>> +flowtable_afxdp(struct xdp_md *ctx)
>>>>>> +{
>>>>>> +    struct xdp_miniflow *pkt_mf;
>>>>>> +    struct xdp_subtable_mask *subtable_mask;
>>>>>> +    int *head;
>>>>>> +    struct xdp_flow_actions *xdp_actions = NULL;
>>>>>> +    struct nlattr *a;
>>>>>> +    unsigned int left;
>>>>>> +    int cnt, idx, zero = 0;
>>>>>> +
>>>>>> +    account_debug(0);
>>>>>> +
>>>>>> +    head = bpf_map_lookup_elem(&subtbl_masks_hd, &zero);
>>>>>> +    if (!head) {
>>>>>> +        return XDP_ABORTED;
>>>>>> +    }
>>>>>> +    if (*head == XDP_SUBTABLES_TAIL) {
>>>>>> +        /* Offload not enabled */
>>>>>> +        goto upcall;
>>>>>> +    }
>>>>>> +
>>>>>> +    /* Get temporary storage for storing packet miniflow */
>>>>>> +    pkt_mf = bpf_map_lookup_elem(&pkt_mf_tbl, &zero);
>>>>>> +    if (!pkt_mf) {
>>>>>> +        return XDP_ABORTED;
>>>>>> +    }
>>>>>> +
>>>>>
>>>>> I start to wonder what's the benefit of using miniflow in XDP?
>>>>> miniflow tries to compress the large flow key into smaller memory,
>>>>> and with a flowmap indicating which offset in bits are valid.
>>>>> And when adding a new subtable at flow_put, the subtable's key
>>>>> size can be smaller with only the needed fields.
>>>>>
>>>>> But in the case of XDP/eBPF, we have to statically allocated fixed
>>>>> key size (the subtbl_template has key as struct xdp_flow_key), so
>>>>> each subtable is always having key with full key fields. (not saving
>>>>> memory space) Moreover with miniflow, it makes the 'mask_key' function
>>>>> below pretty complicated.
>>>>
>>>> Fixed sized subtable is restriction of map-in-map.
>>>> The benefit to use miniflow I envisioned initially was
>>>> - compress the key
>>>> - use existing function to convert flows for XDP
>>>>
>>>> The first one is actually not doable due to map-in-map. I hope someday the
>>>> restriction gets loosen...
>>>
>>> On my second thought this might be a good idea.
>>>
>>> One problem I hit in my previous prototype is that the flow key is too big.
>>> ex: struct sw_flow_key in ovs kernel is more than 500B, and moving it to bpf
>>> some BPF limitation. If we continue adding more fields to struct xdp_flow,
>>> ex: ipv6, vxlan tunnel, then with miniflow extract, the actually memory usage
>>> in key is smaller. The key size in subtbl_template can be a value smaller the
>>> the size of struct xdp_flow.
>>
>> Nice idea!
>> So you mean we can have less sized subtbl_template key because when adding more and
>> more keys, we probably don't use *all* of keys at the same time?
> 
> Now I'm a little confused.
> If you see 'struct flow' in ovs, all the fields have its own space
> (ex: no union is used for
> ipv4 and ipv6). So using miniflow extract saves space by only use
> fields on the packet header.
> But if you see 'struct sw_flow_key' in ovs kernel, same layer of
> protocols are using union,
> (ex: ipv4 and ipv6 are in a union). So no extra space is wasted.
> So how we define the struct xdp_flow_key makes difference.

Thanks, now I see what you mean.
Actually I was thinking of further aggressive way to minimize subtables keys.
As we don't necessarily need to support all of key combinations, we can cap
the key size at a certain value.
For example, one flow may need dl_* and ip_* and another flow may need
tunneling keys etc, but requiring all of them, i.e. dl, ip, tp, ct, tun, ...
and every other key, is unlikely. So just cap the key size at a reasonable value, 
and in case some flow requires more size, we can just return error like EOPNOTSUPP 
for such flow.

What do you think?

> 
> Or here is another idea.
> We can also do on-demand parsing/key_extract here.
> 1) keep a list of key_fields = {empty}
> 2) when xdp_flow_put, we know which fields are needed to extract
>      ex: key_fields = {src_mac, dst_mac}
> 3) then at the xdp_miniflow_extract we can skip/return early after
>      the L2 fields are parsed.

So in this case we need to keep track of all of keys used in any subtable, right?
I'm a bit concerned about complexity introduced by this mechanism.

>>> But when adding more fields, I'm not sure whether we will hit some limitation
>>> of BPF at xdp_miniflow_extract.
>>> So miniflow can save key size but complicate the key extraction.
>>> Without miniflow, key size is large but key extract and match are simpler.
>>> What do you think?
>>
>> I cannot think of any limitation xdp_miniflow_extract will hit. Do you have insns
>> limit in mind?
> yes.

As xdp_miniflow_extract does not handle loop, I'm not so concerned.

Toshiaki Makita

> 
>> My main concern for miniflow is insns limit in mask_key(), but if we can have
>> smaller key size for subtbl_template then insns will be less too.
>> Although I should check how large key is acceptable for mask_key, I guess miniflow
>> is better considering we have more keys in the future.
>>
>>>> The second one is doable and it's good for userspace offload driver.
>>>> But I ended up with rewriting most helper functions of miniflow for BPF due
>>>> to some restriction.
>>>>
>>>>> What if we don't compress the flow key?
>>>>> unlike to OVS userspace, more like the kernel's implementation of OVS
>>>>> megaflow (ex: masked_flow_lookup in net/openvswitch/flow_table.c)
>>>>
>>>> Without miniflow the BPF program will be simple like the original xdp_flow
>>>> patch for Linux kernel.
>>>> I can try it.
>>>>
>>>>> Where it has a list of masks (like your 'subtbl_masks' map) but
>>>>> only use 1 hash table (so we don't need map-in-map).
>>>>> We lookup the hash table multiple times, each
>>>>> time applied different masks in the mask list until finding a match.
>>>>
>>>> For HW offload one big hash table may help. Fow software it will be slower
>>>> as key size and entries are larger.
>>>> Maybe we can support both types, one big hash table and multiple subtables,
>>>> in the future?
>>>>
>>>
>>> For one big hash table, do you mean adding another cache before megaflow
>>> like EMC (exact match cache)?
>>
>> No, I meant replacing combination of current flow_table and multiple subtbls with
>> one big hash table, like ovs kernel module.
> 
> I see, thanks
> William
>