[ovs-dev] [PATCH v4 0/5] XDP offload using flow API provider

Toshiaki Makita toshiaki.makita1 at gmail.com
Mon Feb 22 14:44:17 UTC 2021


On 2021/02/16 10:49, William Tu wrote:
> On Tue, Feb 9, 2021 at 1:39 AM Toshiaki Makita
> <toshiaki.makita1 at gmail.com> wrote:
>>
>> On 2021/02/05 2:36, William Tu wrote:
>>> Hi Toshiaki,
>>>
>>> Thanks for the patch. I've been testing it for a couple of days.
>>> I liked it a lot! The compile and build process works without any issues.
>>
>> Hi, thank you for reviewing!
>> Sorry for taking so long to reply. It took a while to remember every detail of the patch set...
>>
>>> On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
>>> <toshiaki.makita1 at gmail.com> wrote:
>>>>
>>>> This patch adds an XDP-based flow cache using the OVS netdev-offload
>>>> flow API provider.  When XDP offload is enabled on an OVS device,
>>>> packets are first processed in the XDP flow cache (with parsing and
>>>> table lookup implemented in eBPF), and on a hit the actions are also
>>>> executed in the context of XDP, which has minimal overhead.
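To make the above concrete: the XDP-side fast path is roughly structured like
the sketch below.  Map and field names are only illustrative (this is not the
actual flowtable program), and the real key and actions are much richer.

/* Illustrative sketch of the XDP fast path, not the actual sources. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key {
    __u32 src_ip;
    __u32 dst_ip;
    __u8  proto;          /* the real key covers L2-L4 and more */
};

struct flow_actions {
    __u32 out_port;       /* the real value encodes a list of actions */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, struct flow_key);
    __type(value, struct flow_actions);
} flow_table SEC(".maps");   /* simplified: the real program uses map-in-map,
                              * one inner map per subtable (see below) */

struct {
    __uint(type, BPF_MAP_TYPE_DEVMAP);
    __uint(max_entries, 64);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} output_map SEC(".maps");

SEC("xdp")
int flowtable(struct xdp_md *ctx)
{
    struct flow_key key = {};

    /* 1. Parse the packet into a flow key (parsing omitted here). */
    /* 2. Look up the flow installed by vswitchd via the offload provider. */
    struct flow_actions *acts = bpf_map_lookup_elem(&flow_table, &key);
    if (!acts)
        return XDP_PASS;  /* miss: hand the packet to the slow path */

    /* 3. Execute the actions in XDP context; here just an output. */
    return bpf_redirect_map(&output_map, acts->out_port, 0);
}

char _license[] SEC("license") = "GPL";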
>>>>
>>>> This provider is built on top of William's recently posted patch for
>>>> loading custom XDP programs.  When a custom XDP program is loaded, the
>>>> provider detects whether the program supports the classifier, and if so
>>>> it starts offloading flows to the XDP program.
>>>>
>>>> The patches are derived from xdp_flow[1], which is a mechanism similar to
>>>> this but implemented in kernel.
>>>>
>>>>
>>>> * Motivation
>>>>
>>>> While the userspace datapath using netdev-afxdp or netdev-dpdk shows good
>>>> performance, there are use cases where packets are better processed in the
>>>> kernel, for example TCP/IP connections or container-to-container
>>>> connections.  The current solution is to use a tap device or af_packet,
>>>> with extra kernel-to/from-userspace overhead.  With XDP, a better solution
>>>> is to steer packets earlier, in the XDP program, and decide whether to send
>>>> them to the userspace datapath or keep them in the kernel.
>>>>
>>>> One problem with the current netdev-afxdp is that it forwards all packets
>>>> to userspace.  The first patch from William (netdev-afxdp: Enable loading
>>>> XDP program.) only provides the interface to load an XDP program; however,
>>>> users usually don't know how to write their own XDP program.
>>>>
>>>> XDP also supports HW-offload, so it may be possible to offload flows to
>>>> HW through this provider in the future, although not currently.
>>>> The reason is that map-in-map is required for our program to support a
>>>> classifier with subtables in XDP, but map-in-map is not offloadable.
>>>> If map-in-map becomes offloadable, HW-offload of our program may also
>>>> become possible.
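For reference, the subtable classifier is where map-in-map comes in: roughly
one inner hash map per megaflow mask, referenced from an outer map, as in the
sketch below (names are again illustrative; struct flow_key/flow_actions are
the same placeholders as in the earlier sketch).

/* Illustrative sketch only, not the actual sources. */
struct subtable {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, struct flow_key);        /* masked flow key */
    __type(value, struct flow_actions);
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
    __uint(max_entries, 16);             /* max number of subtables here */
    __type(key, __u32);
    __array(values, struct subtable);
} flow_tables SEC(".maps");

/* Lookup walks the subtables: fetch the i-th inner map with
 * bpf_map_lookup_elem(&flow_tables, &i), apply that subtable's mask to the
 * parsed key, and look the masked key up in the inner map.  The idea is
 * that vswitchd creates and installs inner maps as new masks appear.
 * This map-in-map part is what is not HW-offloadable today. */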
>>>
>>> I think HW-offloaded XDP is still too far away from meeting OVS's
>>> feature requirements.
>>
>> I don't know blockers other than map-in-map, but probably there are more.
>> If you can provide explicit blockers I can add them in the cover letter.
> 
> It's hard to list them when we don't have a full OVS datapath
> implemented in XDP.
> Here are a couple of things I can imagine. How would HW-offloaded XDP support:
> - AF_XDP sockets? The XSK map contains XSK fds; how would the offloaded
>    program pass the fd to the host kernel?
> - redirecting to another netdev?
> - helper functions such as adjust_head for pushing an outer header?

Thanks, this is helpful. Will add them.

>>> There is a research prototype here, FYI.
>>> https://www.usenix.org/conference/osdi20/presentation/brunella
>>
>> This is a presentation about FPGA, not HW offload to SmartNIC, right?
>>
> Yes, that's for offloading to FPGA.
> 
>>>>
>>>>
>>>> * How to use
>>>>
>>>> 1. Install clang/llvm >= 9, libbpf >= 0.0.6 (included in kernel 5.5), and
>>>>      kernel >= 5.3.
>>>>
>>>> 2. make with --enable-afxdp --enable-xdp-offload
>>>> --enable-bpf will generate XDP program "bpf/flowtable_afxdp.o".  Note that
>>>
>>> typo: I think you mean --enable-xdp-offload
>>
>> Thanks.
>>
>>>
>>>> the BPF object will not be installed anywhere by "make install" at this point.
>>>>
>>>> 3. Load custom XDP program
>>>> E.g.
>>>> $ ovs-vsctl add-port ovsbr0 veth0 -- set int veth0 options:xdp-mode=native \
>>>>     options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>>>> $ ovs-vsctl add-port ovsbr0 veth1 -- set int veth1 options:xdp-mode=native \
>>>>     options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>>>>
>>>> 4. Enable XDP_REDIRECT
>>>> If you use veth devices, make sure to load some (possibly dummy) XDP
>>>> programs on their peers. This patch set includes a program which does
>>>> nothing but return XDP_PASS. You can use it on the veth peer like
>>>> this:
>>>> $ ip link set veth1 xdpdrv object /path/to/ovs/bpf/xdp_noop.o section xdp
>>>
>>> I'd suggest not using "veth1" as an example, because in (3) above, people
>>> might think "veth1" is already attached to ovsbr0.
>>> IIUC, here your "veth1" should be the device at the peer inside
>>> another namespace.
>>
>> Sure, will rename it.
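FWIW, the dummy program is trivial; it is roughly equivalent to the sketch
below.  Loading any XDP program on the veth peer is enough for the peer to
accept redirected frames.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_noop(struct xdp_md *ctx)
{
    /* Do nothing; the program only needs to exist on the peer. */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";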
>>
>>>>
>>>> Some HW NIC drivers require as many queues as there are cores on the
>>>> system. Tweak the number of queues using "ethtool -L".
>>>>
>>>> 5. Enable hw-offload
>>>> $ ovs-vsctl set Open_vSwitch . other_config:offload-driver=linux_xdp
>>>> $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>>>> This starts offloading flows to the XDP program.
>>>>
>>>> You should be able to see some maps installed, including "debug_stats".
>>>> $ bpftool map
>>>>
>>>> If packets are successfully redirected by the XDP program,
>>>> debug_stats[2] will be counted.
>>>> $ bpftool map dump id <ID of debug_stats>
>>>>
>>>> Currently only very limited keys and output actions are supported.
>>>> For example, NORMAL action entries and IP-based matching work with the
>>>> current key support. VLAN actions used by port tags/trunks are also supported.
>>>>
>>>
>>> I don't know if this is too much to ask for.
>>> I wonder if you, or we together, could add at least tunnel support,
>>> e.g. vxlan?
>>> The current version is a good prototype for people to test an L2/L3
>>> XDP offload switch, but without a good use case it's hard to attract
>>> more people to contribute or use it.
>>
>> I think we have discussed this before.
>> Vxlan or other tunneling is indeed important, but that's not straightforward.
>> Push is easy, but pop is not. Pop requires two rules and recirculation.
>> Recirculation is highly likely to hit the eBPF 1M instruction limit.
> 
> Recirculation is pretty important. For example, connection tracking also
> relies on recirc. Can we break the datapath into multiple programs and use
> tail calls? For the recirc action, can we tail call the main eBPF program
> and let the packet go through parse/megaflow lookup/action again?

OK, will try using tail calls.
This will require vswitchd to load another bpf program for tail calls.
I guess such a program can be specified in the main bpf program's metadata.
I'll check if it works.
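Roughly what I have in mind is the untested sketch below; the map name and
how vswitchd wires it up are placeholders, nothing is settled yet.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} recirc_progs SEC(".maps");   /* vswitchd would put the entry prog in slot 0 */

static __always_inline int do_recirc(struct xdp_md *ctx, __u32 recirc_id)
{
    /* Stash recirc_id where the next pass can find it, e.g. a per-CPU
     * scratch map or packet metadata (details TBD). */
    bpf_tail_call(ctx, &recirc_progs, 0);
    /* Only reached if the tail call fails (empty slot or limit reached). */
    return XDP_DROP;
}

A nice side effect is that each tail-called program is verified separately,
so recirculation would not inflate the main program's instruction count,
though the kernel's tail call limit (32) would bound recirculation depth.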

>> One possible solution is to combine two rules into one and insert it instead,
> I think it's hard to combine two rules into one.
> Because after recirc, the packet might match a variety of flows, and you
> basically need to do a cross product of all matching flows.
> 
>> but I have not verified whether this can work or how to implement it.
>> Can we leave this for follow-up patches later?
> I'm not sure what others think. But I think it's better that the first
> design has at least simple vxlan or recirc support, so people are more
> willing to try it and give feedback.
> I also expect there will be huge changes and new issues when we add tunnel
> support later on. E.g. the flow key structure will grow and we will need
> to break the datapath into smaller bpf programs.

Yes, we should figure out how to avoid instruction-count inflation anyway.
Will check the verifier logic again...

Thanks,
Toshiaki Makita

