[ovs-dev] [PATCH v4 0/5] XDP offload using flow API provider

Toshiaki Makita toshiaki.makita1 at gmail.com
Mon Feb 22 14:44:17 UTC 2021


On 2021/02/16 10:49, William Tu wrote:
> On Tue, Feb 9, 2021 at 1:39 AM Toshiaki Makita
> <toshiaki.makita1 at gmail.com> wrote:
>>
>> On 2021/02/05 2:36, William Tu wrote:
>>> Hi Toshiaki,
>>>
>>> Thanks for the patch. I've been testing it for a couple of days.
>>> I liked it a lot! The compile and build process works without any issues.
>>
>> Hi, thank you for reviewing!
>> Sorry for taking so long to reply. It took a while to remember every detail of the patch set...
>>
>>> On Thu, Jul 30, 2020 at 7:55 PM Toshiaki Makita
>>> <toshiaki.makita1 at gmail.com> wrote:
>>>>
>>>> This patch adds an XDP-based flow cache using the OVS netdev-offload
>>>> flow API provider.  When XDP offload is enabled on an OVS device,
>>>> packets are first processed in the XDP flow cache (with parsing and
>>>> table lookup implemented in eBPF), and on a hit the actions are also
>>>> executed in the context of XDP, which has minimal overhead.
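To make the above concrete: the XDP-side fast path is roughly structured like
the sketch below.  Map and field names are only illustrative (this is not the
actual flowtable program), and the real key and actions are much richer.

/* Illustrative sketch of the XDP fast path, not the actual sources. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key {
    __u32 src_ip;
    __u32 dst_ip;
    __u8  proto;          /* the real key covers L2-L4 and more */
};

struct flow_actions {
    __u32 out_port;       /* the real value encodes a list of actions */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 8192);
    __type(key, struct flow_key);
    __type(value, struct flow_actions);
} flow_table SEC(".maps");   /* simplified: the real program uses map-in-map,
                              * one inner map per subtable (see below) */

struct {
    __uint(type, BPF_MAP_TYPE_DEVMAP);
    __uint(max_entries, 64);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} output_map SEC(".maps");

SEC("xdp")
int flowtable(struct xdp_md *ctx)
{
    struct flow_key key = {};

    /* 1. Parse the packet into a flow key (parsing omitted here). */
    /* 2. Look up the flow installed by vswitchd via the offload provider. */
    struct flow_actions *acts = bpf_map_lookup_elem(&flow_table, &key);
    if (!acts)
        return XDP_PASS;  /* miss: hand the packet to the slow path */

    /* 3. Execute the actions in XDP context; here just an output. */
    return bpf_redirect_map(&output_map, acts->out_port, 0);
}

char _license[] SEC("license") = "GPL";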
>>>>
>>>> This provider is built on top of William's recently posted patch for
>>>> loading custom XDP programs.  When a custom XDP program is loaded, the
>>>> provider detects whether the program supports the classifier, and if so
>>>> it starts offloading flows to the XDP program.
>>>>
>>>> The patches are derived from xdp_flow[1], which is a mechanism similar to
>>>> this but implemented in kernel.
>>>>
>>>>
>>>> * Motivation
>>>>
>>>> While the userspace datapath using netdev-afxdp or netdev-dpdk shows good
>>>> performance, there are use cases where packets are better processed in the
>>>> kernel, for example TCP/IP connections or container-to-container
>>>> connections.  The current solution is to use a tap device or af_packet,
>>>> with extra kernel-to/from-userspace overhead.  With XDP, a better solution
>>>> is to steer packets earlier, in the XDP program, and decide whether to send
>>>> them to the userspace datapath or keep them in the kernel.
>>>>
>>>> One problem with the current netdev-afxdp is that it forwards all packets
>>>> to userspace.  The first patch from William (netdev-afxdp: Enable loading
>>>> XDP program.) only provides the interface to load an XDP program; however,
>>>> users usually don't know how to write their own XDP program.
>>>>
>>>> XDP also supports HW-offload, so it may be possible to offload flows to
>>>> HW through this provider in the future, although not currently.
>>>> The reason is that map-in-map is required for our program to support a
>>>> classifier with subtables in XDP, but map-in-map is not offloadable.
>>>> If map-in-map becomes offloadable, HW-offload of our program may also
>>>> become possible.
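For reference, the subtable classifier is where map-in-map comes in: roughly
one inner hash map per megaflow mask, referenced from an outer map, as in the
sketch below (names are again illustrative; struct flow_key/flow_actions are
the same placeholders as in the earlier sketch).

/* Illustrative sketch only, not the actual sources. */
struct subtable {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, struct flow_key);        /* masked flow key */
    __type(value, struct flow_actions);
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
    __uint(max_entries, 16);             /* max number of subtables here */
    __type(key, __u32);
    __array(values, struct subtable);
} flow_tables SEC(".maps");

/* Lookup walks the subtables: fetch the i-th inner map with
 * bpf_map_lookup_elem(&flow_tables, &i), apply that subtable's mask to the
 * parsed key, and look the masked key up in the inner map.  The idea is
 * that vswitchd creates and installs inner maps as new masks appear.
 * This map-in-map part is what is not HW-offloadable today. */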
>>>
>>> I think HW-offloaded XDP is still too far away from meeting OVS's
>>> feature requirements.
>>
>> I don't know blockers other than map-in-map, but probably there are more.
>> If you can provide explicit blockers I can add them in the cover letter.
> 
> It's hard to list them when we don't have a full OVS datapath
> implemented in XDP.
> Here are a couple of things I can imagine. How would HW-offloaded XDP support:
> - AF_XDP sockets? The XSK map contains XSK fds; how would the offloaded
>    program pass the fd to the host kernel?
> - redirecting to another netdev?
> - helper functions such as adjust_head for pushing an outer header?

Thanks, this is helpful. Will add them.

>>> There is a research prototype here, FYI.
>>> https://www.usenix.org/conference/osdi20/presentation/brunella
>>
>> This is a presentation about FPGA, not HW offload to SmartNIC, right?
>>
> Yes, that's for offloading to FPGA.
> 
>>>>
>>>>
>>>> * How to use
>>>>
>>>> 1. Install clang/llvm >= 9, libbpf >= 0.0.6 (included in kernel 5.5), and
>>>>      kernel >= 5.3.
>>>>
>>>> 2. make with --enable-afxdp --enable-xdp-offload
>>>> --enable-bpf will generate XDP program "bpf/flowtable_afxdp.o".  Note that
>>>
>>> typo: I think you mean --enable-xdp-offload
>>
>> Thanks.
>>
>>>
>>>> the BPF object will not be installed anywhere by "make install" at this point.
>>>>
>>>> 3. Load custom XDP program
>>>> E.g.
>>>> $ ovs-vsctl add-port ovsbr0 veth0 -- set int veth0 options:xdp-mode=native \
>>>>     options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>>>> $ ovs-vsctl add-port ovsbr0 veth1 -- set int veth1 options:xdp-mode=native \
>>>>     options:xdp-obj="/path/to/ovs/bpf/flowtable_afxdp.o"
>>>>
>>>> 4. Enable XDP_REDIRECT
>>>> If you use veth devices, make sure to load some (possibly dummy) XDP
>>>> programs on their peers. This patch set includes a program which does
>>>> nothing but return XDP_PASS. You can use it on the veth peer like
>>>> this:
>>>> $ ip link set veth1 xdpdrv object /path/to/ovs/bpf/xdp_noop.o section xdp
>>>
>>> I'd suggest not using "veth1" as an example, because in (3) above, people
>>> might think "veth1" is already attached to ovsbr0.
>>> IIUC, here your "veth1" should be the device at the peer inside
>>> another namespace.
>>
>> Sure, will rename it.
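FWIW, the dummy program is trivial; it is roughly equivalent to the sketch
below.  Loading any XDP program on the veth peer is enough for the peer to
accept redirected frames.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_noop(struct xdp_md *ctx)
{
    /* Do nothing; the program only needs to exist on the peer. */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";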
>>
>>>>
>>>> Some HW NIC drivers require as many queues as there are cores on the
>>>> system. Tweak the number of queues using "ethtool -L".
>>>>
>>>> 5. Enable hw-offload
>>>> $ ovs-vsctl set Open_vSwitch . other_config:offload-driver=linux_xdp
>>>> $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>>>> This starts offloading flows to the XDP program.
>>>>
>>>> You should be able to see some maps installed, including "debug_stats".
>>>> $ bpftool map
>>>>
>>>> If packets are successfully redirected by the XDP program,
>>>> debug_stats[2] will be counted.
>>>> $ bpftool map dump id <ID of debug_stats>
>>>>
>>>> Currently only very limited keys and output actions are supported.
>>>> For example, NORMAL action entries and IP-based matching work with the
>>>> current key support. VLAN actions used by port tags/trunks are also supported.
>>>>
>>>
>>> I don't know if this is too much to ask for.
>>> I wonder if you, or we together, could add at least tunnel support,
>>> e.g. vxlan?
>>> The current version is a good prototype for people to test an L2/L3
>>> XDP offload switch, but without a good use case it's hard to attract
>>> more people to contribute or use it.
>>
>> I think we have discussed this before.
>> Vxlan or other tunneling is indeed important, but that's not straightforward.
>> Push is easy, but pop is not. Pop requires two rules and recirculation.
>> Recirculation is highly likely to hit the eBPF 1M instruction limit.
> 
> Recirculation is pretty important. For example, connection tracking also
> relies on recirc. Can we break the datapath into multiple programs and use
> tail calls? For the recirc action, can we tail call the main eBPF program
> and let the packet go through parse/megaflow lookup/action again?

OK, will try using tail calls.
This will require vswitchd to load another bpf program for tail calls.
I guess such a program can be specified in the main bpf program's metadata.
I'll check if it works.
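Roughly what I have in mind is the untested sketch below; the map name and
how vswitchd wires it up are placeholders, nothing is settled yet.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} recirc_progs SEC(".maps");   /* vswitchd would put the entry prog in slot 0 */

static __always_inline int do_recirc(struct xdp_md *ctx, __u32 recirc_id)
{
    /* Stash recirc_id where the next pass can find it, e.g. a per-CPU
     * scratch map or packet metadata (details TBD). */
    bpf_tail_call(ctx, &recirc_progs, 0);
    /* Only reached if the tail call fails (empty slot or limit reached). */
    return XDP_DROP;
}

A nice side effect is that each tail-called program is verified separately,
so recirculation would not inflate the main program's instruction count,
though the kernel's tail call limit (32) would bound recirculation depth.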

>> One possible solution is to combine two rules into one and insert it instead,
> I think it's hard to combine two rules into one.
> Because after recirc, the packet might match a variety of flows, and you
> basically need to do a cross product of all matching flows.
> 
>> but I have not verified whether this can work or how to implement it.
>> Can we leave this for follow-up patches later?
> I'm not sure what others think. But I think it's better that the first
> design has at least simple vxlan or recirc support, so people are more
> willing to try it and give feedback.
> I also expect there will be huge changes and new issues when we add tunnel
> support later on. E.g. the flow key structure will grow and we will need
> to break the datapath into smaller bpf programs.

Yes, we should figure out how to avoid instruction-count inflation anyway.
Will check the verifier logic again...

Thanks,
Toshiaki Makita

