[ovs-dev] [PATCH v4 0/9] Add offload support for sFlow

Sun Oct 11 08:03:06 UTC 2020

Hi Ilya,
Please see inline.

>-----Original Message-----
>From: Chris Mi <cmi at nvidia.com>
>Sent: Friday, September 25, 2020 3:40 PM
>To: Roni Bar Yanai <roniba at nvidia.com>; i.maximets at ovn.org;
>dev at openvswitch.org
>Cc: sriharsha.basavapatna at broadcom.com; Eli Britstein <elibr at nvidia.com>;
>hemal.shah at broadcom.com; ian.stokes at intel.com; u9012063 at gmail.com;
>simon.horman at netronome.com
>Subject: Re: [ovs-dev] [PATCH v4 0/9] Add offload support for sFlow
>
>+ Roni
>
>On Thu, 2020-09-24 at 17:56 +0200, Ilya Maximets wrote:
>> On 9/24/20 12:24 PM, Chris Mi wrote:
>> > This patch set adds offload support for sFlow.
>> >
>> > Psample is a genetlink channel for packet sampling. TC action
>> > act_sample uses psample to send sampled packets to userspace.
>> >
>> > When offloading sample action to TC, userspace creates a unique ID
>> > to map sFlow action and tunnel info and passes this ID to kernel
>> > instead of the sFlow info. psample will send this ID and sampled
>> > packet to userspace. Using the ID, userspace can recover the sFlow
>> > info and send sampled packet to the right sFlow monitoring host.
>>
>> Hi.  Thanks for working on this.
>>
>> I read through the implementation really roughly and I have a few
>> questions about the feature in general.  And also a big concern about
>> implementation.
>>
>> The main issue that I see is that current implementation is tightly
>> coupled with kernel datapath and doesn't consider userspace datapath
>> at all even in terms of prohibiting support of sampling for userspace datapath.
>> Also dpif_provider and netdev_offload API relies on psample as a base
>> primitive and this will not allow us to reuse this API once we will
>> have offload support with rte_flow in netdev-offload-dpdk or dummy.
>>
>> So, while designing the feature implementation following concepts
>> should be taken into consideration:
>> 1. netdev-offload-tc could be used by userspace datapath and it should be
>>    possible to use it correctly (it might be possible at least with
>>    skip-sw right now).
>> 2. High-level dpif API should not depend on the implementation and type of
>>    the offload provider or datapath and should work correctly with all
>>    combinations.  (we should think on how current sampling infrastructure
>>    could work with rte_flow sampling implementation in the future).
>>

Right. The implementation should be agnostic to the data plane. I think we
could add a dedicated sample up-call function pointer to dpif-provider. It
would be called directly from the data path (dpif-netlink in this case) and
could be shared between implementations. HW offload using TC from user space
has implementation issues (in my opinion). For example, when offloading
conntrack using TC, the kernel CT is accessed while user space has its own
implementation. Another issue is passing information, such as the mark, when
the offload is not complete; it is not clear that we can do that. I think user
space offload using TC should perhaps be limited to flows that are handled end
to end by a single rule.
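
To make the function pointer idea a bit more concrete, here is a rough sketch
of what I have in mind. All names here (sample_upcall, sample_provider, and so
on) are hypothetical and are not existing OVS API; the real hook would live in
dpif-provider and would carry more context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: what a data path hands over for one sampled packet. */
struct sample_upcall {
    uint32_t sample_id;     /* ID given to the offloaded sample action. */
    const void *packet;     /* Sampled packet data. */
    size_t packet_len;
};

/* Hypothetical callback type, shared by all data path implementations. */
typedef int sample_upcall_cb(const struct sample_upcall *, void *aux);

struct sample_provider {
    sample_upcall_cb *cb;   /* NULL until a handler is registered. */
    void *aux;              /* Opaque state passed back to 'cb'. */
};

static void
sample_provider_register(struct sample_provider *p, sample_upcall_cb *cb,
                         void *aux)
{
    assert(p != NULL);
    p->cb = cb;
    p->aux = aux;
}

/* Called by the data path (psample reader, PMD thread, ...) for each
 * sampled packet.  Returns -1 if nobody registered a handler. */
static int
sample_provider_deliver(const struct sample_provider *p,
                        const struct sample_upcall *upcall)
{
    assert(p != NULL && upcall != NULL);
    return p->cb ? p->cb(upcall, p->aux) : -1;
}

/* Example handler for the sketch: only counts the sampled packets. */
static int
count_samples(const struct sample_upcall *upcall, void *aux)
{
    unsigned int *n_samples = aux;

    (void) upcall;
    (*n_samples)++;
    return 0;
}
```

The point is only that the data path calls sample_provider_deliver() and does
not care whether the packet came from psample or from a DPDK queue.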

>> This leads to some questions about the feature itself:
>>
>> How does HW deliver packets to kernel that these packets appears on
>> psample socket instead of regular upcall sockets?  Is it a separate HW
>> rx queue or packets has special marks?
>>
>> How will it work if we will have smaple() action in HW and AF_XDP
>> socket with generic XDP program assigned?  Will packet be delivered
>> via this AF_XDP socket to userspace or it will still be placed into
>> psample socket? (this depends on how HW marks these packets and where
>> it places them) Is it possible for XDP program to determine if this
>> packet sampled or it just missed HW flow?

XDP and TC are currently not in sync. In SW, XDP always runs first, but with
HW offload the HW offload runs first, so TC and XDP are applied in reverse
order. Note that AF_XDP has no meta data right now, so there is no way to mark
packets for user space (I think it may get meta data in the future, the same
way DPDK has it; I think it is a must). This conflict makes the behavior of a
packet sampled in HW undefined (or implementation specific) when XDP is in
use. Since XDP is executed before skb creation, the sampled packet would not
be marked for the XDP code, and the user space AF_XDP socket might get the
sampled packet, without any mark, instead of the psample channel. (This might
be considered a bug, though, in which case handling of the sampled packet
would be defined to happen before XDP.)

Maybe the use cases should be completely separated. In the AF_XDP case, the
AF_XDP socket can be viewed as a port representor. We offload rules using TC,
expecting that TC SW is not in the game. All misses in HW will reach AF_XDP,
which makes the behavior very similar to DPDK. Of course, this requires meta
data. What do you think?

>> I see the proposal for rte_flow to support sample action and IIUC it
>> states to assign a mark and enqueue the packet to some rx queue, so
>> the OVS will have to match these special marks and handle these packets
>> differently, i.e. execute SFLOW upcall instead of MISS upcall, but this
>> will likely still happen inside the PMD thread and these packets might be
>> passed to usual handler threads for further processing, probably, but I'm
>> not sure how yet.
>> And it's hard to tell how the uniform implementation that will
>> consider all combinations of datapaths and offload providers should
>> look like, but that is something that we should think about.
>>
>> Does that make sense?  Please, share your thoughts.
>>

Sampled packets require a fresh approach from the HW offload perspective.
A sampled packet cannot enter the data path, even with a special mark.
Looking at Linux, it doesn't make sense for a sampled packet to enter the
Linux stack: this would require a special flag on the skb and having this flag
inspected all the way. Looking at OVS user space, we have similar issues: the
packet might be sampled in the middle (not in the first flow), so we need to
recover all the state, and even then special code in OVS would have to
identify it as sampled and execute only the sample action of the specific
flow, without counting it, for example. The sample action also has a cookie,
which is hard or impossible to offload and needs to be recovered in SW.

I think a similar approach should be taken in all implementations: before
offload, we keep the cookie and the flow's other meta data (tunnel, etc.) in
SW and give the sample an ID. A packet sampled in HW should reach a dedicated
handler with that ID. In the kernel this is done using psample; in DPDK we
might need a dedicated queue per port, or a special mark that makes the data
path either call this handler before anything else or queue the packet to
that handler.

The handler will read the packets from the queue. In the kernel it reads from
the psample channel; in DPDK it will read from dedicated queues or an OVS
queue. The handler will restore all the data and generate the up-call, using
the dedicated function pointer. I think such an approach will have a lot of
shared implementation. What do you think?
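
As an illustration of the shared part, the ID bookkeeping could look roughly
like the sketch below. This is only a sketch under my own assumptions: the
names (sample_table, sample_id_alloc, ...), the fixed table size and the
single tun_id field standing in for the tunnel meta data are all made up for
the example and are not taken from the patch set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SAMPLE_IDS 64   /* Arbitrary size for the sketch. */
#define MAX_COOKIE_LEN 32   /* Arbitrary limit for the sketch. */

/* Hypothetical SW-side state saved before the flow is offloaded. */
struct sample_info {
    bool in_use;
    uint8_t cookie[MAX_COOKIE_LEN];  /* sFlow action cookie. */
    size_t cookie_len;
    uint64_t tun_id;                 /* Stand-in for tunnel meta data. */
};

struct sample_table {
    struct sample_info slots[MAX_SAMPLE_IDS];
};

/* Saves 'cookie' and 'tun_id' and returns a nonzero ID to pass down with
 * the offloaded sample action.  Returns 0 if the table is full or the
 * cookie is too long. */
static uint32_t
sample_id_alloc(struct sample_table *t, const void *cookie,
                size_t cookie_len, uint64_t tun_id)
{
    assert(t != NULL);
    if (cookie_len > MAX_COOKIE_LEN) {
        return 0;
    }
    for (uint32_t i = 0; i < MAX_SAMPLE_IDS; i++) {
        struct sample_info *s = &t->slots[i];

        if (!s->in_use) {
            s->in_use = true;
            if (cookie_len > 0) {
                memcpy(s->cookie, cookie, cookie_len);
            }
            s->cookie_len = cookie_len;
            s->tun_id = tun_id;
            return i + 1;            /* ID 0 is reserved for "none". */
        }
    }
    return 0;
}

/* Used by the handler: returns the state saved for 'id', or NULL if the
 * ID is unknown (e.g. the flow was already removed). */
static const struct sample_info *
sample_id_lookup(const struct sample_table *t, uint32_t id)
{
    if (id == 0 || id > MAX_SAMPLE_IDS || !t->slots[id - 1].in_use) {
        return NULL;
    }
    return &t->slots[id - 1];
}

/* Releases 'id' when the offloaded flow is deleted. */
static void
sample_id_free(struct sample_table *t, uint32_t id)
{
    if (id >= 1 && id <= MAX_SAMPLE_IDS) {
        t->slots[id - 1].in_use = false;
    }
}
```

The handler, whether it reads from psample or from a DPDK queue, would call
sample_id_lookup() with the ID that arrived with the packet, restore the
cookie and tunnel data, and then generate the up-call.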

>> Best regards, Ilya Maximets.