[ovs-dev] [PATCH v12 00/11] Add offload support for sFlow

Chris Mi cmi at nvidia.com
Wed Mar 24 13:03:02 UTC 2021


On 3/24/2021 8:14 PM, Ilya Maximets wrote:
> On 3/24/21 10:17 AM, Chris Mi wrote:
>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>> Hi Ilya,
>>>>
>>>> I think about your suggestion recently. But I'm still not very clear about the design.
>>>> Please see my reply below:
>>>>
>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>> Hi Simon, Ilya,
>>>>>>
>>>>>> Could I know what we should do to make progress on this patch set?
>>>>>> It has been posted in the community for a long time 😁
>>>>> In general, the way to get your patches reviewed is to review other
>>>>> patches.  It's simply because we still have a huge review backlog
>>>>> (214 patches right now in patchwork, and most of them need review)
>>>>> and bug fixes usually have a bit higher priority.  By reviewing other
>>>>> patches you're reducing the amount of work for maintainers so they can
>>>>> get to your patches faster.
>>>> OK, I see.
>>>>> For the series and comments from Eelco:
>>>>> I didn't read the patches carefully, only took a quick glance, but I still
>>>>> do not understand why we need a separate thread to poll psample events.
>>>>> Why can't we just allow the usual handler threads to do that?
>>>> I'm not sure if you are aware that the psample netlink is different from the ovs
>>>> netlink. Without offload, the kernel sends missed packets and sFlow packets to userspace
>>>> using the same netlink family, 'ovs_packet_family', so they can use the same handler thread.
>>>> But in order to offload the sFlow action, we use the psample kernel module to send sampled
>>>> packets from the kernel to userspace. The formats of ovs netlink messages and psample
>>>> netlink messages are different.
>>> Hi.  Sorry for late reply.
>>>
>>> Yes, I understand that the message format is different, but it doesn't really
>>> matter.  All the message-specific handling and parsing should happen
>>> inside netdev_offload_tc_recv().  This function should have a prototype
>>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to a
>>> struct dpif_upcall from the caller and fill it with data.  Maybe some other
>>> type of generic data structure if it's really not possible to construct
>>> a struct dpif_upcall.
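
[ Sketching the prototype I understand is being suggested here, so we're on
  the same page.  The names and fields below are only illustrative, not the
  actual OVS API: ]

    #include <stddef.h>

    /* Hypothetical shape of the provider receive hook: like
     * dpif_netlink_recv(), it fills a caller-supplied upcall structure
     * instead of calling into ofproto itself. */
    struct sketch_upcall {            /* stand-in for struct dpif_upcall */
        int type;                     /* sFlow sample vs. miss upcall    */
        const void *packet;           /* sampled packet data             */
        size_t packet_len;
        void *userdata;               /* recovered sFlow/tunnel info     */
    };

    /* Returns 0 on success, EAGAIN when nothing is pending. */
    int sketch_offload_tc_recv(unsigned int handler_id,
                               struct sketch_upcall *upcall);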
>>>
>>>
>>>>>      From the
>>>>> architecture perspective it's not a good thing to call ofproto code
>>>>> from the offload-provider.  This introduces lots of complications
>>>>> and might cause lots of issues in the future.
>>>>>
>>>>> I'd say that design should look like this:
>>>>>
>>>>>      handler thread ->
>>>>>        dpif_recv() ->
>>>>>          dpif_netlink_recv() ->
>>>>>            netdev_offload_recv() ->
>>>>>              netdev_offload_tc_recv() ->
>>>>>                nl_sock_recv()
>>>> In order to use the handler threads, I'm not sure if we should add the psample socket
>>>> fd to every handler's epoll_fd. If we should do that, we need to consider how to
>>>> differentiate whether an event comes from the ovs or the psample netlink. Maybe we should
>>>> allocate a special port number for psample and assign it to event.data.u32.
>>>> Anyway, those are details. If this is the right direction, I'll think about it.
>>>>> Last three calls should be implemented.
>>>>>
>>>>> The better version of this will be to throw away dpif part from
>>>>> above call chain and make it:
>>>>>
>>>>>      handler thread ->
>>>>>        netdev_offload_recv() ->
>>>>>          netdev_offload_tc_recv() ->
>>>>>            nl_sock_recv()
>>>> If we throw away the dpif part, maybe we have to write some duplicate epoll code
>>>> for psample only. And we can't block in nl_sock_recv(). Maybe we have to add the
>>>> psample socket fd to every handler's epoll_fd. So we have to change the dpif
>>>> somehow.
>>> There is no need to implement any epoll_fd logic for psample.
>>> You may only use the handler with handler_id == 0.  Only this handler will receive
>>> psample upcalls.  netdev_offload_recv_wait() should be implemented similarly
>>> to how dpif_netlink_recv_wait() is implemented for the Windows case, i.e. it will
>>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>>> There is no need to block, and you're never blocking anywhere because you're
>>> calling nl_sock_recv(..., wait = false), which uses MSG_DONTWAIT while
>>> actually receiving a message.
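
[ To make sure I follow the "only handler 0" part, here is a standalone
  sketch in plain POSIX terms; psample_fd and the helper names are
  illustrative, not the OVS functions themselves: ]

    #include <errno.h>
    #include <stddef.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    static int psample_fd = -1;   /* created when sFlow offload is enabled */

    static void sketch_recv_wait(unsigned int handler_id)
    {
        if (handler_id == 0 && psample_fd >= 0) {
            /* In OVS this would be nl_sock_wait(psample_sock, POLLIN),
             * which only registers the fd with the poll loop and never
             * blocks here. */
        }
    }

    static int sketch_recv(unsigned int handler_id, void *buf, size_t len,
                           ssize_t *n)
    {
        if (handler_id != 0 || psample_fd < 0) {
            return EAGAIN;
        }
        /* MSG_DONTWAIT mirrors nl_sock_recv(..., wait = false): the read
         * itself never blocks either. */
        *n = recv(psample_fd, buf, len, MSG_DONTWAIT);
        return *n < 0 ? errno : 0;
    }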
>> Hi Ilya,
>>
>> With this new design, the code will change greatly. It's not that I'm unwilling to change it,
>> but the effort is not small. So before I start, I want to make sure that it is feasible.
>>
>> Yes, we won't block in nl_sock_recv(), but the handler thread will be blocked in poll_block().
>> If any of the vport netlink sockets is readable, it will be woken up by the kernel. If we don't use
>> epoll_ctl to add the psample netlink socket to the handler's epoll fd, we can't receive the
>> psample packets as soon as possible. That means we can only receive the sampled packets
>> when there is a miss upcall.
>>
>> I added some debug messages in the ovs epoll code and reached the above conclusion. But I'm not
>> an expert on the epoll API, so I'm not sure if I missed anything.
> The handler thread wakes up on POLLIN on the epoll_fd itself, not on an event on one
> of the file descriptors added to epoll.  The epoll_fd is added to a thread's poll
> loop by
>    poll_fd_wait(handler->epoll_fd, POLLIN);
>
> If you call nl_sock_wait() on the psample socket, this socket will be added
> to the same poll loop with:
>    nl_sock_wait(psample_sock, POLLIN) ->
>      poll_fd_wait(psample_sock->fd, POLLIN);
>
> This way the psample socket becomes an *additional* source of wakeups
> for the thread that called nl_sock_wait().  So, this handler will be woken
> up if POLLIN happens on the psample socket even if there are no miss upcalls.
>
> The whole epoll infrastructure is local to lib/dpif-netlink.c, and handler
> threads know nothing about it.  poll_loop() knows nothing about this epoll
> stuff either; it just adds the epoll_fd itself to the list of fds for the usual
> poll() because of the poll_fd_wait() call.  The psample socket will end up in
> the same usual poll().
Thanks a lot for your detailed explanation, Ilya. Now it's clear to me. I
will work on it.
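
To double-check my understanding, here is the wakeup path as a standalone
poll() sketch (the two fds are illustrative): the epoll_fd covers the vport
sockets as before, and the psample fd is simply one more entry in the same
poll set, so POLLIN on it wakes the handler even when there is no miss
upcall.

    #include <poll.h>

    static int wait_for_upcall_or_sample(int epoll_fd, int psample_fd)
    {
        struct pollfd pfds[2] = {
            { .fd = epoll_fd,   .events = POLLIN },  /* vport upcalls   */
            { .fd = psample_fd, .events = POLLIN },  /* sampled packets */
        };

        /* OVS reaches this point via poll_block() in the handler loop. */
        return poll(pfds, 2, -1);
    }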

-Chris
>> Thanks,
>> Chris
>>> So, there should be several call chains:
>>>
>>> 1. Init.
>>>
>>>      open_dpif_backer/type_run() ->
>>>        netdev_offload_recv_set() ->
>>>          netdev_offload_tc_recv_set(enabled) ->
>>>            if (enable)
>>>              psample_sock = nl_sock_create(...);
>>>            else
>>>              close(psample_sock);
>>>
>>> 2. Wait.
>>>
>>>      udpif_upcall_handler() ->
>>>        netdev_offload_recv_wait() ->
>>>          netdev_offload_tc_recv_wait() ->
>>>            if (handler_id == 0)
>>>              nl_sock_wait(psample_sock, POLLIN);
>>>
>>> 3. Receive.
>>>
>>>      udpif_upcall_handler() ->
>>>        recv_upcalls() ->
>>>          netdev_offload_recv() ->
>>>          | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
>>>          |   if (handler_id == 0)
>>>          |     nl_sock_recv(psample_socket, ..., wait = false);
>>>          |     *upcall = <received data>;
>>>          upcall_receive()
>>>          process_upcall() ->
>>>            dpif_sflow_received()
>>>
>>> 4. Deinit.
>>>
>>>      close_dpif_backer() ->
>>>        netdev_offload_recv_set(enable = false);
>>>
>>>
>>> Does that look more clear?
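
[ For later reference, the four call chains above map onto a small set of
  provider hooks; this is only an illustrative sketch, not the actual
  netdev_offload API: ]

    #include <stdbool.h>

    struct sketch_upcall;                            /* see earlier sketch   */

    struct sketch_offload_api {
        int  (*recv_set)(bool enable);               /* 1. Init / 4. Deinit  */
        void (*recv_wait)(unsigned int handler_id);  /* 2. Wait              */
        int  (*recv)(unsigned int handler_id,
                     struct sketch_upcall *upcall);  /* 3. Receive           */
        void (*recv_purge)(void);                    /* drop pending samples */
    };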
>>>
>>>> I don't know if I understand your suggestion correctly, or if I missed anything.
>>>> So if you have time, could you please elaborate?
>>>>
>>>> Thanks,
>>>> Chris
>>>>> This way we could avoid touching dpif-netdev and still have psample
>>>>> offloading for the case where netdev-offload-tc is used from the
>>>>> userspace datapath.
>>>>>
>>>>> Above architecture also implies implementation of:
>>>>>     - netdev_offload_recv_wait()
>>>>>     - netdev_offload_recv_purge()
>>>>>     - and the netdev_offload_tc_* counterparts.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>>> Thanks,
>>>>>> Chris
>>>>>>
>>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>>
>>>>>>>> Psample is a genetlink channel for packet sampling. The TC action act_sample
>>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>>
>>>>>>>> When offloading the sample action to TC, userspace creates a unique ID to
>>>>>>>> map the sFlow action and tunnel info and passes this ID to the kernel instead
>>>>>>>> of the sFlow info. psample will send this ID and the sampled packet to
>>>>>>>> userspace. Using the ID, userspace can recover the sFlow info and send the
>>>>>>>> sampled packet to the right sFlow monitoring host.
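
[ A standalone sketch of the group-ID idea described above: the ID handed
  to the kernel is just a key that lets userspace look the saved sFlow
  state back up when the sampled packet returns via psample.  The names
  and the fixed-size table are purely illustrative: ]

    #include <stdint.h>
    #include <stddef.h>

    #define MAX_SAMPLE_GROUPS 128

    struct sflow_state {          /* stand-in for the saved sFlow/tunnel info */
        void *sflow_attr;
        void *tunnel_info;
    };

    static struct sflow_state *groups[MAX_SAMPLE_GROUPS];

    /* Offload path: remember the state, give the kernel only the ID. */
    static int32_t sample_group_alloc(struct sflow_state *state)
    {
        for (int32_t id = 1; id < MAX_SAMPLE_GROUPS; id++) {
            if (!groups[id]) {
                groups[id] = state;
                return id;        /* this ID goes into the TC sample action */
            }
        }
        return -1;                /* no free group ID */
    }

    /* Upcall path: psample delivers the ID back with the sampled packet. */
    static struct sflow_state *sample_group_lookup(int32_t id)
    {
        return id > 0 && id < MAX_SAMPLE_GROUPS ? groups[id] : NULL;
    }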
>>>>>>>>
>>>>>>>> v2-v1:
>>>>>>>> - Fix robot errors.
>>>>>>>> v3-v2:
>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>>> - Add travis test result.
>>>>>>>> v4-v3:
>>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>>> v5-v4:
>>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>>> v6-v5:
>>>>>>>> - Rebase.
>>>>>>>> - Add GitHub Actions test result.
>>>>>>>> v7-v6:
>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>>> v8-v7
>>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>>> v9-v8
>>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>>> - Log a debug message for other userspace actions.
>>>>>>>> v10-v9
>>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>>> v11-v10
>>>>>>>> - Fix a bracing error.
>>>>>>>> v12-v11
>>>>>>>> - Add duplicate sample group id check.
>>>>>>>>
>>>>>>>> Chris Mi (11):
>>>>>>>>       compat: Add psample and tc sample action defines for older kernels
>>>>>>>>       ovs-kmod-ctl: Load kernel module psample
>>>>>>>>       dpif: Introduce register sFlow upcall callback API
>>>>>>>>       ofproto: Add upcall callback to process sFlow packet
>>>>>>>>       netdev-offload: Introduce register sFlow upcall callback API
>>>>>>>>       netdev-offload-tc: Implement register sFlow upcall callback API
>>>>>>>>       dpif-netlink: Implement register sFlow upcall callback API
>>>>>>>>       netdev-offload-tc: Introduce group ID management API
>>>>>>>>       netdev-offload-tc: Create psample netlink socket
>>>>>>>>       netdev-offload-tc: Add psample receive handler
>>>>>>>>       netdev-offload-tc: Add offload support for sFlow
>>>>>>>>
>>>>>>>>      include/linux/automake.mk        |   4 +-
>>>>>>>>      include/linux/psample.h          |  58 +++
>>>>>>>>      include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>>>      lib/dpif-netdev.c                |   1 +
>>>>>>>>      lib/dpif-netlink.c               |  27 ++
>>>>>>>>      lib/dpif-netlink.h               |   4 +
>>>>>>>>      lib/dpif-provider.h              |  10 +
>>>>>>>>      lib/dpif.c                       |   8 +
>>>>>>>>      lib/dpif.h                       |  23 ++
>>>>>>>>      lib/netdev-offload-provider.h    |   3 +
>>>>>>>>      lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>>>>>>      lib/netdev-offload.c             |  30 ++
>>>>>>>>      lib/netdev-offload.h             |   4 +
>>>>>>>>      lib/tc.c                         |  61 ++-
>>>>>>>>      lib/tc.h                         |  16 +-
>>>>>>>>      ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>>>      utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>>>      17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>>      create mode 100644 include/linux/psample.h
>>>>>>>>      create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>>
>>>>>>> Hi Simon, Ilya,
>>>>>>>
>>>>>>> Can you help review for this series?
>>>>>>> do you have any comments you want us to handle?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Roi


