[ovs-dev] [PATCH v12 00/11] Add offload support for sFlow

Ilya Maximets i.maximets at ovn.org
Wed Mar 24 12:14:42 UTC 2021


On 3/24/21 10:17 AM, Chris Mi wrote:
> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>> Hi Ilya,
>>>
>>> I think about your suggestion recently. But I'm still not very clear about the design.
>>> Please see my reply below:
>>>
>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>> Hi Simon, Ilya,
>>>>>
>>>>> Could I know what should we do to make progress for this patch set?
>>>>> It has been posted in the community for a long time 😁
>>>> In general, the way to get your patches reviewed is to review other
>>>> patches.  It's simply because we still have a huge review backlog
>>>> (214 patches right now in patchwork and most of them needs review)
>>>> and bugfixes usually has a bit higher priority.  By reviewing other
>>>> patches you're reducing amount of work for maintainers so they can
>>>> get to your patches faster.
>>> OK, I see.
>>>> For the series and comments from Eelco:
>>>> I didn't read the patches carefully, only a quick glance, but I still
>>>> do not understand why we need a separate thread to poll psample events.
>>>> Why can't we just allow usual handler threads to do that?
>>> I'm not sure if you are aware of that the psample netlink is different from the ovs
>>> netlink. Without offload, kernel sends missed packet and sFlow packet to userspace
>>> using the same netlink 'ovs_packet_family'. So they can use the same handler thread.
>>> But in order to offload sFlow action, we use psample kernel module to send sampled
>>> packets from kernel to userspace. The format for ovs netlink message and psample
>>> netlink messages are different.
>> Hi.  Sorry for late reply.
>>
>> Yes, I understand that message format is different, but it doesn't really
>> matter.  All the message-specific handling and parsing should happen
>> inside the netdev_offload_tc_recv().  This function should have a prototype
>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to the
>> struct dpif_upcall from the caller and fill it with data.  Maybe other
>> type of a generic data structure if it's really not possible to construct
>> struct dpif_upcall.
>>
>>
>>>>     From the
>>>> architecture perspective it's not a good thing to call ofproto code
>>>> from the offload-provider.  This introduces lots of complications
>>>> and might cause lots of issues in the future.
>>>>
>>>> I'd say that design should look like this:
>>>>
>>>>     handler thread ->
>>>>       dpif_recv() ->
>>>>         dpif_netlink_recv() ->
>>>>           netdev_offload_recv() ->
>>>>             netdev_offload_tc_recv() ->
>>>>               nl_sock_recv()
>>> In order to use the handler thread, I'm not sure if we should add psample socket
>>> fd to every handler's epoll_fd. If we should do that, we should consider how to
>>> differentiate if the event comes from ovs or psample netlink. Maybe we should
>>> allocate a special port number for psample and assign it to event.data.u32.
>>> Anyway, that's the details. If this is the right direction, I'll think about it.
>>>> Last three calls should be implemented.
>>>>
>>>> The better version of this will be to throw away dpif part from
>>>> above call chain and make it:
>>>>
>>>>     handler thread ->
>>>>       netdev_offload_recv() ->
>>>>         netdev_offload_tc_recv() ->
>>>>           nl_sock_recv()
>>> If we throw away dpif part, maybe we have to write some duplicate epoll code
>>> for psample only. And we can't block in nl_sock_recv(). Maybe we have to add
>>> psample socket fd to every handler's epoll_fd. So we have to change the dpif
>>> somehow.
>> There is no need to implement any epoll_fd logic for psample.
>> You may only use handler with handler_id == 0.  Only this handler will receive
>> psample upcalls.  netdev_offload_recv_wait() should be implemented similar
>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. it will
>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>> There is no need to block and you're never blocking anywhere because your're
>> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT while
>> actually receiving a message.
> Hi Ilya,
> 
> With this new design, the code will be changed greatly. I'm not unwilling to change it.
> The effort is not small. So before I start, I want to make sure that it is feasible.
> 
> Yes, we won't block in nl_sock_recv(), but the handler thread will be blocked in poll_block().
> If any of the vport netlink socket is readable , it will be waken up by kernel. If we don't use
> epoll_ctl to add the psample netlink socket to the handler's epoll fd, we can't receive the
> psample packet as soon as possible. That means we can only receive the sampled packet
> when there is a miss upcall.
> 
> I added some debug message in ovs epoll code and got above conclusion. But I'm not
> the expert of the epoll API, so I'm not sure if I missed anything.

Handler thread wakes up on POLLIN on epoll_fd itself, not on the event on one
of the file descriptors added to epoll.  epoll_fd is added to a thread's poll
loop by
  poll_fd_wait(handler->epoll_fd, POLLIN);

If you will call nl_sock_wait() on psample socket, this socket will be added
to the same poll loop with:
  nl_sock_wait(psample_sock, POLLIN) ->
    poll_fd_wait(psample_sock->fd, POLLIN);

This way psample socket will become an *additional* source for waking up
for the thread that called nl_sock_wait().  So, this handler will be waken
up if POLLIN happened on a psample socket even if there are no miss upcalls.

The whole epoll infrastructure is local to lib/dpif-netlink.c and handler
threads knows nothing about it.  poll_loop() knows nothing about this epoll
stuff too, it just adds epoll_fd itself to the list of fds for the usual
poll() because of the poll_fd_wait() call.  psample socket will end up in
the same usual poll().

> 
> Thanks,
> Chris
>>
>> So, there should be several call chains:
>>
>> 1. Init.
>>
>>     open_dpif_backer/type_run() ->
>>       netdev_offload_recv_set() ->
>>         netdev_offload_tc_recv_set(enabled) ->
>>           if (enable)
>>             psample_sock = nl_sock_create(...);
>>           else
>>             close(psample_sock);
>>
>> 2. Wait.
>>
>>     udpif_upcall_handler() ->
>>       netdev_offload_recv_wait() ->
>>         netdev_offload_tc_recv_wait() ->
>>           if (handler_id == 0)
>>             nl_sock_wait(psample_sock, POLLIN);
>>
>> 3. Receive.
>>
>>     udpif_upcall_handler() ->
>>       recv_upcalls() ->
>>         netdev_offload_recv() ->
>>         | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
>>         |   if (handler_id == 0)
>>         |     nl_sock_recv(psample_socket, ..., wait = false);
>>         |     *upcall = <received data>;
>>         upcall_receive()
>>         process_upcall() ->
>>           dpif_sflow_received()
>>
>> 4. Deinit.
>>
>>     close_dpif_backer() ->
>>       netdev_offload_recv_set(enable = false);
>>
>>
>> Does that look more clear?
>>
>>> I don't know if I understand your suggestion correctly. Or I missed anything
>>> So if you have time, could you please elaborate?
>>>
>>> Thanks,
>>> Chris
>>>> This way we could avoid touching dpif-netdev and still have psample
>>>> offloading for the case where netdev-offload-tc is used from the
>>>> userspace datapath.
>>>>
>>>> Above architecture also implies implementation of:
>>>>    - netdev_offload_recv_wait()
>>>>    - netdev_offload_recv_purge()
>>>>    - and the netdev_offload_tc_* counterparts.
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>>> Thanks,
>>>>> Chris
>>>>>
>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>
>>>>>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>
>>>>>>> When offloading sample action to TC, userspace creates a unique ID to
>>>>>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>>>>>> of the sFlow info. psample will send this ID and sampled packet to
>>>>>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>>
>>>>>>> v2-v1:
>>>>>>> - Fix robot errors.
>>>>>>> v3-v2:
>>>>>>> - Remove Gerrit Change-Id.
>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>> - Add travis test result.
>>>>>>> v4-v3:
>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>> v5-v4:
>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>> v6-v5:
>>>>>>> - Rebase.
>>>>>>> - Add GitHub Actions test result.
>>>>>>> v7-v6:
>>>>>>> - Remove Gerrit Change-Id.
>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>> v8-v7
>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>> v9-v8
>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>> - Log a debug message for other userspace actions.
>>>>>>> v10-v9
>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>> v11-v10
>>>>>>> - Fix a bracing error.
>>>>>>> v12-v11
>>>>>>> - Add duplicate sample group id check.
>>>>>>>
>>>>>>> Chris Mi (11):
>>>>>>>      compat: Add psample and tc sample action defines for older kernels
>>>>>>>      ovs-kmod-ctl: Load kernel module psample
>>>>>>>      dpif: Introduce register sFlow upcall callback API
>>>>>>>      ofproto: Add upcall callback to process sFlow packet
>>>>>>>      netdev-offload: Introduce register sFlow upcall callback API
>>>>>>>      netdev-offload-tc: Implement register sFlow upcall callback API
>>>>>>>      dpif-netlink: Implement register sFlow upcall callback API
>>>>>>>      netdev-offload-tc: Introduce group ID management API
>>>>>>>      netdev-offload-tc: Create psample netlink socket
>>>>>>>      netdev-offload-tc: Add psample receive handler
>>>>>>>      netdev-offload-tc: Add offload support for sFlow
>>>>>>>
>>>>>>>     include/linux/automake.mk        |   4 +-
>>>>>>>     include/linux/psample.h          |  58 +++
>>>>>>>     include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>>     lib/dpif-netdev.c                |   1 +
>>>>>>>     lib/dpif-netlink.c               |  27 ++
>>>>>>>     lib/dpif-netlink.h               |   4 +
>>>>>>>     lib/dpif-provider.h              |  10 +
>>>>>>>     lib/dpif.c                       |   8 +
>>>>>>>     lib/dpif.h                       |  23 ++
>>>>>>>     lib/netdev-offload-provider.h    |   3 +
>>>>>>>     lib/netdev-offload-tc.c          | 659 ++++++++++++++++++++++++++++++-
>>>>>>>     lib/netdev-offload.c             |  30 ++
>>>>>>>     lib/netdev-offload.h             |   4 +
>>>>>>>     lib/tc.c                         |  61 ++-
>>>>>>>     lib/tc.h                         |  16 +-
>>>>>>>     ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>>     utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>>     17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>     create mode 100644 include/linux/psample.h
>>>>>>>     create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>
>>>>>> Hi Simon, Ilya,
>>>>>>
>>>>>> Can you help review for this series?
>>>>>> do you have any comments you want us to handle?
>>>>>>
>>>>>> Thanks,
>>>>>> Roi
> 



More information about the dev mailing list