[ovs-dev] [PATCH v12 00/11] Add offload support for sFlow

Eelco Chaudron echaudro at redhat.com
Mon Mar 29 09:46:19 UTC 2021



On 24 Mar 2021, at 14:03, Chris Mi wrote:

> On 3/24/2021 8:14 PM, Ilya Maximets wrote:
>> On 3/24/21 10:17 AM, Chris Mi wrote:
>>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>>> Hi Ilya,
>>>>>
>>>>> I think about your suggestion recently. But I'm still not very 
>>>>> clear about the design.
>>>>> Please see my reply below:
>>>>>
>>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>>> Hi Simon, Ilya,
>>>>>>>
>>>>>>> Could I know what should we do to make progress for this patch 
>>>>>>> set?
>>>>>>> It has been posted in the community for a long time 😁
>>>>>> In general, the way to get your patches reviewed is to review 
>>>>>> other
>>>>>> patches.  It's simply because we still have a huge review 
>>>>>> backlog
>>>>>> (214 patches right now in patchwork and most of them needs 
>>>>>> review)
>>>>>> and bugfixes usually has a bit higher priority.  By reviewing 
>>>>>> other
>>>>>> patches you're reducing amount of work for maintainers so they 
>>>>>> can
>>>>>> get to your patches faster.
>>>>> OK, I see.
>>>>>> For the series and comments from Eelco:
>>>>>> I didn't read the patches carefully, only a quick glance, but I 
>>>>>> still
>>>>>> do not understand why we need a separate thread to poll psample 
>>>>>> events.
>>>>>> Why can't we just allow usual handler threads to do that?
>>>>> I'm not sure if you are aware of that the psample netlink is 
>>>>> different from the ovs
>>>>> netlink. Without offload, kernel sends missed packet and sFlow 
>>>>> packet to userspace
>>>>> using the same netlink 'ovs_packet_family'. So they can use the 
>>>>> same handler thread.
>>>>> But in order to offload sFlow action, we use psample kernel module 
>>>>> to send sampled
>>>>> packets from kernel to userspace. The format for ovs netlink 
>>>>> message and psample
>>>>> netlink messages are different.
>>>> Hi.  Sorry for late reply.
>>>>
>>>> Yes, I understand that message format is different, but it doesn't 
>>>> really
>>>> matter.  All the message-specific handling and parsing should 
>>>> happen
>>>> inside the netdev_offload_tc_recv().  This function should have a 
>>>> prototype
>>>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to 
>>>> the
>>>> struct dpif_upcall from the caller and fill it with data.  Maybe 
>>>> other
>>>> type of a generic data structure if it's really not possible to 
>>>> construct
>>>> struct dpif_upcall.
>>>>
>>>>
>>>>>>      From the
>>>>>> architecture perspective it's not a good thing to call ofproto 
>>>>>> code
>>>>>> from the offload-provider.  This introduces lots of 
>>>>>> complications
>>>>>> and might cause lots of issues in the future.
>>>>>>
>>>>>> I'd say that design should look like this:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        dpif_recv() ->
>>>>>>          dpif_netlink_recv() ->
>>>>>>            netdev_offload_recv() ->
>>>>>>              netdev_offload_tc_recv() ->
>>>>>>                nl_sock_recv()
>>>>> In order to use the handler thread, I'm not sure if we should add 
>>>>> psample socket
>>>>> fd to every handler's epoll_fd. If we should do that, we should 
>>>>> consider how to
>>>>> differentiate if the event comes from ovs or psample netlink. 
>>>>> Maybe we should
>>>>> allocate a special port number for psample and assign it to 
>>>>> event.data.u32.
>>>>> Anyway, that's the details. If this is the right direction, I'll 
>>>>> think about it.
>>>>>> Last three calls should be implemented.
>>>>>>
>>>>>> The better version of this will be to throw away dpif part from
>>>>>> above call chain and make it:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        netdev_offload_recv() ->
>>>>>>          netdev_offload_tc_recv() ->
>>>>>>            nl_sock_recv()
>>>>> If we throw away dpif part, maybe we have to write some duplicate 
>>>>> epoll code
>>>>> for psample only. And we can't block in nl_sock_recv(). Maybe we 
>>>>> have to add
>>>>> psample socket fd to every handler's epoll_fd. So we have to 
>>>>> change the dpif
>>>>> somehow.
>>>> There is no need to implement any epoll_fd logic for psample.
>>>> You may only use handler with handler_id == 0.  Only this handler 
>>>> will receive
>>>> psample upcalls.  netdev_offload_recv_wait() should be implemented 
>>>> similar
>>>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. 
>>>> it will
>>>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>>>> There is no need to block and you're never blocking anywhere 
>>>> because your're
>>>> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT 
>>>> while
>>>> actually receiving a message.
>>> Hi Ilya,
>>>
>>> With this new design, the code will be changed greatly. I'm not 
>>> unwilling to change it.
>>> The effort is not small. So before I start, I want to make sure that 
>>> it is feasible.
>>>
>>> Yes, we won't block in nl_sock_recv(), but the handler thread will 
>>> be blocked in poll_block().
>>> If any of the vport netlink socket is readable , it will be waken up 
>>> by kernel. If we don't use
>>> epoll_ctl to add the psample netlink socket to the handler's epoll 
>>> fd, we can't receive the
>>> psample packet as soon as possible. That means we can only receive 
>>> the sampled packet
>>> when there is a miss upcall.
>>>
>>> I added some debug message in ovs epoll code and got above 
>>> conclusion. But I'm not
>>> the expert of the epoll API, so I'm not sure if I missed anything.
>> Handler thread wakes up on POLLIN on epoll_fd itself, not on the 
>> event on one
>> of the file descriptors added to epoll.  epoll_fd is added to a 
>> thread's poll
>> loop by
>>    poll_fd_wait(handler->epoll_fd, POLLIN);
>>
>> If you will call nl_sock_wait() on psample socket, this socket will 
>> be added
>> to the same poll loop with:
>>    nl_sock_wait(psample_sock, POLLIN) ->
>>      poll_fd_wait(psample_sock->fd, POLLIN);
>>
>> This way psample socket will become an *additional* source for waking 
>> up
>> for the thread that called nl_sock_wait().  So, this handler will be 
>> waken
>> up if POLLIN happened on a psample socket even if there are no miss 
>> upcalls.
>>
>> The whole epoll infrastructure is local to lib/dpif-netlink.c and 
>> handler
>> threads knows nothing about it.  poll_loop() knows nothing about this 
>> epoll
>> stuff too, it just adds epoll_fd itself to the list of fds for the 
>> usual
>> poll() because of the poll_fd_wait() call.  psample socket will end 
>> up in
>> the same usual poll().
> Thank a lot for your detailed explanation, Ilya. Now I'm clear of it. 
> I will work on it.

Guys did you also look at my other objection on the
“[PATCH v9 08/11] netdev-offload-tc: Introduce group ID management 
API” as I still feel this some rework. Ilya your comments on this 
would help set some directions for Chris.

Cheers,

Eelco


>>> Thanks,
>>> Chris
>>>> So, there should be several call chains:
>>>>
>>>> 1. Init.
>>>>
>>>>      open_dpif_backer/type_run() ->
>>>>        netdev_offload_recv_set() ->
>>>>          netdev_offload_tc_recv_set(enabled) ->
>>>>            if (enable)
>>>>              psample_sock = nl_sock_create(...);
>>>>            else
>>>>              close(psample_sock);
>>>>
>>>> 2. Wait.
>>>>
>>>>      udpif_upcall_handler() ->
>>>>        netdev_offload_recv_wait() ->
>>>>          netdev_offload_tc_recv_wait() ->
>>>>            if (handler_id == 0)
>>>>              nl_sock_wait(psample_sock, POLLIN);
>>>>
>>>> 3. Receive.
>>>>
>>>>      udpif_upcall_handler() ->
>>>>        recv_upcalls() ->
>>>>          netdev_offload_recv() ->
>>>>          | netdev_offload_tc_recv(..., struct dpif_upcall 
>>>> *upcall) ->
>>>>          |   if (handler_id == 0)
>>>>          |     nl_sock_recv(psample_socket, ..., wait = 
>>>> false);
>>>>          |     *upcall = <received data>;
>>>>          upcall_receive()
>>>>          process_upcall() ->
>>>>            dpif_sflow_received()
>>>>
>>>> 4. Deinit.
>>>>
>>>>      close_dpif_backer() ->
>>>>        netdev_offload_recv_set(enable = false);
>>>>
>>>>
>>>> Does that look more clear?
>>>>
>>>>> I don't know if I understand your suggestion correctly. Or I 
>>>>> missed anything
>>>>> So if you have time, could you please elaborate?
>>>>>
>>>>> Thanks,
>>>>> Chris
>>>>>> This way we could avoid touching dpif-netdev and still have 
>>>>>> psample
>>>>>> offloading for the case where netdev-offload-tc is used from the
>>>>>> userspace datapath.
>>>>>>
>>>>>> Above architecture also implies implementation of:
>>>>>>     - netdev_offload_recv_wait()
>>>>>>     - netdev_offload_recv_purge()
>>>>>>     - and the netdev_offload_tc_* counterparts.
>>>>>>
>>>>>> Best regards, Ilya Maximets.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Chris
>>>>>>>
>>>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>>>
>>>>>>>>> Psample is a genetlink channel for packet sampling. TC action 
>>>>>>>>> act_sample
>>>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>>>
>>>>>>>>> When offloading sample action to TC, userspace creates a 
>>>>>>>>> unique ID to
>>>>>>>>> map sFlow action and tunnel info and passes this ID to kernel 
>>>>>>>>> instead
>>>>>>>>> of the sFlow info. psample will send this ID and sampled 
>>>>>>>>> packet to
>>>>>>>>> userspace. Using the ID, userspace can recover the sFlow info 
>>>>>>>>> and send
>>>>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>>>>
>>>>>>>>> v2-v1:
>>>>>>>>> - Fix robot errors.
>>>>>>>>> v3-v2:
>>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>>>> - Add travis test result.
>>>>>>>>> v4-v3:
>>>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>>>> v5-v4:
>>>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>>>> v6-v5:
>>>>>>>>> - Rebase.
>>>>>>>>> - Add GitHub Actions test result.
>>>>>>>>> v7-v6:
>>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>>>> v8-v7
>>>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>>>> v9-v8
>>>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>>>> - Log a debug message for other userspace actions.
>>>>>>>>> v10-v9
>>>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>>>> v11-v10
>>>>>>>>> - Fix a bracing error.
>>>>>>>>> v12-v11
>>>>>>>>> - Add duplicate sample group id check.
>>>>>>>>>
>>>>>>>>> Chris Mi (11):
>>>>>>>>>       compat: Add psample and tc sample action defines for 
>>>>>>>>> older kernels
>>>>>>>>>       ovs-kmod-ctl: Load kernel module psample
>>>>>>>>>       dpif: Introduce register sFlow upcall callback API
>>>>>>>>>       ofproto: Add upcall callback to process sFlow packet
>>>>>>>>>       netdev-offload: Introduce register sFlow upcall 
>>>>>>>>> callback API
>>>>>>>>>       netdev-offload-tc: Implement register sFlow upcall 
>>>>>>>>> callback API
>>>>>>>>>       dpif-netlink: Implement register sFlow upcall 
>>>>>>>>> callback API
>>>>>>>>>       netdev-offload-tc: Introduce group ID management API
>>>>>>>>>       netdev-offload-tc: Create psample netlink socket
>>>>>>>>>       netdev-offload-tc: Add psample receive handler
>>>>>>>>>       netdev-offload-tc: Add offload support for sFlow
>>>>>>>>>
>>>>>>>>>      include/linux/automake.mk        |   4 +-
>>>>>>>>>      include/linux/psample.h          |  58 +++
>>>>>>>>>      include/linux/tc_act/tc_sample.h |  25 ++
>>>>>>>>>      lib/dpif-netdev.c                |   
>>>>>>>>> 1 +
>>>>>>>>>      lib/dpif-netlink.c               |  27 
>>>>>>>>> ++
>>>>>>>>>      lib/dpif-netlink.h               |   4 
>>>>>>>>> +
>>>>>>>>>      lib/dpif-provider.h              |  10 +
>>>>>>>>>      lib/dpif.c                       
>>>>>>>>> |   8 +
>>>>>>>>>      lib/dpif.h                       
>>>>>>>>> |  23 ++
>>>>>>>>>      lib/netdev-offload-provider.h    |   3 +
>>>>>>>>>      lib/netdev-offload-tc.c          | 659 
>>>>>>>>> ++++++++++++++++++++++++++++++-
>>>>>>>>>      lib/netdev-offload.c             |  30 ++
>>>>>>>>>      lib/netdev-offload.h             |   4 +
>>>>>>>>>      
>>>>>>>>> lib/tc.c                         |  
>>>>>>>>> 61 ++-
>>>>>>>>>      
>>>>>>>>> lib/tc.h                         |  
>>>>>>>>> 16 +-
>>>>>>>>>      ofproto/ofproto-dpif-upcall.c    |  42 ++
>>>>>>>>>      utilities/ovs-kmod-ctl.in        |  14 +
>>>>>>>>>      17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>>>      create mode 100644 include/linux/psample.h
>>>>>>>>>      create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>>>
>>>>>>>> Hi Simon, Ilya,
>>>>>>>>
>>>>>>>> Can you help review for this series?
>>>>>>>> do you have any comments you want us to handle?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Roi



More information about the dev mailing list