[ovs-dev] [PATCH v12 00/11] Add offload support for sFlow
Chris Mi
cmi at nvidia.com
Wed Mar 24 13:03:02 UTC 2021
On 3/24/2021 8:14 PM, Ilya Maximets wrote:
> On 3/24/21 10:17 AM, Chris Mi wrote:
>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>> Hi Ilya,
>>>>
>>>> I think about your suggestion recently. But I'm still not very clear about the design.
>>>> Please see my reply below:
>>>>
>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>> Hi Simon, Ilya,
>>>>>>
>>>>>> Could I know what should we do to make progress for this patch set?
>>>>>> It has been posted in the community for a long time 😁
>>>>> In general, the way to get your patches reviewed is to review other
>>>>> patches. It's simply because we still have a huge review backlog
>>>>> (214 patches right now in patchwork and most of them needs review)
>>>>> and bugfixes usually has a bit higher priority. By reviewing other
>>>>> patches you're reducing amount of work for maintainers so they can
>>>>> get to your patches faster.
>>>> OK, I see.
>>>>> For the series and comments from Eelco:
>>>>> I didn't read the patches carefully, only a quick glance, but I still
>>>>> do not understand why we need a separate thread to poll psample events.
>>>>> Why can't we just allow usual handler threads to do that?
>>>> I'm not sure if you are aware of that the psample netlink is different from the ovs
>>>> netlink. Without offload, kernel sends missed packet and sFlow packet to userspace
>>>> using the same netlink 'ovs_packet_family'. So they can use the same handler thread.
>>>> But in order to offload sFlow action, we use psample kernel module to send sampled
>>>> packets from kernel to userspace. The format for ovs netlink message and psample
>>>> netlink messages are different.
>>> Hi. Sorry for late reply.
>>>
>>> Yes, I understand that message format is different, but it doesn't really
>>> matter. All the message-specific handling and parsing should happen
>>> inside the netdev_offload_tc_recv(). This function should have a prototype
>>> similar to dpif_netlink_recv(), i.e. it should receive a pointer to the
>>> struct dpif_upcall from the caller and fill it with data. Maybe other
>>> type of a generic data structure if it's really not possible to construct
>>> struct dpif_upcall.
>>>
>>>
>>>>> From the
>>>>> architecture perspective it's not a good thing to call ofproto code
>>>>> from the offload-provider. This introduces lots of complications
>>>>> and might cause lots of issues in the future.
>>>>>
>>>>> I'd say that design should look like this:
>>>>>
>>>>> handler thread ->
>>>>> dpif_recv() ->
>>>>> dpif_netlink_recv() ->
>>>>> netdev_offload_recv() ->
>>>>> netdev_offload_tc_recv() ->
>>>>> nl_sock_recv()
>>>> In order to use the handler thread, I'm not sure if we should add psample socket
>>>> fd to every handler's epoll_fd. If we should do that, we should consider how to
>>>> differentiate if the event comes from ovs or psample netlink. Maybe we should
>>>> allocate a special port number for psample and assign it to event.data.u32.
>>>> Anyway, that's the details. If this is the right direction, I'll think about it.
>>>>> Last three calls should be implemented.
>>>>>
>>>>> The better version of this will be to throw away dpif part from
>>>>> above call chain and make it:
>>>>>
>>>>> handler thread ->
>>>>> netdev_offload_recv() ->
>>>>> netdev_offload_tc_recv() ->
>>>>> nl_sock_recv()
>>>> If we throw away dpif part, maybe we have to write some duplicate epoll code
>>>> for psample only. And we can't block in nl_sock_recv(). Maybe we have to add
>>>> psample socket fd to every handler's epoll_fd. So we have to change the dpif
>>>> somehow.
>>> There is no need to implement any epoll_fd logic for psample.
>>> You may only use handler with handler_id == 0. Only this handler will receive
>>> psample upcalls. netdev_offload_recv_wait() should be implemented similar
>>> to how dpif_netlink_recv_wait() implemented for windows case, i.e. it will
>>> call nl_sock_wait(nl_sock, POLLIN) for the psample netlink socket.
>>> There is no need to block and you're never blocking anywhere because your're
>>> calling nl_sock_recv(..., wait = false) which will use MSG_DONTWAIT while
>>> actually receiving a message.
>> Hi Ilya,
>>
>> With this new design, the code will be changed greatly. I'm not unwilling to change it.
>> The effort is not small. So before I start, I want to make sure that it is feasible.
>>
>> Yes, we won't block in nl_sock_recv(), but the handler thread will be blocked in poll_block().
>> If any of the vport netlink socket is readable , it will be waken up by kernel. If we don't use
>> epoll_ctl to add the psample netlink socket to the handler's epoll fd, we can't receive the
>> psample packet as soon as possible. That means we can only receive the sampled packet
>> when there is a miss upcall.
>>
>> I added some debug message in ovs epoll code and got above conclusion. But I'm not
>> the expert of the epoll API, so I'm not sure if I missed anything.
> Handler thread wakes up on POLLIN on epoll_fd itself, not on the event on one
> of the file descriptors added to epoll. epoll_fd is added to a thread's poll
> loop by
> poll_fd_wait(handler->epoll_fd, POLLIN);
>
> If you will call nl_sock_wait() on psample socket, this socket will be added
> to the same poll loop with:
> nl_sock_wait(psample_sock, POLLIN) ->
> poll_fd_wait(psample_sock->fd, POLLIN);
>
> This way psample socket will become an *additional* source for waking up
> for the thread that called nl_sock_wait(). So, this handler will be waken
> up if POLLIN happened on a psample socket even if there are no miss upcalls.
>
> The whole epoll infrastructure is local to lib/dpif-netlink.c and handler
> threads knows nothing about it. poll_loop() knows nothing about this epoll
> stuff too, it just adds epoll_fd itself to the list of fds for the usual
> poll() because of the poll_fd_wait() call. psample socket will end up in
> the same usual poll().
Thank a lot for your detailed explanation, Ilya. Now I'm clear of it. I
will work on it.
-Chris
>> Thanks,
>> Chris
>>> So, there should be several call chains:
>>>
>>> 1. Init.
>>>
>>> open_dpif_backer/type_run() ->
>>> netdev_offload_recv_set() ->
>>> netdev_offload_tc_recv_set(enabled) ->
>>> if (enable)
>>> psample_sock = nl_sock_create(...);
>>> else
>>> close(psample_sock);
>>>
>>> 2. Wait.
>>>
>>> udpif_upcall_handler() ->
>>> netdev_offload_recv_wait() ->
>>> netdev_offload_tc_recv_wait() ->
>>> if (handler_id == 0)
>>> nl_sock_wait(psample_sock, POLLIN);
>>>
>>> 3. Receive.
>>>
>>> udpif_upcall_handler() ->
>>> recv_upcalls() ->
>>> netdev_offload_recv() ->
>>> | netdev_offload_tc_recv(..., struct dpif_upcall *upcall) ->
>>> | if (handler_id == 0)
>>> | nl_sock_recv(psample_socket, ..., wait = false);
>>> | *upcall = <received data>;
>>> upcall_receive()
>>> process_upcall() ->
>>> dpif_sflow_received()
>>>
>>> 4. Deinit.
>>>
>>> close_dpif_backer() ->
>>> netdev_offload_recv_set(enable = false);
>>>
>>>
>>> Does that look more clear?
>>>
>>>> I don't know if I understand your suggestion correctly. Or I missed anything
>>>> So if you have time, could you please elaborate?
>>>>
>>>> Thanks,
>>>> Chris
>>>>> This way we could avoid touching dpif-netdev and still have psample
>>>>> offloading for the case where netdev-offload-tc is used from the
>>>>> userspace datapath.
>>>>>
>>>>> Above architecture also implies implementation of:
>>>>> - netdev_offload_recv_wait()
>>>>> - netdev_offload_recv_purge()
>>>>> - and the netdev_offload_tc_* counterparts.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>>> Thanks,
>>>>>> Chris
>>>>>>
>>>>>> On 2/23/2021 5:08 PM, Roi Dayan wrote:
>>>>>>> On 2021-01-27 8:23 AM, Chris Mi wrote:
>>>>>>>> This patch set adds offload support for sFlow.
>>>>>>>>
>>>>>>>> Psample is a genetlink channel for packet sampling. TC action act_sample
>>>>>>>> uses psample to send sampled packets to userspace.
>>>>>>>>
>>>>>>>> When offloading sample action to TC, userspace creates a unique ID to
>>>>>>>> map sFlow action and tunnel info and passes this ID to kernel instead
>>>>>>>> of the sFlow info. psample will send this ID and sampled packet to
>>>>>>>> userspace. Using the ID, userspace can recover the sFlow info and send
>>>>>>>> sampled packet to the right sFlow monitoring host.
>>>>>>>>
>>>>>>>> v2-v1:
>>>>>>>> - Fix robot errors.
>>>>>>>> v3-v2:
>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>> - Add patch #9 to fix older kernels build issue.
>>>>>>>> - Add travis test result.
>>>>>>>> v4-v3:
>>>>>>>> - Fix offload issue when sampling rate is 1.
>>>>>>>> v5-v4:
>>>>>>>> - Move polling thread from ofproto to netdev-offload-tc.
>>>>>>>> v6-v5:
>>>>>>>> - Rebase.
>>>>>>>> - Add GitHub Actions test result.
>>>>>>>> v7-v6:
>>>>>>>> - Remove Gerrit Change-Id.
>>>>>>>> - Fix "ERROR: Inappropriate spacing around cast"
>>>>>>>> v8-v7
>>>>>>>> - Address Eelco Chaudron's comment for patch #11.
>>>>>>>> v9-v8
>>>>>>>> - Remove sflow_len from struct dpif_sflow_attr.
>>>>>>>> - Log a debug message for other userspace actions.
>>>>>>>> v10-v9
>>>>>>>> - Address Eelco Chaudron's comments on v9.
>>>>>>>> v11-v10
>>>>>>>> - Fix a bracing error.
>>>>>>>> v12-v11
>>>>>>>> - Add duplicate sample group id check.
>>>>>>>>
>>>>>>>> Chris Mi (11):
>>>>>>>> compat: Add psample and tc sample action defines for older kernels
>>>>>>>> ovs-kmod-ctl: Load kernel module psample
>>>>>>>> dpif: Introduce register sFlow upcall callback API
>>>>>>>> ofproto: Add upcall callback to process sFlow packet
>>>>>>>> netdev-offload: Introduce register sFlow upcall callback API
>>>>>>>> netdev-offload-tc: Implement register sFlow upcall callback API
>>>>>>>> dpif-netlink: Implement register sFlow upcall callback API
>>>>>>>> netdev-offload-tc: Introduce group ID management API
>>>>>>>> netdev-offload-tc: Create psample netlink socket
>>>>>>>> netdev-offload-tc: Add psample receive handler
>>>>>>>> netdev-offload-tc: Add offload support for sFlow
>>>>>>>>
>>>>>>>> include/linux/automake.mk | 4 +-
>>>>>>>> include/linux/psample.h | 58 +++
>>>>>>>> include/linux/tc_act/tc_sample.h | 25 ++
>>>>>>>> lib/dpif-netdev.c | 1 +
>>>>>>>> lib/dpif-netlink.c | 27 ++
>>>>>>>> lib/dpif-netlink.h | 4 +
>>>>>>>> lib/dpif-provider.h | 10 +
>>>>>>>> lib/dpif.c | 8 +
>>>>>>>> lib/dpif.h | 23 ++
>>>>>>>> lib/netdev-offload-provider.h | 3 +
>>>>>>>> lib/netdev-offload-tc.c | 659 ++++++++++++++++++++++++++++++-
>>>>>>>> lib/netdev-offload.c | 30 ++
>>>>>>>> lib/netdev-offload.h | 4 +
>>>>>>>> lib/tc.c | 61 ++-
>>>>>>>> lib/tc.h | 16 +-
>>>>>>>> ofproto/ofproto-dpif-upcall.c | 42 ++
>>>>>>>> utilities/ovs-kmod-ctl.in | 14 +
>>>>>>>> 17 files changed, 973 insertions(+), 16 deletions(-)
>>>>>>>> create mode 100644 include/linux/psample.h
>>>>>>>> create mode 100644 include/linux/tc_act/tc_sample.h
>>>>>>>>
>>>>>>> Hi Simon, Ilya,
>>>>>>>
>>>>>>> Can you help review for this series?
>>>>>>> do you have any comments you want us to handle?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Roi
More information about the dev
mailing list