[ovs-dev] [PATCH v12 00/11] Add offload support for sFlow

Ilya Maximets i.maximets at ovn.org
Tue May 18 12:22:26 UTC 2021


On 4/27/21 3:23 AM, Chris Mi wrote:
> On 3/24/2021 8:14 PM, Ilya Maximets wrote:
>> On 3/24/21 10:17 AM, Chris Mi wrote:
>>> On 3/23/2021 10:24 PM, Ilya Maximets wrote:
>>>> On 3/5/21 4:27 AM, Chris Mi wrote:
>>>>> Hi Ilya,
>>>>>
>>>>> I've been thinking about your suggestion recently, but I'm still not very
>>>>> clear about the design. Please see my reply below:
>>>>>
>>>>> On 3/1/2021 8:48 PM, Ilya Maximets wrote:
>>>>>> On 3/1/21 9:30 AM, Chris Mi wrote:
>>>>>>> Hi Simon, Ilya,
>>>>>>>
>>>>>>> Could I know what we should do to make progress on this patch set?
>>>>>>> It has been posted to the community for a long time 😁
>>>>>> In general, the way to get your patches reviewed is to review other
>>>>>> patches.  It's simply because we still have a huge review backlog
>>>>>> (214 patches right now in patchwork, and most of them need review)
>>>>>> and bugfixes usually have a bit higher priority.  By reviewing other
>>>>>> patches you're reducing the amount of work for maintainers so they can
>>>>>> get to your patches faster.
>>>>> OK, I see.
>>>>>> For the series and comments from Eelco:
>>>>>> I didn't read the patches carefully, only took a quick glance, but I
>>>>>> still do not understand why we need a separate thread to poll psample
>>>>>> events.  Why can't we just allow the usual handler threads to do that?
>>>>> I'm not sure if you are aware that the psample netlink is different from
>>>>> the ovs netlink. Without offload, the kernel sends missed packets and
>>>>> sFlow packets to userspace using the same netlink family,
>>>>> 'ovs_packet_family', so they can use the same handler thread. But in
>>>>> order to offload the sFlow action, we use the psample kernel module to
>>>>> send sampled packets from the kernel to userspace. The formats of ovs
>>>>> netlink messages and psample netlink messages are different.
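>>>>> For reference, psample messages carry attributes from the
>>>>> <linux/psample.h> uAPI rather than the OVS_PACKET_ATTR_* set that
>>>>> ovs_packet_family uses, e.g.:
>>>>>
>>>>>     PSAMPLE_ATTR_IIFINDEX      /* input port ifindex */
>>>>>     PSAMPLE_ATTR_SAMPLE_GROUP  /* psample group of the sampled packet */
>>>>>     PSAMPLE_ATTR_SAMPLE_RATE   /* sampling rate */
>>>>>     PSAMPLE_ATTR_ORIGSIZE      /* original packet length */
>>>>>     PSAMPLE_ATTR_DATA          /* the sampled packet itself */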
>>>> Hi.  Sorry for the late reply.
>>>>
>>>> Yes, I understand that the message format is different, but that doesn't
>>>> really matter.  All the message-specific handling and parsing should
>>>> happen inside netdev_offload_tc_recv().  This function should have a
>>>> prototype similar to dpif_netlink_recv(), i.e. it should receive a
>>>> pointer to a struct dpif_upcall from the caller and fill it with data.
>>>> Maybe some other kind of generic data structure if it's really not
>>>> possible to construct a struct dpif_upcall.
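>>>> Roughly something like the following (just a sketch; apart from
>>>> struct dpif_upcall, struct ofpbuf and nl_sock_recv(), the names here
>>>> are made up for illustration):
>>>>
>>>>     /* Receives one psample message, hiding all psample specifics from
>>>>      * the caller, the same way dpif_netlink_recv() hides the vport
>>>>      * message format.  Returns 0 on success, EAGAIN if no message is
>>>>      * currently available. */
>>>>     static int
>>>>     netdev_offload_tc_recv(struct dpif_upcall *upcall, struct ofpbuf *buf)
>>>>     {
>>>>         int error = nl_sock_recv(psample_sock, buf, NULL, false);
>>>>
>>>>         if (!error) {
>>>>             /* Parse the psample attributes from 'buf' and fill
>>>>              * 'upcall' (packet data, flow key, sFlow cookie, ...). */
>>>>         }
>>>>         return error;
>>>>     }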
>>>>
>>>>
>>>>>> From an architecture perspective it's not a good thing to call ofproto
>>>>>> code from the offload-provider.  This introduces lots of complications
>>>>>> and might cause lots of issues in the future.
>>>>>>
>>>>>> I'd say that design should look like this:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        dpif_recv() ->
>>>>>>          dpif_netlink_recv() ->
>>>>>>            netdev_offload_recv() ->
>>>>>>              netdev_offload_tc_recv() ->
>>>>>>                nl_sock_recv()
>>>>> In order to use the handler threads, I'm not sure whether we should add
>>>>> the psample socket fd to every handler's epoll_fd. If we do that, we
>>>>> need to consider how to differentiate whether an event comes from an ovs
>>>>> or a psample netlink socket. Maybe we should allocate a special port
>>>>> number for psample and assign it to event.data.u32, as sketched below.
>>>>> Anyway, those are details. If this is the right direction, I'll think
>>>>> about it.
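>>>>> Something like this, perhaps (illustrative only; the sentinel value
>>>>> is made up):
>>>>>
>>>>>     struct epoll_event event;
>>>>>
>>>>>     event.events = EPOLLIN;
>>>>>     /* Made-up sentinel so the handler can tell psample events apart
>>>>>      * from vport events. */
>>>>>     event.data.u32 = PSAMPLE_SPECIAL_PORT;
>>>>>     epoll_ctl(handler->epoll_fd, EPOLL_CTL_ADD, psample_fd, &event);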
>>>>>> The last three calls should be implemented.
>>>>>>
>>>>>> A better version of this would be to throw away the dpif part from the
>>>>>> above call chain and make it:
>>>>>>
>>>>>>      handler thread ->
>>>>>>        netdev_offload_recv() ->
>>>>>>          netdev_offload_tc_recv() ->
>>>>>>            nl_sock_recv()
>>>>> If we throw away the dpif part, maybe we have to write some duplicate
>>>>> epoll code just for psample. And we can't block in nl_sock_recv().
>>>>> Maybe we have to add the psample socket fd to every handler's epoll_fd,
>>>>> so we would have to change the dpif somehow.
>>>> There is no need to implement any epoll_fd logic for psample.
>>>> You may use only the handler with handler_id == 0.  Only this handler
>>>> will receive psample upcalls.  netdev_offload_recv_wait() should be
>>>> implemented similarly to how dpif_netlink_recv_wait() is implemented for
>>>> the Windows case, i.e. it will call nl_sock_wait(nl_sock, POLLIN) for
>>>> the psample netlink socket.  There is no need to block, and you never
>>>> block anywhere, because you're calling nl_sock_recv(..., wait = false),
>>>> which uses MSG_DONTWAIT while actually receiving a message.
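>>>> For illustration, the wait side could be as simple as this (a sketch;
>>>> function and variable names are not final):
>>>>
>>>>     static void
>>>>     netdev_offload_tc_recv_wait(void)
>>>>     {
>>>>         if (psample_sock) {
>>>>             /* Adds the psample fd to the current poll loop, like the
>>>>              * Windows case of dpif_netlink_recv_wait() does for its
>>>>              * sockets. */
>>>>             nl_sock_wait(psample_sock, POLLIN);
>>>>         }
>>>>     }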
>>> Hi Ilya,
>>>
>>> With this new design, the code will change greatly. I'm not unwilling to
>>> change it, but the effort is not small, so before I start I want to make
>>> sure that it is feasible.
>>>
>>> Yes, we won't block in nl_sock_recv(), but the handler thread will be
>>> blocked in poll_block(). If any of the vport netlink sockets is readable,
>>> the thread will be woken up by the kernel. If we don't use epoll_ctl to
>>> add the psample netlink socket to the handler's epoll fd, we can't receive
>>> psample packets promptly. That means we can only receive sampled packets
>>> when there is a miss upcall.
>>>
>>> I added some debug messages to the ovs epoll code and reached the above
>>> conclusion. But I'm not an expert on the epoll API, so I'm not sure
>>> whether I missed anything.
>> A handler thread wakes up on POLLIN on the epoll_fd itself, not on an event
>> on one of the file descriptors added to the epoll set.  The epoll_fd is
>> added to the thread's poll loop by
>>    poll_fd_wait(handler->epoll_fd, POLLIN);
>>
>> If you call nl_sock_wait() on the psample socket, this socket will be added
>> to the same poll loop with:
>>    nl_sock_wait(psample_sock, POLLIN) ->
>>      poll_fd_wait(psample_sock->fd, POLLIN);
>>
>> This way the psample socket becomes an *additional* wakeup source for the
>> thread that called nl_sock_wait().  So, this handler will be woken up when
>> POLLIN happens on the psample socket even if there are no miss upcalls.
>>
>> The whole epoll infrastructure is local to lib/dpif-netlink.c, and handler
>> threads know nothing about it.  poll_loop() knows nothing about this epoll
>> stuff either; it just adds the epoll_fd itself to the list of fds for the
>> usual poll() because of the poll_fd_wait() call.  The psample socket will
>> end up in the same usual poll().
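>>
>> Schematically, the handler then has two independent wakeup sources in the
>> same poll() (netdev_offload_* are the proposed names from above):
>>
>>     dpif_recv_wait(...);            /* poll_fd_wait(handler->epoll_fd)  */
>>     netdev_offload_recv_wait(...);  /* nl_sock_wait(psample_sock, ...)  */
>>     poll_block();                   /* Sleeps until either fd is ready. */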
> Hi Ilya,
> 
> The code according to your suggestion is ready. But during the internal
> code review, Eli Britstein pointed out that the flow API is netdev based,
> while psample/sFlow offload is not netdev based but dpif based. Maybe we
> should introduce a dpif-offload-provider and a dpif_offload_api, so that it
> is not tied to a specific netdev but to a dpif type.
> 
> Could I know your opinion about it?

This might be not a bad idea in general.  The problem is that if we introduce
this kind of API, we'll need to remove the usage of the netdev-offload API from
all the layers higher than the dpif implementation, including dpif.c, and make
all offloading go through the dpif-offload-provider.

In general, I think it might be good to have a separate dpif type, e.g.
dpif-offload, that uses netdev-offload APIs internally and provides the
offloading service.  In this case, dpif_operate(), depending on the offload
configuration, will try to install flows to dpif-offload first and fall back
to installing flows to the main dpif implementation if offloading failed or
was partial (i.e. classification offloading).  This way we can delete all the
offloading-related code from dpif-netdev and dpif-netlink and have a unified
offloading architecture.  It's not easy to implement, and there will be a lot
of issues that we'll need to figure out how to fix (e.g. dpif-netdev performs
upcalls by itself, so it would have to install the resulting flows by calling
dpif_flow_put()/dpif_operate() instead of installing them directly into its
local flow tables).  Anyway, this looks like a long run and will require a
rework of the whole offloading architecture, so it might be a separate thing
to work on.
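
Very roughly, the fallback part of that idea could look like this (a pure
sketch; 'offload_dpif' and 'main_dpif' are illustrative names, not existing
code):

    /* Try to install the flow via the offload dpif first and fall back
     * to the main dpif implementation if offloading fails. */
    error = dpif_flow_put(offload_dpif, flags, key, key_len,
                          mask, mask_len, actions, actions_len,
                          ufid, PMD_ID_NULL, stats);
    if (error) {
        /* Offload failed or was partial: install into the main dpif. */
        error = dpif_flow_put(main_dpif, flags, key, key_len,
                              mask, mask_len, actions, actions_len,
                              ufid, PMD_ID_NULL, stats);
    }

Thoughts?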

Best regards, Ilya Maximets.

