[ovs-dev] Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Eitan Eliahu eliahue at vmware.com
Fri Aug 8 20:13:05 UTC 2014



>Are you sure the Cancel Irp callback that you set actually does get called?
The process could call 

[QUOTE]We don't want to add per FILE_OBJECT buffer management.
[/QUOTE]
How would you handle this differently for dumps?
I mean, while you issue a DeviceIoControl / read for a dump, if another thread happens to issue a DeviceIoControl to read a dump, how would you distinguish the buffers? Or is it a case that can and will never happen?

For operations where the reply can be given and read all at once using a DeviceIoControl, you don't need a buffer associated with the FILE_OBJECT.

For multicast operations, how do you do the cleanup? I mean, I expect that multiple sockets can be registered for the same multicast group.
Also, is it possible in userspace to use a socket that is registered for a multicast group to do other kinds of operations, such as a "Flow/New" or a dump?

Also, for packet queuing to the userspace, how would you do the cleanup in case the process crashes?

[QUOTE]Regarding the flow dump, I am not opposed to using your method to compute the amount of memory, as long as the read process is done in parts (rather than allocating a huge buffer) and each user/kernel mode transaction is self-contained (which means the driver does not have to "remember" anything).[/QUOTE]
No huge allocation is done in userspace. It can only happen in the kernel: if the whole flow dump reply needs 100 MB, then 100 MB will be allocated in the kernel for the buffer, while the process reads it in chunks of, say, 8 KB.
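
As a minimal sketch, the chunked-read idea above boils down to the kernel producing the whole reply once and then serving it piece by piece; the names below (DumpReply, DumpReplyRead) are hypothetical and only model that logic in plain C:

/* Model of the chunked read: the whole dump reply is produced once into a
 * single buffer, and each read request copies out only the next chunk,
 * advancing an offset kept alongside the buffer. */
#include <stdint.h>
#include <string.h>

typedef struct DumpReply {
    uint8_t *data;    /* full reply, allocated once when the dump starts */
    size_t   size;    /* total size of the reply */
    size_t   offset;  /* how much the reader has consumed so far */
} DumpReply;

/* Copy up to 'outLen' bytes of the remaining reply into 'out'.
 * Returns the number of bytes copied; 0 means "no more data". */
static size_t
DumpReplyRead(DumpReply *reply, void *out, size_t outLen)
{
    size_t left = reply->size - reply->offset;
    size_t chunk = left < outLen ? left : outLen;

    memcpy(out, reply->data + reply->offset, chunk);
    reply->offset += chunk;
    return chunk;
}

Userspace would then keep issuing reads with, say, an 8 KB buffer until 0 bytes are returned.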

Sam
________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Friday, August 08, 2014 9:38 PM
To: Samuel Ghinet; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

We handle process termination and we cancel the pending IRPs. We don't want to add per FILE_OBJECT buffer management.

Regarding the flow dump, I am not opposed to using your method to compute the amount of memory, as long as the read process is done in parts (rather than allocating a huge buffer) and each user/kernel mode transaction is self-contained (which means the driver does not have to "remember" anything).

-----Original Message-----
From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
Sent: Friday, August 08, 2014 11:32 AM
To: Eitan Eliahu; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

[QUOTE]This means that the driver does not have to track per process resources so if the process is killed or crashed nothing needs to be done.
[/QUOTE]
[QUOTE]To maintain per process and per transaction resources or buffers requires a lot of bookkeeping as well as handling abnormal termination of the process.
[/QUOTE]
Eitan, this is very simple to handle: when the process crashes, its file HANDLE gets closed. When the file handle is closed, an IRP comes down with the corresponding FILE_OBJECT pointer.
Given that we keep resources associated with FILE_OBJECT pointers, when the crash happens we simply destroy the associated dump reply data. No additional effort.
You don't need to worry about the process at all!
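
A minimal WDM-style sketch of that cleanup path, assuming a hypothetical helper OvsFreeDumpStateForFile() that releases whatever dump buffer was tied to the FILE_OBJECT (the actual driver code will differ):

/* When the handle is closed (including when the process dies), the driver
 * receives IRP_MJ_CLEANUP and can free the per-FILE_OBJECT dump state. */
#include <ntddk.h>

VOID OvsFreeDumpStateForFile(PFILE_OBJECT fileObject); /* hypothetical helper */

NTSTATUS
OvsDeviceCleanup(PDEVICE_OBJECT deviceObject, PIRP irp)
{
    PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(irp);

    UNREFERENCED_PARAMETER(deviceObject);

    /* Drop any per-handle dump buffer; no process tracking is needed. */
    OvsFreeDumpStateForFile(irpSp->FileObject);

    irp->IoStatus.Status = STATUS_SUCCESS;
    irp->IoStatus.Information = 0;
    IoCompleteRequest(irp, IO_NO_INCREMENT);
    return STATUS_SUCCESS;
}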

[QUOTE]
This may work for DP and Port dumps (provided the buffer is big enough). But for flows the dump is done in iterations. This means that someone needs to keep track of the progress (maintain the current offset).[/QUOTE]
The problem in our implementation would not be "if the buffer is too small" but "if the buffer is way too big" (such that, say, a flow dump would require hundreds of megabytes or more).
In our implementation, we compute the total size of the buffer that the userspace is going to read from BEFORE allocating it.

Sam
________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Friday, August 08, 2014 9:23 PM
To: Samuel Ghinet; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

> Basically, this means that you don't need to provide an "offset" from userspace for a DeviceIoControl, unless you want the kernel to restrain itself from providing the whole dump reply at once.
This may work for DP and Port dumps (provided the buffer is big enough). But for flows the dump is done in iterations. This means that someone needs to keep track of the progress (maintain the current offset).
Eitan

-----Original Message-----
From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
Sent: Friday, August 08, 2014 11:16 AM
To: Eitan Eliahu; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

To put it another way regarding "The only "state" we use in the kernel for dumps is the buffer itself, with its offset. It is not a "dump" offset, but a buffer offset.":
I remember that the Linux OVS kernel code used some kind of "dump offset" (or, I don't know what to call it, "index-based" dumping).
I did not take that approach when implementing the Windows Netlink dump, and it worked correctly with the same userspace interface as on Linux.

Basically, this means that you don't need to provide an "offset" from userspace for a DeviceIoControl, unless you want the kernel to restrain itself from providing the whole dump reply at once.

Sam
________________________________________
From: Samuel Ghinet
Sent: Friday, August 08, 2014 9:11 PM
To: Eitan Eliahu; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Eitan,

[QUOTE]For example the offset for a port dump would be the last port which was returned by the previous nl_dump_next().[/QUOTE]
Our kernel implementation of the Netlink component does not maintain such state for dumps.
The only "state" we use in the kernel for dumps is the buffer itself, with its offset. It is not a "dump" offset, but a buffer offset.

Also, the offsets for buffers are used for normal (unicast) transactions as well (flow new, flow set, etc.). So basically, this means that if you want to read the reply of a "Flow/New" and provide a small buffer, you can read incrementally. However, the Linux userspace, I think, simply provides a buffer it considers large enough to read the whole reply in this particular case.
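
To illustrate the incremental read with a small buffer, here is a sketch using plain ReadFile(); the device path and the convention that a 0-byte read means "reply fully consumed" are assumptions, not the actual interface:

/* Read a reply in small chunks; the kernel-side buffer offset decides where
 * each read resumes. */
#include <windows.h>
#include <stdio.h>

int
main(void)
{
    /* Hypothetical device name, for illustration only. */
    HANDLE h = CreateFileA("\\\\.\\OvsDevice", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        return 1;
    }

    char chunk[8 * 1024];
    DWORD got = 0;

    /* Keep reading from where the kernel buffer left off until it is empty. */
    while (ReadFile(h, chunk, sizeof chunk, &got, NULL) && got > 0) {
        printf("read %lu bytes of the reply\n", (unsigned long) got);
        /* ... parse the Netlink messages contained in this chunk ... */
    }

    CloseHandle(h);
    return 0;
}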

Sam
________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Friday, August 08, 2014 8:34 PM
To: Samuel Ghinet; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Sam, we want to avoid any state maintenance in the driver. The "offset" here is not a buffer offset, but rather an "offset" related to the specific dump command (DP, Port or Flow). For example the offset for a port dump would be the last port which was returned by the previous nl_dump_next().
The IOCTL command sent from dump_next should be self-contained (which means it specifies the "offset" and the output buffer length).
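
A sketch of what such a self-contained dump command could look like; the IOCTL code, struct layout and helper name are hypothetical and only illustrate the shape of the request:

/* Every call carries the point to resume from (e.g. the last port returned),
 * so the driver keeps no dump state between calls. */
#include <windows.h>
#include <winioctl.h>
#include <stdint.h>

#define OVS_IOCTL_DUMP_NEXT \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS)

typedef struct OvsDumpRequest {
    uint32_t family;    /* datapath, vport or flow dump */
    uint64_t resumeId;  /* the "offset": e.g. the last port number returned */
} OvsDumpRequest;

/* One dump step: pass the resume point down, get the next chunk back. */
static BOOL
OvsDumpNext(HANDLE dev, OvsDumpRequest *req, void *out, DWORD outLen,
            DWORD *bytesReturned)
{
    return DeviceIoControl(dev, OVS_IOCTL_DUMP_NEXT,
                           req, sizeof *req,
                           out, outLen,
                           bytesReturned, NULL);
}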
Eitan

-----Original Message-----
From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
Sent: Friday, August 08, 2014 10:25 AM
To: Eitan Eliahu; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Hello Eitan,

[QUOTE]> You cannot do this dump() operation in a single DeviceIoControl request.
Certainly not. I meant that each nl_dump_next() uses DeviceIoControl to retrieve the next chunk of data. We need to pass down to the driver an offset which indicates where the current chunk of data starts.

Did I answer your question?
[/QUOTE]
Yes, it answers the question.

But still, your aim:
"The Driver should not have to maintain a state or resources for transaction or dumps"
still cannot apply to dumps: you have the buffer, which is state that must be maintained across IOCTL requests.
And I don't see how keeping the offset in userspace will help you in this case.
If you look in our project, in WinlDevice.c and BufferControl.c, you will see that there is no great difficulty in storing the buffer offset in the kernel.
Is there any good reason why the offset should be stored in userspace, and not in the kernel, for dumps?

Sam
________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Friday, August 08, 2014 8:13 PM
To: Samuel Ghinet; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Sam,
Let's take the whole event mechanism (as currently implemented) out of this context, as it is not Netlink-specific. I will try to find your original email about events.

> You cannot do this dump() operation in a single DeviceIoControl request.
Certainly not. I meant that each nl_dump_next() uses DeviceIoControl to retrieve the next chunk of data. We need to pass down to the driver an offset which indicates where the current chunk of data starts.

Did I answer your question?
Thanks,
Eitan


-----Original Message-----
From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
Sent: Friday, August 08, 2014 10:00 AM
To: Eitan Eliahu; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Hello Eitan,

This discussion is getting more and more intricate, with quotes of quotes of quotes :)

[QUOTE]Currently there are two IOCTLs implemented in the driver. One just reads an event from the event queue and is called synchronously. The other one is used just for the purpose of signaling, and it is always pended in the driver. Once the pended IRP is completed, the event in the overlapped structure is signaled and user mode reads the event queue synchronously (through a call to nl_sock_recv()).
nl_sock_recv() always returns immediately (it has a wait parameter, but it is set to false).
[/QUOTE]
[QUOTE]>How exactly are events queued by the kernel associated with userspace? I mean, how do you register a "NIC connected" event so that when an event happens, you know you need to update userspace data for a NIC, not do something else? Would there be IDs stored in the OvsEvent structs that would specify what kind of events they are? Would we also need context data associated with these events?
This should be no different than the current implementation.
[/QUOTE]
Sorry, my issue is that I don't actually understand your current "packet to userspace" mechanism with OvsEvent very well.
I sent an email one or two weeks ago asking for some help on this matter, but received no answer.

[QUOTE]Yes, I looked at this issue yesterday with help from Ben. It seems that the dump function allocates an initial 1024-byte buffer. In turn, nl_sock_receive__() is called with another large buffer allocated on the stack. If the data returned from the driver exceeds the initially allocated buffer, the one on the stack will be copied to a new, larger buffer in the ofpbuf.
[/QUOTE]
I'm not sure I understand what you mean, nor that you understood my point.
I know that in the Netlink protocol implementation in userspace, you hold an ofpbuf which, when writing/creating the Netlink message, automatically grows (reallocates) when it needs more space.
I am not talking about writing a message, I am talking about reading.
Here, the "sequential reading" happens via dump_start(), dump_next() and dump_finished(), where dump_next() is used to recv() more from the buffer.
Look at how nl_dump_start() is implemented in netlink-socket.c: it's a mere send, while nl_dump_next() calls nl_dump_refill() to recv() more from the kernel buffer.
You cannot do this dump() operation in a single DeviceIoControl request.
The Linux Netlink socket recv() and the Windows file function ReadFile() both tell you only how much they succeeded in reading: you say "read at most X" and they tell you "I read Y <= X".
We have a situation where we say "read at most X" but there is a buffer in the kernel of size Z > X. It will tell you "OK, I read X", but the dump will not have been read completely! You'll have to continue reading the file / receiving from the socket until it tells you "no more data!".
This is something that CANNOT be implemented as a single DeviceIoControl.

[QUOTE]We should probably modify the implementation of nl_dump_recv(), so it will issue a DeviceIoControl() call. In the input buffer we need to pass down information about the "offset" of the current transaction, so the driver will know from which position it needs to start the dump.
[/QUOTE]
I see absolutely no reason why we should implement the "read offset" of the buffer in userspace. It is better kept opaque to userspace. Like a stream: you tell it "read!" and it reads from where it left off. This is how you normally read from a file, and this is how you normally receive from a socket.
And this "offset" implemented in userspace will still not save you from keeping state, because an additional read from the same buffer still requires that a "state" be preserved in the kernel across IOCTL calls: the buffer itself.

[QUOTE]
My concern would be that we should not break the Netlink protocol as it was selected in order to maintain user/kernel mode interoperability.
[/QUOTE]
It would not break the Netlink protocol: we would only avoid Netlink in cases where the Linux OVS doesn't use Netlink either (Linux OVS uses some ioctls to set options on sockets, or to get some configs on netdevs, etc.).

Sam
________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Friday, August 08, 2014 7:24 PM
To: Samuel Ghinet; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Hi Sam,
Please find inline.

>Do you mean we will no longer use nl_sock_transact_multiple in userspace for these DPIF transactions?
No, we will use nl_sock_transact_multiple, but nl_sock_transact_multiple will be implemented through a DeviceIoControl() system call rather than a series of WriteFile()/ReadFile() pairs.
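
As a rough sketch of the idea (OVS_IOCTL_TRANSACT is a placeholder, not an existing code): a transaction becomes one DeviceIoControl round trip, with the serialized Netlink request in the input buffer and the reply in the output buffer of the same call:

#include <windows.h>
#include <winioctl.h>

#define OVS_IOCTL_TRANSACT \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x802, METHOD_BUFFERED, FILE_ANY_ACCESS)

/* One system call replaces the WriteFile() + ReadFile() sequence. */
static BOOL
OvsTransact(HANDLE dev, void *request, DWORD requestLen,
            void *reply, DWORD replyLen, DWORD *replyBytes)
{
    return DeviceIoControl(dev, OVS_IOCTL_TRANSACT,
                           request, requestLen,
                           reply, replyLen,
                           replyBytes, NULL);
}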

>[QUOTE]>You mean, whenever, say, a Flow dump request is issued, to give back all flows in one reply?
>Not necessarily. I meant that the driver does not have to maintain the state of the dump command. Each dump command sent down to the driver would be self-contained.[/QUOTE]
>We currently have this in our implementation. The only thing 'left' would be the fact that we provide the whole output buffer for the dump at once. Userspace can read sequentially from it. Unless there is a reason to write sequentially from the kernel to userspace and wait for userspace to read, I think the way we have this one is OK.
Sounds good, we can leverage your implementation.

>[QUOTE]Yes, these are OVS events that are placed in a custom queue.
>There is a single Operating System event associated with the global socket which collects all OVS events.
>It will be triggered through the completion of a pending I/O request in the driver.[/QUOTE]
>I used to be a bit confused by your implementation in OvsEvent and OvsUser. Perhaps this discussion will clarify things a bit more. :) OK, so we'll hold OVERLAPPED structs in the kernel, as events. What kind of IRP requests would be returned as "pending" in the kernel? Requests coming as nl_sock_recv() on the multicast groups?
>Will there be multiple multicast groups used? Or would all multicast operations queue events on the same event queue, where all the events are read from the same part of the code in userspace?
Currently there are two IOCTLs implemented in the driver. One just reads an event from the event queue and is called synchronously. The other one is used just for the purpose of signaling, and it is always pended in the driver. Once the pended IRP is completed, the event in the overlapped structure is signaled and user mode reads the event queue synchronously (through a call to nl_sock_recv()).
nl_sock_recv() always returns immediately (it has a wait parameter, but it is set to false).
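
A user-mode sketch of that two-IOCTL scheme; both IOCTL codes are placeholders, and the device handle is assumed to have been opened with FILE_FLAG_OVERLAPPED:

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

#define OVS_IOCTL_EVENT_WAIT \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x803, METHOD_BUFFERED, FILE_ANY_ACCESS)
#define OVS_IOCTL_EVENT_POLL \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x804, METHOD_BUFFERED, FILE_ANY_ACCESS)

static void
WaitAndDrainEvents(HANDLE dev)
{
    OVERLAPPED waitOv = { 0 }, pollOv = { 0 };
    DWORD bytes = 0;
    char events[4096];

    waitOv.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);
    pollOv.hEvent = CreateEventA(NULL, TRUE, FALSE, NULL);

    /* This request is pended by the driver until an OVS event is queued;
     * its only purpose is signaling. */
    if (!DeviceIoControl(dev, OVS_IOCTL_EVENT_WAIT, NULL, 0, NULL, 0,
                         &bytes, &waitOv)
        && GetLastError() == ERROR_IO_PENDING) {
        WaitForSingleObject(waitOv.hEvent, INFINITE);
    }

    /* Now drain the queue with the second IOCTL, which the driver
     * completes immediately. */
    if (DeviceIoControl(dev, OVS_IOCTL_EVENT_POLL, NULL, 0,
                        events, sizeof events, &bytes, &pollOv)
        || (GetLastError() == ERROR_IO_PENDING
            && GetOverlappedResult(dev, &pollOv, &bytes, TRUE))) {
        printf("drained %lu bytes of queued OVS events\n",
               (unsigned long) bytes);
    }

    CloseHandle(waitOv.hEvent);
    CloseHandle(pollOv.hEvent);
}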

>How exactly are events queued by the kernel associated with userspace? I mean, how do you register a "NIC connected" event so that when an event happens, you know you need to update userspace data for a NIC, not do something else? Would there be IDs stored in the OvsEvent structs that would specify what kind of events they are? Would we also need context data associated with these events?
This should be no different than the current implementation. These events are generated when NDIS calls the switch extension callbacks on switch port creation/deletion, link up/down, etc.

>[QUOTE]>However, I think we need to take into account the situation where userspace might provide a smaller buffer than the total to be read. Also, I think the "dump" mechanism requires it.
>I want to assume that each transaction is self-contained, which means that the driver should not maintain a state for the transaction. Since we will be using an IOCTL for that transaction, the user-mode buffer length will be specified in the command itself.
>All Write/Read dump pairs are replaced with a single IOCTL call.[/QUOTE]
>That still did not answer my question :) You mean to use a very large read buffer, so that you would be able to read everything in one single operation? I am more concerned here about flow dumps, because you may not know whether you need a 1024-byte buffer, a 10240-byte buffer, a 102400-byte buffer, etc.
Yes, I looked at this issue yesterday with help from Ben. It seems that the dump function allocates an initial 1024-byte buffer. In turn, nl_sock_receive__() is called with another large buffer allocated on the stack. If the data returned from the driver exceeds the initially allocated buffer, the one on the stack will be copied to a new, larger buffer in the ofpbuf.


>So I do not see how a DeviceIoControl operation could do both the 'write' and the 'read' part for the dump.
We should probably modify the implementation of nl_dump_recv(), so it will issue a DeviceIoControl() call. In the input buffer we need to pass down information about the "offset" of the current transaction, so the driver will know from which position it needs to start the dump.

>If you pass to the DeviceIoControl a buffer length = 8000, and the flow dump reply buffer is 32000 bytes, you need to do additional reads AND maintain state in the kernel (e.g. an offset in the kernel read buffer).
I would like to hold the "offset" in user mode (probably by adding a field to the nl_dump structure for WIN32).

>[QUOTE]o) I believe we shouldn't use the Netlink overhead (nlmsghdr, genlmsghdr, attributes) when not needed (say, when registering a KEVENT notification), and, if we choose not to always use the Netlink protocol, we may need a way to differentiate between Netlink and non-Netlink requests.
>Possible, as a phase for optimization[/QUOTE]
>Not necessarily: if we can make a clear separation in code between Netlink and non-Netlink kernel/user mode communication, not using Netlink where we don't need to might save us some development and maintainability effort, both in the kernel and in userspace. Because otherwise we'd need to turn non-Netlink messages of the (Windows) userspace code into Netlink messages.
My concern would be that we should not break the Netlink protocol as it was selected in order to maintain user/kernel mode interoperability.

Eitan



