[ovs-dev] Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Eitan Eliahu eliahue at vmware.com
Fri Aug 8 17:13:55 UTC 2014


Sam,
Let's take the whole event mechanism (as currently implemented) out of this context, as it is not Netlink specific. I will try to find your original email about Events.

> You cannot do this dump() operation in a single DeviceIoControl request.
Certainly not. I meant that each nl_dump_next call uses DeviceIoControl to retrieve the next chunk of data. We need to pass down to the driver an offset which indicates where the current chunk of data starts.
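A minimal sketch of what that could look like from user mode, assuming a hypothetical OVS_IOCTL_DUMP_NEXT control code and input structure (neither exists in the current driver); the only point is that the resume offset travels down in the DeviceIoControl input buffer:

/* Rough sketch, not actual driver code: OVS_IOCTL_DUMP_NEXT and
 * struct ovs_dump_in are made-up names used only for illustration. */
#include <windows.h>
#include <winioctl.h>
#include <stdint.h>

#define OVS_IOCTL_DUMP_NEXT \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS)

struct ovs_dump_in {
    uint32_t offset;            /* where the previous chunk ended */
};

static BOOL
dump_next_chunk(HANDLE dev, uint32_t *offset,
                char *out, DWORD out_len, DWORD *bytes)
{
    struct ovs_dump_in in = { *offset };

    if (!DeviceIoControl(dev, OVS_IOCTL_DUMP_NEXT,
                         &in, sizeof in,     /* tells the driver where to resume */
                         out, out_len,       /* receives the next chunk */
                         bytes, NULL)) {
        return FALSE;
    }
    *offset += *bytes;                       /* advance for the next call */
    return TRUE;
}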

Did I answer your question?
Thanks,
Eitan


-----Original Message-----
From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com] 
Sent: Friday, August 08, 2014 10:00 AM
To: Eitan Eliahu; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Hello Eitan,

This discussion is getting more and more intricate, with quotes of quotes of quotes :)

[QUOTE]Currently there are two IOCTLs implemented in the driver. One just reads an event from the event queue and is called synchronously. The other is used just for the purpose of signaling and is always pended in the driver. Once the pended IRP is completed, the event in the overlapped structure is signaled and user mode reads the event queue synchronously (through a call to nl_sock_recv()).
nl_sock_recv() always returns immediately (it has a wait parameter, but it is set to false).
[/QUOTE]
[QUOTE]>How exactly are events queued by the kernel associated with the userspace? I mean, how do you register a "nic connected" event so that when an event happens, you know you need to update userspace data for a nic, not do something else. Would there be IDs stored in the OvsEvent structs that would specify what kind of events they are? Would we also need context data associated with these events?
This should be no different than the current implementation.
[/QUOTE]
Sorry, but I don't actually understand your current "packet to userspace" mechanism with OvsEvent very well.
I sent an email a week or two ago asking for some help on this matter - I received no help / no answer.

[QUOTE]Yes, I looked into this issue yesterday with Ben's help. It seems that the dump function allocates an initial 1024-byte buffer. In turn, nl_sock_recv__() is called with another large buffer allocated on the stack. If the data returned from the driver exceeds the initially allocated buffer, the one on the stack is copied into a new, larger buffer in the ofpbuf.
[/QUOTE]
I'm not sure I understand what you mean, nor that you understood my point.
I know that in the netlink protocol implementation in userspace you hold an ofpbuf which, when writing / creating the netlink message, automatically grows (reallocates) when it sees that it needs more space.
I am not talking about writing a message; I am talking about reading.
Here, the "sequential reading" happens via dump_start(), dump_next() and dump_finished(), where dump_next() is used to recv() more from the buffer.
Look at how nl_dump_start() is implemented in netlink-socket.c: it's a mere send, while nl_dump_next() calls nl_dump_refill() to recv() more from the kernel buffer.
You cannot do this dump() operation in a single DeviceIoControl request.
The Linux netlink socket recv() and the Windows file function ReadFile() both tell you only how much they succeeded in reading: you say "read at most X" and they tell you back "I read Y <= X".
We have a situation where we say "read at most X" but the kernel holds a buffer of possible size Z > X. It will tell you back "OK, I read that max X", but the dump will not be finished reading! You'll have to continue reading the file / receiving from the socket until it gives you back "no more data!".
This is something that CANNOT be implemented as a single DeviceIoControl.
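To make the stream-style reading concrete, a minimal sketch of the loop described above; the ERROR_NO_MORE_ITEMS terminator is only an assumption for illustration, not something the driver defines today:

/* Keep reading until the kernel reports there is no more dump data. */
#include <windows.h>

static void
drain_dump(HANDLE dev, void (*consume)(const char *, DWORD))
{
    char chunk[8192];
    DWORD bytes;

    for (;;) {
        if (!ReadFile(dev, chunk, sizeof chunk, &bytes, NULL)) {
            if (GetLastError() == ERROR_NO_MORE_ITEMS) {
                break;                   /* "no more data": dump complete */
            }
            break;                       /* real error: handle or propagate */
        }
        consume(chunk, bytes);           /* parse this chunk of Netlink replies */
    }
}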

[QUOTE]We should probably modify the implementation of nl_dump_recv(), so it will issue a DeviceIoControl() call. In the input buffer we need to pass down information about the "offset" of the current transaction, so the driver will know from which position it needs to start the dump.
[/QUOTE]
I see absolutely no reason why we should implement the "read offset" of the buffer in userspace. This had better be opaque to userspace. Like a stream: you tell it "read!" and it reads from where it left off. This is how you normally read from a file, and this is how you normally receive from a socket.
And this "offset" implemented in userspace will still not save you from keeping state, because an additional read from the same buffer still requires state to be preserved in the kernel, across IOCTL calls: the buffer itself.

[QUOTE]
My concern would be that we should not break the Netlink protocol as it was selected in order to maintain user/kernel mode interoperability.
[/QUOTE]
Only if we do not break the netlink protocol: if we figure out a way where we do not use netlink, but only in cases where the Linux OVS doesn't use netlink either (Linux OVS uses some ioctls to set options on sockets, or gets some configs on netdevs, etc.).

Sam
________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Friday, August 08, 2014 7:24 PM
To: Samuel Ghinet; Alin Serdean; dev at openvswitch.org; Rajiv Krishnamurthy; Ben Pfaff; Kaushik Guha; Ben Pfaff; Justin Pettit; Nithin Raju; Ankur Sharma; Linda Sun; Keith Amidon
Subject: RE: Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Hi Sam,
Please find inline.

>Do you mean we will no longer use nl_sock_transact_multiple in userspace for these DPIF transactions?
No, we will use nl_sock_transact_multiple(), but nl_sock_transact_multiple() will be implemented through the DeviceIoControl() system call rather than a series of WriteFile()/ReadFile() pairs.
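A minimal sketch of that idea, using a made-up OVS_IOCTL_TRANSACT control code: one DeviceIoControl carries the serialized request(s) down and brings the replies back, replacing a WriteFile()/ReadFile() pair:

/* Illustrative only: OVS_IOCTL_TRANSACT is not an existing control code. */
#include <windows.h>
#include <winioctl.h>

#define OVS_IOCTL_TRANSACT \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x802, METHOD_BUFFERED, FILE_ANY_ACCESS)

static BOOL
transact(HANDLE dev, const void *req, DWORD req_len,
         void *reply, DWORD reply_len, DWORD *reply_bytes)
{
    /* Input buffer: serialized Netlink request(s).
     * Output buffer: serialized Netlink replies. */
    return DeviceIoControl(dev, OVS_IOCTL_TRANSACT,
                           (LPVOID) req, req_len,
                           reply, reply_len,
                           reply_bytes, NULL);
}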

>[QUOTE]>You mean, whenever, say, a Flow dump request is issued, to give back all flows in one reply?
>Not necessarily. I meant that the driver does not have to maintain the state of the dump command.
>Each dump command sent down to the driver would be self-contained.[/QUOTE]
>We currently have this in our implementation. The only thing 'left' would be the fact that we provide all of the output buffer for the dump at once. The userspace can read sequentially from it. Unless there is a reason to write sequentially from the kernel to the userspace, and wait for the userspace to read, I think the way we have this one is OK.
Sounds good. We can leverage your implementation.

>[QUOTE]Yes, these are OVS events that are placed in a custom queue.
>There is a single Operating System event associated with the global socket which collects all OVS events.
>It will be triggered through a completion of a pending I/O request in the driver.[/QUOTE] I used to be a bit confused by your implementation in OvsEvent and OvsUser. Perhaps this discussion will clarify things a bit more. :) OK, so we'll hold OVERLAPPED structs in the kernel, as events. What kind of IRP requests would be returned as "pending" in the kernel? Requests coming as "nl_sock_recv()" on the multicast groups?
>Will there be multiple multicast groups used? Or would all multicast operations queue events on the same event queue, with all the events read from the same part of the code in userspace?
Currently there are two IOCTLs implemented in the driver. One just reads an event from the event queue and is called synchronously. The other is used just for the purpose of signaling and is always pended in the driver. Once the pended IRP is completed, the event in the overlapped structure is signaled and user mode reads the event queue synchronously (through a call to nl_sock_recv()).
nl_sock_recv() always returns immediately (it has a wait parameter, but it is set to false).
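From the user-mode side, the two IOCTLs would be used roughly along these lines; a hedged sketch in which the control-code names and the assumption that the device handle is opened with FILE_FLAG_OVERLAPPED are illustrative, not the driver's actual definitions:

/* Sketch of the two-IOCTL scheme described above.  OVS_IOCTL_EVENT_WAIT and
 * OVS_IOCTL_EVENT_READ are illustrative names only. */
#include <windows.h>
#include <winioctl.h>

#define OVS_IOCTL_EVENT_WAIT \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x803, METHOD_BUFFERED, FILE_ANY_ACCESS)
#define OVS_IOCTL_EVENT_READ \
    CTL_CODE(FILE_DEVICE_NETWORK, 0x804, METHOD_BUFFERED, FILE_ANY_ACCESS)

static void
wait_and_read_events(HANDLE dev /* opened with FILE_FLAG_OVERLAPPED */,
                     char *buf, DWORD buf_len)
{
    OVERLAPPED ov = { 0 };
    DWORD bytes;

    ov.hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);

    /* This request stays pended in the driver; DeviceIoControl returns FALSE
     * with ERROR_IO_PENDING, and ov.hEvent is signaled on completion. */
    DeviceIoControl(dev, OVS_IOCTL_EVENT_WAIT, NULL, 0, NULL, 0, &bytes, &ov);

    WaitForSingleObject(ov.hEvent, INFINITE);

    /* Completion means something is queued: drain it synchronously,
     * which is what the nl_sock_recv() path would do. */
    DeviceIoControl(dev, OVS_IOCTL_EVENT_READ, NULL, 0,
                    buf, buf_len, &bytes, NULL);

    CloseHandle(ov.hEvent);
}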

>How exactly are events queued by the kernel associated with the userspace? I mean, how do you register a "nic connected" event so that when an event happens, you know you need to update userspace data for a nic, not do something else. Would there be IDs stored in the OvsEvent structs that would specify what kind of events they are? Would we also need context data associated with these events?
This should be no different from the current implementation. These events are generated when NDIS calls the switch extension callbacks on switch port creation/deletion/link up/down, etc.

>[QUOTE]>However, I think we need to take into account the situation where the userspace might be providing a smaller buffer than the total to read. Also, I think the "dump" mechanism requires it.
>I (want to) assume that each transaction is self-contained, which means that the driver should not maintain the state of the transaction. Since we will be using an IOCTL for that transaction, the user-mode buffer length will be specified in the command itself.
>All Write/Read dump pairs are replaced with a single IOCTL call.[/QUOTE] That still did not answer my question :) You mean to use a very large read buffer, so that you would be able to read everything in one single operation? I am more concerned here about flow dumps, because you may not know whether you need a 1024-byte buffer, a 10240-byte buffer, a 102400-byte buffer, etc.
Yes, I looked into this issue yesterday with Ben's help. It seems that the dump function allocates an initial 1024-byte buffer. In turn, nl_sock_recv__() is called with another large buffer allocated on the stack. If the data returned from the driver exceeds the initially allocated buffer, the one on the stack is copied into a new, larger buffer in the ofpbuf.
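That receive pattern could look roughly like the snippet below; a simplified sketch based on the description above, not the actual netlink-socket.c code, with the header path assumed from the usual OVS tree layout:

/* Simplified sketch: receive into a large stack buffer, then copy into the
 * caller's ofpbuf, growing it if the reply exceeds the initial 1024 bytes. */
#include <string.h>
#include "ofpbuf.h"

static void
copy_reply_into_ofpbuf(struct ofpbuf *buf /* preallocated with 1024 bytes */,
                       const char *stack_buf, size_t len)
{
    ofpbuf_clear(buf);
    if (len > ofpbuf_tailroom(buf)) {
        ofpbuf_prealloc_tailroom(buf, len);   /* grow to fit the whole reply */
    }
    memcpy(ofpbuf_put_uninit(buf, len), stack_buf, len);
}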


>So I do not see how a DeviceIoControl operation could do both the 'write' and the 'read' part for the dump.
We should probably modify the implementation of nl_dump_recv(), so it will issue a DeviceIoControl() call. In the input buffer we need to pass down information about the "offset" of the current transaction, so the driver will know from which position it needs to start the dump.

>If you pass to DeviceIoControl a buffer length of 8000, and the flow dump reply buffer is 32000 bytes, you need to do additional reads AND maintain state in the kernel (e.g. an offset into the kernel read buffer).
I would like to hold the "offset" in user mode (probably by adding a field to the nl_dump structure for WIN32).
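Something along these lines, as a hypothetical sketch; the field name is made up and the surrounding members only loosely mirror the real struct nl_dump:

/* Hypothetical sketch only; not the actual struct nl_dump definition. */
#include <stdint.h>

struct nl_sock;

struct nl_dump {
    struct nl_sock *sock;       /* socket (or device-handle wrapper) used for the dump */
    int status;                 /* 0 while in progress, otherwise an error code */
    /* ... other existing members ... */
#ifdef _WIN32
    uint32_t win_dump_offset;   /* position at which the driver should resume the dump */
#endif
};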

> [QUOTE]o) I believe we shouldn't use the netlink overhead (nlmsghdr, genlmsghdr, attributes) when not needed (say, when registering a KEVENT notification), and, if we choose not to use the netlink protocol everywhere, we may need a way to differentiate between netlink and non-netlink requests.
>Possible, as a phase for optimization[/QUOTE] Not necessarily: if we can make a clear separation in code between netlink and non-netlink km-um, not using netlink where we don't need to might save us some development & maintainability effort - both in the kernel and in userspace. Because otherwise we'd need to turn the non-netlink messages of the (Windows) userspace code into netlink messages.
My concern would be that we should not break the Netlink protocol as it was selected in order to maintain user/kernel mode interoperability.

Eitan



