[ovs-dev] Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Eitan Eliahu eliahue at vmware.com
Fri Aug 8 16:24:40 UTC 2014

Hi Sam,
Please find inline.

>Do you mean we will no longer use nl_sock_transact_multiple in userspace for these DPIF transactions?
No, we will use nl_sock_transact_multiple, but nl_sock_transact_multiple will be implemented through the call of DeviceIOControl() system call rather than se series of WriteFile()/ReadFile() pairs.

>[QUOTE]>You mean, whenever, say, a Flow dump request is issued, in one reply to give back all flows?>
>Not necessarily. I meant that the driver does not have to maintain the state of the dump command.
>Each dump command sent down to the driver would be self-contained. [/QUOTE] We currently have this in our implementation. The only thing 'left' would be the >fact that we provide all the output buffer for dump at once. The userspace can read sequentially from it. Unless there is a reason to write sequentially from the >kernel to the userspace, and wait for the userspace to read, I think that how we have this one is ok.
Sound good, We can leverage your implementation.

>[QUOTE]Yes, these are OVS events that are placed in a custom queue.
>There is a single Operating System event associated with the global socket which collects all OVS events.
>It will be triggered through a completion of a pending I/O request in the driver.[/QUOTE] I used to be a bit confused of your implementation in OvsEvent and >OvsUser. Perhaps this discussion would clarify a bit more things. :) Ok, so we'll hold OVERLAPPED structs in the kernel, as events. What kind of IRP requests would be >returned as "pending" in the kernel? Requests coming as "nl_sock_recv()" on the multicast groups?
>Will there be multiple multicast groups used? or all multicast operations would queue events on the same event queue, where all the events are read from the same >part of code in userspace?
Currently there are two IOCTLs implanted in the driver. One which just read an event from the event queue and is called synchronously. The other one which is used just for the purpose of signaling and it is always pended in the driver. Once the pended IRP is completed, the event in the overlapped structure is signaled and the user mode will read the event queue synchronously (through the call of nl_sock_recv(). 
nl_sock_recv() always returns immediately (it has a wait parameter but it is set to false).

>How exactly are events queued by the kernel associated with the userspace? I mean, how do you register a "nic connected" event so that when an event happens, >you know you need to update userspace data for a nic, not do something else. Would there be IDs stored in the OvsEvent structs that would specify what kind of >events they are? Would we also need context data associated with these events?
This should be no different than the current implementation. These events are generated when NDIS calls the switch extension callback on switch port creation/deletion/link up/down etc..

>[QUOTE]>However, I think we need to take into account the situation where the userspace might be providing a smaller buffer than it is the total to read. Also, I >think the "dump" mechanism requires it.
>I (want) to assume that each transaction is self-contained which means that the driver should not maintain a state of the transaction. Since, we will be using an IOCTL for that transaction the user mode buffer length will be specified in the command itself.
>All Write/Read dump pairs are replaced with a single IOCTL call.[/QUOTE] That still did not answer my question :) You mean to use a very large read buffer, so that >you would be able to read all in one single operation? I am more concerned here about flow dumps, because you may not know whether you need an 1024 bytes >buffer or an 10240 byes buffer, or an 102400 bytes buffer, or etc.
Yes, I looked on this issue yesterday with help of Ben. It seems that the Dump function allocate an initial 1024 byte buffer. In turn nl_sock_receive__() is called with another large buffer allocated on the stuck. If the data returned from the driver exceeds the initial allocated buffer, the one on the stuck will be copied to a new increased sized buffer in the ofbuf.

>So I do not see how a DeviceIoControl operation could do both the 'write' and the 'read' part for the dump.
We should probably modify the implementation of nl_dump_recv() , so it will issue DeviceIOControl() call. In input buffer we need to pass down information about the "offset" of the current transaction so the driver will know from which position it needs to start the dump. 

>If you pass to the DeviceIoControl a buffer length = 8000, and the flow dump reply buffer is 32000 bytes, you need to do additional reads AND maintain state in the >kernel (e.g. offset in the kernel read buffer).
I would like to hold the "offset" in user mode (probably add a field in nl_dump structure for WIN32) 

> [QUOTE]o) I believe we shouldn't use the netlink overhead (nlmsghdr, genlmsghdr, attributes) when not needed (say, when registering a KEVENT notification) , >and, if w>e choose not to use netlink protocol always, we may need a way to differentiate between netlink and non-netlink requests.
>Possible, as phase for optimization[/QUOTE] Not necessarily: if we can make a clear separation in code between netlink and non-netlink km-um, not using netlink >where we don't need to might save us some development & maintainability effort - both in kernel and in userspace. Because otherwise we'd need to turn non->netlink messages of (windows) userspace code into netlink messages.
My concern would be that we should not break the Netlink protocol as it was selected in order to maintain user/kernel mode interoperability.


More information about the dev mailing list