[ovs-dev] Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Eitan Eliahu eliahue at vmware.com
Thu Aug 7 16:35:09 UTC 2014

Hi Sam,
Here are some clarifications:

>o) "Transaction based DPIF primitive": does this mean that we do Reads and Writes here?
Transaction-based DPIF primitives are mapped into synchronous device I/O control (IOCTL) system calls.
The NL reply would be returned in the output buffer of the IOCTL.

>You mean, whenever, say, a Flow dump request is issued, in one reply to give back all flows?
Not necessarily. I meant that the driver does not have to maintain the state of the dump command.
Each dump command sent down to the driver would be self-contained. 
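
One way to picture a self-contained dump command is the sketch below. It is only an illustration under assumptions: the cursor field, the reply header, and the in-memory flow table are hypothetical and are not taken from the driver. The point it shows is that the caller carries the resume position in every request, so the driver side keeps nothing between calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flow record and table, standing in for the driver's flow table. */
struct flow_rec { uint32_t id; };
static const struct flow_rec flow_table[] = { {10}, {11}, {12}, {13}, {14} };
#define N_FLOWS (sizeof flow_table / sizeof flow_table[0])

/* Hypothetical dump request: the caller says where to resume. */
struct dump_req { uint32_t cursor; };

/* Hypothetical reply header, followed by n_recs records in the output buffer. */
struct dump_reply_hdr {
    uint32_t n_recs;       /* records returned by this call */
    uint32_t next_cursor;  /* value to pass in the next request */
    uint32_t done;         /* nonzero once the table is exhausted */
};

/* Model of one dump call. No state survives between calls: everything needed
 * is in 'req'. Returns bytes written, or 0 if 'out' cannot hold the header. */
static size_t
dump_once(const struct dump_req *req, void *out, size_t out_cap)
{
    struct dump_reply_hdr hdr;
    size_t start, room, n;

    assert(req != NULL && out != NULL);
    if (out_cap < sizeof hdr) {
        return 0;
    }
    start = req->cursor < N_FLOWS ? req->cursor : N_FLOWS;
    room = (out_cap - sizeof hdr) / sizeof(struct flow_rec);
    n = N_FLOWS - start;
    if (n > room) {
        n = room;
    }
    hdr.n_recs = (uint32_t) n;
    hdr.next_cursor = (uint32_t) (start + n);
    hdr.done = (start + n == N_FLOWS);
    memcpy(out, &hdr, sizeof hdr);
    memcpy((char *) out + sizeof hdr, &flow_table[start],
           n * sizeof(struct flow_rec));
    return sizeof hdr + n * sizeof(struct flow_rec);
}

/* User-mode side: repeat the call, feeding back the cursor, until done.
 * The buffer deliberately holds only two records per call. */
static size_t
dump_all(uint32_t *ids, size_t max_ids)
{
    unsigned char buf[sizeof(struct dump_reply_hdr) + 2 * sizeof(struct flow_rec)];
    struct dump_req req = { 0 };
    size_t total = 0;

    for (;;) {
        struct dump_reply_hdr hdr;
        size_t len = dump_once(&req, buf, sizeof buf);

        if (len < sizeof hdr) {
            break;
        }
        memcpy(&hdr, buf, sizeof hdr);
        for (uint32_t i = 0; i < hdr.n_recs && total < max_ids; i++) {
            struct flow_rec rec;
            memcpy(&rec, buf + sizeof hdr + i * sizeof rec, sizeof rec);
            ids[total++] = rec.id;
        }
        if (hdr.done || hdr.n_recs == 0) {
            break;  /* finished, or the buffer is too small to make progress */
        }
        req.cursor = hdr.next_cursor;
    }
    return total;
}
```

A real flow table can change between calls, so an actual cursor would need to tolerate insertions and deletions; the sketch ignores that.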

>o) "Event notification / NL multicast subscription"
>1. I understand you do not speak of events here as API waitable / notification events, right?
Yes, these are OVS events that are placed in a custom queue.
There is a single Operating System event associated with the global socket which collects all OVS events.
It will be triggered through a completion of a pending I/O request in the driver.

> what is the format of the structs that would be read from nl_sock_recv()?
The socket structure would contain a system Overlapped structure (along with an event).
The Overlapped structure would be used only for unicast and multicast subscription.
Transaction- and dump-based sockets are never waitable.

>2. What would the relationship between hyper-v ports, hyper-v nics and dp ports be?
>I mean, in the sense that the dp port additions and deletions would be requests coming from the userspace to the kernel (so no notification needed), while we get OIDs when nics connect & disconnect. In this sense, I see the hyper-v nic connection and disconnection as something that could be implemented as API notification events.
I assume that the above question is not related to the Netlink interface but I think your description is correct in general:
Hyper-V ports (unlike tunnel ports) are created by Hyper-V. The driver gets notified on every port creation or deletion (or attribute change). In turn, the driver queues an OVS event to a global queue (which was initially created when a multicast subscription IOCTL was sent to the driver). Then, the driver will complete the pending IRP associated with the event queue. The user mode thread waiting on the event (associated with the Overlapped structure for this socket) will wake up and subsequently a DP port operation would be executed.
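
The sequence above (queue an event, complete the pending request, wake the waiter, drain the queue) can be modeled in portable C as below. This is a single-threaded sketch, not driver code: the type names, the fixed queue size, and the two boolean flags standing in for the pending IRP and the event in the Overlapped structure are all assumptions made for illustration. It also assumes that a request armed while events are already queued completes immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical event kinds for Hyper-V port changes. */
enum ovs_event_type { EV_PORT_ADDED, EV_PORT_DELETED, EV_PORT_CHANGED };

struct ovs_event {
    enum ovs_event_type type;
    uint32_t port_no;
};

#define QUEUE_CAP 8

struct event_queue {
    struct ovs_event ev[QUEUE_CAP];
    size_t head, count;
    bool request_pending;  /* models the pending IRP held by the driver */
    bool signaled;         /* models the event in the Overlapped structure */
};

/* Models creating the queue when the multicast subscription is sent. */
static void
queue_init(struct event_queue *q)
{
    memset(q, 0, sizeof *q);
}

/* User mode: post (or re-post) the request that the driver will hold.
 * Assumption: if events are already queued, it completes right away. */
static void
queue_arm(struct event_queue *q)
{
    q->signaled = q->count > 0;
    q->request_pending = !q->signaled;
}

/* Driver side: called when Hyper-V reports a port change.
 * Returns false if the queue is full (a real driver needs a policy here). */
static bool
queue_post(struct event_queue *q, struct ovs_event ev)
{
    assert(q->count <= QUEUE_CAP);
    if (q->count == QUEUE_CAP) {
        return false;
    }
    q->ev[(q->head + q->count) % QUEUE_CAP] = ev;
    q->count++;
    if (q->request_pending) {
        q->request_pending = false;  /* "complete the pending IRP" */
        q->signaled = true;          /* the waiting thread wakes up */
    }
    return true;
}

/* User mode: after waking up, drain events one at a time. */
static bool
queue_pop(struct event_queue *q, struct ovs_event *out)
{
    if (q->count == 0) {
        return false;
    }
    *out = q->ev[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return true;
}
```

After draining, the user mode side would call `queue_arm()` again to wait for the next batch of events.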

>o) "C. Implementation work flow"
>So our incremental development here would be:
>1. Add a new device (alongside the existing one)
>2. Implement a netlink protocol (for basic parsing attributes, etc.) for the new device
>3. Implement netlink datapath operations for this device (get and dump only)

>4. further & more advanced things are to be dealt with later.
Event notification (multicast) and missed packet path (unicast) will be developed as a second phase. At this phase the FPID device object will be removed and the "new" vswitchd process will control the driver over the Netlink device interface.

>If I understand what you mean, I think this is an implementation detail.
>Basically, for our driver, for unicast messages I know that we can do sequential reads. We hold an 'offset' in the buffer where the next read must begin from. However, as I remember, the implementation for "write" simply overwrites the previous buffer (of the corresponding socket). I believe it is good to keep one-write then one-receive instead of doing all writes, then all receives.
>However, I think we need to take into account the situation where the userspace might be providing a smaller buffer than it is the total to read. Also, I think the "dump" mechanism requires it.
I want to assume that each transaction is self-contained, which means that the driver should not maintain the state of the transaction. Since we will be using an IOCTL for the transaction, the user mode buffer length will be specified in the command itself.
All Write/Read dump pairs are replaced with a single IOCTL call. As I understand it, transactions and dumps (as used for DPIF) are not really socket operations per se.
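
A minimal model of such a single-call transaction is sketched below, so it can be built anywhere without the Windows headers. The function name `transact`, the status codes, and the echo behavior are hypothetical; only the header layout is real (it mirrors the Linux `struct nlmsghdr`). The request travels in the input buffer, the reply comes back in the output buffer of the same call, and the caller's output length is checked up front because it is known from the call itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as Linux's struct nlmsghdr, redefined here so the
 * sketch does not depend on any platform header. */
struct nl_hdr {
    uint32_t nlmsg_len;    /* total message length, header included */
    uint16_t nlmsg_type;
    uint16_t nlmsg_flags;
    uint32_t nlmsg_seq;    /* a reply carries the request's sequence number */
    uint32_t nlmsg_pid;
};

/* Hypothetical status codes for the sketch. */
enum txn_status { TXN_OK = 0, TXN_BAD_REQUEST, TXN_BUFFER_TOO_SMALL };

/* Stand-in for the driver side of a transaction IOCTL. For illustration it
 * simply echoes the request payload back in the reply. The whole exchange
 * happens in one call, so nothing is remembered afterwards. */
static enum txn_status
transact(const void *in, size_t in_len,
         void *out, size_t out_cap, size_t *out_len)
{
    struct nl_hdr req, reply;
    size_t payload_len;

    assert(out_len != NULL);
    *out_len = 0;
    if (in_len < sizeof req) {
        return TXN_BAD_REQUEST;
    }
    memcpy(&req, in, sizeof req);
    if (req.nlmsg_len < sizeof req || req.nlmsg_len > in_len) {
        return TXN_BAD_REQUEST;
    }

    /* The caller's buffer size is part of the call, so the check is local. */
    payload_len = req.nlmsg_len - sizeof req;
    if (out_cap < sizeof reply + payload_len) {
        return TXN_BUFFER_TOO_SMALL;
    }

    reply = req;
    reply.nlmsg_len = (uint32_t) (sizeof reply + payload_len);
    reply.nlmsg_flags = 0;
    memcpy(out, &reply, sizeof reply);
    memcpy((char *) out + sizeof reply,
           (const char *) in + sizeof req, payload_len);
    *out_len = reply.nlmsg_len;
    return TXN_OK;
}
```

On Windows the same shape would presumably map onto one synchronous `DeviceIoControl` call with an input and an output buffer; the control code and device name are not specified in these notes.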

>My suggestions & opinions:
>o) I think we must do dumping via writes and reads. The main reason is the fact that we don't know the total size to read when we request, say, a flow dump.
/* Receive a reply. */
error = nl_sock_recv__(sock, buf_txn->reply, false);
I am not familiar with the ofpbuf structure. I noticed that you guys used MAX_STACK_LENGTH for specifying the buffer length. I need to get back to you on this one.

>o) I believe we shouldn't use the netlink overhead (nlmsghdr, genlmsghdr, attributes) when not needed (say, when registering a KEVENT notification), and, if we choose not to use netlink protocol always, we may need a way to differentiate between netlink and non-netlink requests.
Possible, as an optimization phase.

Thank you, Sam, for reviewing these notes. Please feel free to ask questions or raise any comments.
