[ovs-dev] Design notes for provisioning Netlink interface from the OVS Windows driver (Switch extension)

Nithin Raju nithin at vmware.com
Wed Aug 6 19:58:05 UTC 2014

Thanks so much for writing this up. This should clarify the questions that folks had during the IRC meeting.

Please feel free to send out a writeup if you have anything to discuss regarding the changes in dpif-linux.c. If not, cleaning up dpif-linux.c and submitting it with the changes/interface that worked with the Cloudbase kernel implementation would also be a major step forward.

We can take up how to make the changes in dpif-linux.c to fit the (efficient) I/O model that Eitan has described.


On Aug 6, 2014, at 11:15 AM, Eitan Eliahu <eliahue at vmware.com> wrote:

> Hello all,
> Here is a summary of our initial design. Not all areas are covered, so we would be glad to discuss anything listed here and any other code/features we could leverage.
> Thanks!
> Eitan
> A. Objectives:
> [1] Create a NetLink (NL) driver interface for Windows which interoperates with
>    the OVS NL user mode.
> [2] User mode code should be mostly cross platform with some minimal changes to 
>    support specific Windows OS calls.
> [3] The driver should not have to maintain state or resources for transactions
>    or dumps.
> [4] Reduce the number of system calls: user mode NL code should use the Device
>    IOCTL system call to send an NL command and to receive the associated NL
>    reply in the same system call, whenever possible (*).
> [5] An event may be associated with an NL socket I/O request to signal the
>    completion of an outstanding receive operation on the socket.
>    (For simplicity, a single outstanding I/O request could be associated with
>    a socket for signaling purposes.)
> (*) We assume multiple NL transactions for the same socket can never be
>    interleaved.
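Objectives [3] and [4] can be illustrated with a minimal, portable sketch. It does not use the real Windows API: `mock_transact_ioctl` is a hypothetical stand-in for the driver's IOCTL handler, and `struct nl_hdr` only mirrors the field layout of the Linux `struct nlmsghdr`. The point is that the request and the reply travel in one call and nothing is kept afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as the Linux struct nlmsghdr (names are illustrative). */
struct nl_hdr {
    uint32_t len;    /* total message length, header included */
    uint16_t type;   /* message type */
    uint16_t flags;  /* request flags */
    uint32_t seq;    /* sequence number chosen by the sender */
    uint32_t pid;    /* sending socket's PID */
};

/* Hypothetical stand-in for the driver's IOCTL handler.  It parses the
 * request from the input buffer and writes the reply into the output buffer
 * before returning, so no per-transaction state outlives the call.
 * Returns 0 on success, -1 on a malformed request or a short output buffer. */
int mock_transact_ioctl(const void *in, size_t in_len,
                        void *out, size_t out_cap, size_t *out_len)
{
    struct nl_hdr req, reply;

    assert(in != NULL && out != NULL && out_len != NULL);
    if (in_len < sizeof req || out_cap < sizeof reply) {
        return -1;
    }
    memcpy(&req, in, sizeof req);
    if (req.len < sizeof req || req.len > in_len) {
        return -1;
    }

    /* A header-only reply that echoes seq and pid so that the caller can
     * match it to the request. */
    reply.len = (uint32_t) sizeof reply;
    reply.type = req.type;
    reply.flags = 0;
    reply.seq = req.seq;
    reply.pid = req.pid;
    memcpy(out, &reply, sizeof reply);
    *out_len = sizeof reply;
    return 0;
}
```

On Windows the user mode side of such a call would presumably be a single DeviceIoControl() with both an input and an output buffer; the control code and the message contents are not specified in the notes above.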
> B. Netlink operation types:
> There are four types of interactions carried by processes through the NL layer:
> [1] Transaction based DPIF primitives: these DPIF commands are mapped to the
>    nl_sock_transact NL interface, which calls nl_sock_transact_multiple. The
>    transaction based command creates an ad hoc socket and submits a
>    synchronous device I/O to the driver. The driver constructs the NL reply
>    and copies it to the output buffer of the IRP representing the I/O
>    transaction.
>    (Provisioning of transaction based commands can be brought up and exercised
>    through the ovs-dpctl command in parallel to the existing DPIF device.)
> [2] State aware DPIF dump commands: port and flow dumps call the following NL
>    interfaces:
>    a) nl_dump_start()
>    b) nl_dump_next()
>    c) nl_dump_done()
>    With the exception of nl_dump_start, these NL primitives are based on a
>    synchronous IOCTL system call rather than Write/Read. Thus, the driver
>    does not have to keep any outstanding request for a dump transaction, nor
>    does it need to allocate any resources for it.
> [3] UpCall Port/PID/Unicast socket:
>    The driver maintains a per-socket queue for all packets which have no
>    matching flow in the flow table. The socket has a single overlapped (event)
>    structure which is signalled through the completion of a pending I/O
>    request sent by user mode on subscription (similar to the current
>    implementation). When dpif_recv_wait is called, the event associated with
>    the pending I/O request is passed to poll_fd_wait_event in order to wake
>    the thread which polls the port queue.
>    dpif_recv calls nl_sock_recv, which in turn drains the queue maintained
>    by the kernel in a synchronous fashion (through the use of an ioctl
>    system call). The overlapped structure is rearmed when the recv_set
>    DPIF callback function is called.
> [4] Event notification / NL multicast subscription:
>    Events (such as port addition/deletion and link up/down) are propagated
>    from the kernel to user mode through the subscription of a socket to a
>    multicast group (nl_sock_join_mcgroup()) and a synchronous receive
>    (nl_sock_recv()) for retrieving the events. The driver maintains a single
>    event queue for all events. Similar to the UpCall mechanism, a user mode
>    process keeps an outstanding I/O request in the driver which is triggered
>    whenever a new event is generated. The event associated with the
>    overlapped structure of the socket is passed to poll_fd_wait_event()
>    whenever the dpif_port_poll_wait() callback function is called.
>    dpif_poll() drains the event queue by calling nl_sock_recv().
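The queue-plus-event model of B[3] and B[4] can be sketched in portable C as below. This is a simulation under stated assumptions, not driver code: `struct upcall_sock`, `sock_arm`, `sock_enqueue` and `sock_drain` are hypothetical names, a plain boolean stands in for the overlapped event, and integers stand in for packets. Completing the request immediately when the queue is already non-empty at arm time is also an assumption; the notes above do not cover that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 8

struct upcall_sock {
    int packets[QUEUE_CAP];  /* packet ids standing in for real packets */
    size_t head;             /* index of the oldest queued packet */
    size_t count;            /* number of queued packets */
    bool io_pending;         /* user mode has an outstanding I/O request */
    bool event_set;          /* stand-in for the overlapped event */
};

/* recv_set: user mode posts (or re-posts) the pending I/O request.
 * Assumption: if packets are already queued, the request completes at once. */
void sock_arm(struct upcall_sock *s)
{
    assert(s != NULL);
    s->event_set = s->count > 0;
    s->io_pending = !s->event_set;
}

/* Driver side: queue a packet that had no matching flow.  Completing the
 * pending request signals the event that the polling thread waits on. */
bool sock_enqueue(struct upcall_sock *s, int pkt)
{
    if (s->count == QUEUE_CAP) {
        return false;                    /* queue full: packet is dropped */
    }
    s->packets[(s->head + s->count) % QUEUE_CAP] = pkt;
    s->count++;
    if (s->io_pending) {
        s->io_pending = false;
        s->event_set = true;
    }
    return true;
}

/* User side: one synchronous call drains up to max queued packets. */
size_t sock_drain(struct upcall_sock *s, int *out, size_t max)
{
    size_t n = 0;

    while (n < max && s->count > 0) {
        out[n++] = s->packets[s->head];
        s->head = (s->head + 1) % QUEUE_CAP;
        s->count--;
    }
    return n;
}
```

In this sketch the event fires once per arm, so a burst of packets wakes the thread a single time and the drain call then collects the whole burst.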
> C. Implementation work flow:
> The driver creates a device object which provides a NetLink interface for user
> mode processes. During the development phase this device is created in addition
> to the existing DPIF device. (This means that the bring-up of the NL based user
> mode can be done on a live kernel with resident DPs, ports and flows.)
> All transaction and dump based DPIF functions could be developed and brought up
> while the NL device is a secondary device (ovs-dpctl show and dump XXX should
> work). After the initial phase is completed (i.e. all transaction and dump based
> DPIF primitives are implemented), the original device interface will be removed
> and the packet and event propagation path will be brought up (driven by
> ovs-vswitchd.exe).
> [1] Socket creation
>    Since the PID should be allocated on a system wide basis and be unique
>    across all processes, the kernel assigns the PID for a newly created
>    socket. A new IOCTL command, OVS_GET_PID, returns the PID to a user mode
>    client to be associated with the socket.
> [2] Detailed description
>    nl_sock_transact_multiple() calls into a series of nl_sock_send__() and
>    nl_sock_recv__() calls. These can be implemented using ReadFile() and
>    WriteFile(), or an ioctl modeled on a transaction which does both read and
>    write. One thing though is that nl_sock_transact_multiple() might have to
>    be modified to interleave the calls, each nl_sock_send__() followed by its
>    nl_sock_recv__(), rather than doing a bunch of sends first and then doing
>    the recvs. This is because Windows may not preserve message boundaries
>    when we do the recv.
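The message-boundary concern can be shown with a small simulation. Nothing here is OVS or Windows code: `struct mock_chan`, `chan_send`, `chan_recv` and `transact_interleaved` are hypothetical names, and the mock channel is assumed to behave like a byte stream in which a receive returns whatever has accumulated. With one send followed immediately by its receive, each receive sees exactly one reply; after two sends in a row, a single receive returns both replies fused together.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STREAM_CAP 256

/* Mock channel that does not keep message boundaries: replies are appended
 * back to back and a receive returns whatever is buffered. */
struct mock_chan {
    unsigned char buf[STREAM_CAP];
    size_t used;
};

/* "Send" a request; the mock driver immediately queues a reply of
 * reply_len bytes.  Returns 0 on success, -1 if the buffer is full. */
int chan_send(struct mock_chan *c, size_t reply_len)
{
    assert(c != NULL);
    if (reply_len > STREAM_CAP - c->used) {
        return -1;
    }
    memset(c->buf + c->used, 0xAB, reply_len);
    c->used += reply_len;
    return 0;
}

/* Receive everything currently buffered (up to cap); returns the byte count. */
size_t chan_recv(struct mock_chan *c, unsigned char *out, size_t cap)
{
    size_t n = c->used < cap ? c->used : cap;

    memcpy(out, c->buf, n);
    memmove(c->buf, c->buf + n, c->used - n);
    c->used -= n;
    return n;
}

/* Interleaved variant: one send, then its receive, so got[i] holds the byte
 * count of reply i alone. */
int transact_interleaved(struct mock_chan *c, const size_t *reply_lens,
                         size_t n, size_t *got)
{
    unsigned char tmp[STREAM_CAP];
    size_t i;

    for (i = 0; i < n; i++) {
        if (chan_send(c, reply_lens[i]) != 0) {
            return -1;
        }
        got[i] = chan_recv(c, tmp, sizeof tmp);
    }
    return 0;
}
```

An ioctl that carries both the request and the reply would avoid the problem in a different way, since each call then returns exactly one reply by construction.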
