[ovs-dev] [PATCH v2] datapath-windows: update DESIGN document

Nithin Raju nithin at vmware.com
Fri Nov 21 00:27:46 UTC 2014


In this patch, we update the design document to reflect the netlink
based kernel-userspace interface implementation and a few other changes,
which I have covered at a high level.

Please feel free to extend the document with any details that you think
are missing.

Signed-off-by: Nithin Raju <nithin at vmware.com>
---
 datapath-windows/DESIGN |  260 +++++++++++++++++++++++++++++-----------------
 1 files changed, 164 insertions(+), 96 deletions(-)
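
For readers unfamiliar with the netlink wire format that the new "Netlink
message parser" section refers to, the following is a minimal, self-contained
sketch in C of how a message is laid out and walked: a 16-byte header followed
by type-length-value attributes padded to 4 bytes. It is not code from the
datapath. All sketch_* names are hypothetical and chosen for illustration
only, and the generic netlink and OVS specific headers that follow the netlink
header in real OVS messages are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Netlink header layout; hypothetical stand-in for struct nlmsghdr. */
struct sketch_nlmsghdr {
    uint32_t nlmsg_len;    /* Length of the message, including this header. */
    uint16_t nlmsg_type;   /* Message type. */
    uint16_t nlmsg_flags;  /* Request/reply flags. */
    uint32_t nlmsg_seq;    /* Sequence number. */
    uint32_t nlmsg_pid;    /* Port ID of the sending socket. */
};

/* Attribute header; the payload follows and is padded to 4 bytes. */
struct sketch_nlattr {
    uint16_t nla_len;      /* Header plus payload, excluding padding. */
    uint16_t nla_type;
};

#define SKETCH_ALIGN(len) (((size_t) (len) + 3u) & ~(size_t) 3u)

/* Appends one attribute at 'offset'.  Returns the new offset, or 0 if the
 * attribute does not fit in the buffer. */
static size_t
sketch_put_attr(uint8_t *buf, size_t size, size_t offset, uint16_t type,
                const void *data, uint16_t data_len)
{
    struct sketch_nlattr nla;
    size_t total;

    if (data_len > UINT16_MAX - sizeof nla) {
        return 0;
    }
    total = SKETCH_ALIGN(sizeof nla + data_len);
    if (offset > size || size - offset < total) {
        return 0;
    }
    nla.nla_len = (uint16_t) (sizeof nla + data_len);
    nla.nla_type = type;
    memset(buf + offset, 0, total);          /* Zero the padding bytes. */
    memcpy(buf + offset, &nla, sizeof nla);
    if (data_len) {
        memcpy(buf + offset + sizeof nla, data, data_len);
    }
    return offset + total;
}

/* Builds a header followed by a single 32-bit attribute.  Returns the total
 * message length, or 0 if the buffer is too small. */
static size_t
sketch_build_msg(uint8_t *buf, size_t size, uint16_t msg_type, uint32_t seq,
                 uint32_t pid, uint16_t attr_type, uint32_t value)
{
    struct sketch_nlmsghdr hdr;
    size_t end;

    assert(buf != NULL);
    if (size < sizeof hdr) {
        return 0;
    }
    end = sketch_put_attr(buf, size, sizeof hdr, attr_type,
                          &value, sizeof value);
    if (!end) {
        return 0;
    }
    hdr.nlmsg_len = (uint32_t) end;
    hdr.nlmsg_type = msg_type;
    hdr.nlmsg_flags = 0;
    hdr.nlmsg_seq = seq;
    hdr.nlmsg_pid = pid;
    memcpy(buf, &hdr, sizeof hdr);
    return end;
}

/* Returns the payload of the first attribute of 'type' found in
 * [buf + start, buf + end), or NULL if it is absent or malformed. */
static const uint8_t *
sketch_find_attr(const uint8_t *buf, size_t start, size_t end,
                 uint16_t type, uint16_t *payload_len)
{
    size_t offset = start;

    while (offset <= end && end - offset >= sizeof(struct sketch_nlattr)) {
        struct sketch_nlattr nla;

        memcpy(&nla, buf + offset, sizeof nla);
        if (nla.nla_len < sizeof nla || nla.nla_len > end - offset) {
            return NULL;                     /* Truncated or corrupt. */
        }
        if (nla.nla_type == type) {
            *payload_len = (uint16_t) (nla.nla_len - sizeof nla);
            return buf + offset + sizeof nla;
        }
        offset += SKETCH_ALIGN(nla.nla_len);
    }
    return NULL;
}
```

A caller would build a message with sketch_build_msg() and then look up an
attribute with sketch_find_attr(), starting at the offset just past the
header. The length checks in the walker are the important part of the sketch:
a kernel-side parser cannot trust lengths supplied by userspace.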

diff --git a/datapath-windows/DESIGN b/datapath-windows/DESIGN
index b438c44..638990d 100644
--- a/datapath-windows/DESIGN
+++ b/datapath-windows/DESIGN
@@ -1,20 +1,13 @@
                        OVS-on-Hyper-V Design Document
                        ==============================
-There has been an effort in the recent past to develop the Open vSwitch (OVS)
-solution onto multiple hypervisor platforms such as FreeBSD and Microsoft
-Hyper-V. VMware has been working on a OVS solution for Microsoft Hyper-V for
-the past few months and has successfully completed the implementation.
-
-This document provides details of the development effort. We believe this
-document should give enough information to members of the community who are
-curious about the developments of OVS on Hyper-V. The community should also be
-able to get enough information to make plans to leverage the deliverables of
-this effort.
-
-The userspace portion of the OVS has already been ported to Hyper-V and
-committed to the openvswitch repo. So, this document will mostly emphasize on
-the kernel driver, though we touch upon some of the aspects of userspace as
-well.
+There has been a community effort to develop Open vSwitch on Microsoft Hyper-V.
+In this document, we provide details of the development effort. We believe this
+document should give enough information to understand the overall design.
+
+The userspace portion of the OVS has been ported to Hyper-V in a separate
+effort, and committed to the openvswitch repo. So, this document will mostly
+focus on the kernel driver, though we touch upon some aspects of
+userspace as well.
 
 We cover the following topics:
 1. Background into relevant Hyper-V architecture
@@ -48,13 +41,13 @@ In Hyper-V, the virtual machine is called the Child Partition. Each VIF or
 physical NIC on the Hyper-V extensible switch is attached via a port. Each port
 is both on the ingress path or the egress path of the switch. The ingress path
 is used for packets being sent out of a port, and egress is used for packet
-being received on a port. By design, NDIS provides a layered interface, where
-in the ingress path, higher level layers call into lower level layers, and on
-the egress path, it is the other way round. In addition, there is a object
-identifier (OID) interface for control operations Eg. addition of a port. The
-workflow for the calls is similar in nature to the packets, where higher level
-layers call into the lower level layers. A good representational diagram of
-this architecture is in [4].
+being received on a port. By design, NDIS provides a layered interface. In this
+layered interface, higher level layers call into lower level layers in the
+ingress path. In the egress path, it is the other way round. In addition, there
+is an object identifier (OID) interface for control operations, e.g. addition
+of a port. The workflow for the calls is similar in nature to the packets,
+where higher level layers call into the lower level layers. A good
+representational diagram of this architecture is in [4].
 
 Windows Filtering Platform (WFP)[5] is a platform implemented on Hyper-V that
 provides APIs and services for filtering packets. WFP has been utilized to
@@ -75,22 +68,23 @@ has been used to retrieve some of the configuration information that OVS needs.
                                   |                               |
   +------+ +--------------+       | +-----------+  +------------+ |
   |      | |              |       | |           |  |            | |
-  | OVS- | |     OVS      |       | | Virtual   |  | Virtual    | |
-  | wind | |  USERSPACE   |       | | Machine #1|  | Machine #2 | |
-  |      | |  DAEMON/CTL  |       | |           |  |            | |
+  | ovs- | |     OVS-     |       | | Virtual   |  | Virtual    | |
+  | *ctl | |  USERSPACE   |       | | Machine #1|  | Machine #2 | |
+  |      | |    DAEMON    |       | |           |  |            | |
   +------+-++---+---------+       | +--+------+-+  +----+------++ | +--------+
-  |  DPIF-  |   | netdev- |       |    |VIF #1|         |VIF #2|  | |Physical|
-  | Windows |<=>| Windows |       |    +------+         +------+  | |  NIC   |
+  |  dpif-  |   | netdev- |       |    |VIF #1|         |VIF #2|  | |Physical|
+  | netlink |   | windows |       |    +------+         +------+  | |  NIC   |
   +---------+   +---------+       |      ||                   /\  | +--------+
-User     /\                       |      || *#1*         *#4* ||  |     /\
-=========||=======================+------||-------------------||--+     ||
-Kernel   ||                              \/                   ||  ||=====/
-         \/                           +-----+                 +-----+ *#5*
+User     /\         /\            |      || *#1*         *#4* ||  |     /\
+=========||=========||============+------||-------------------||--+     ||
+Kernel   ||         ||                   \/                   ||  ||=====/
+         \/         \/                +-----+                 +-----+ *#5*
  +-------------------------------+    |     |                 |     |
  |   +----------------------+    |    |     |                 |     |
  |   |   OVS Pseudo Device  |    |    |     |                 |     |
- |   +----------------+-----+    |    |     |                 |     |
- |                               |    |  I  |                 |     |
+ |   +----------------------+    |    |     |                 |     |
+ |      | Netlink Impl. |        |    |     |                 |     |
+ |      -----------------        |    |  I  |                 |     |
  | +------------+                |    |  N  |                 |  E  |
  | |  Flowtable | +------------+ |    |  G  |                 |  G  |
  | +------------+ |  Packet    | |*#2*|  R  |                 |  R  |
@@ -110,9 +104,8 @@ Kernel   ||                              \/                   ||  ||=====/
 Figure 2 shows the various blocks involved in the OVS Windows implementation,
 along with some of the components available in the NDIS stack, and also the
 virtual machines. The workflow of a packet being transmitted from a VIF out and
-into another VIF and to a physical NIC is also shown. New userspace components
-being added as also shown. Later on in this section, we’ll discuss the flow of
-a packet at a high level.
+into another VIF and to a physical NIC is also shown. Later on in this section,
+we will discuss the flow of a packet at a high level.
 
 The figure gives a general idea of where the OVS userspace and the kernel
 components fit in, and how they interface with each other.
@@ -122,9 +115,11 @@ a forwarding extension roughly implementing the following
 sub-modules/functionality. Details of each of these sub-components in the
 kernel are contained in later sections:
  * Interfacing with the NDIS stack
+ * Netlink message parser
+ * Netlink sockets
  * Switch/Datapath management
  * Interfacing with userspace portion of the OVS solution to implement the
-   necessary ioctls that userspace needs
+   necessary functionality that userspace needs
  * Port management
  * Flowtable/Actions/packet forwarding
  * Tunneling
@@ -140,32 +135,36 @@ are:
  * Interface between the userspace and the kernel module.
  * Event notifications are significantly different.
  * The communication interface between DPIF and the kernel module need not be
-   implemented in the way OVS on Linux does.
+   implemented in the way OVS on Linux does. That said, it would be
+   advantageous to have a similar interface to the kernel module for reasons of
+   readability and maintainability.
  * Any licensing issues of using Linux kernel code directly.
 
 Due to these differences, it was a straightforward decision to develop the
 datapath for OVS on Hyper-V from scratch rather than porting the one on Linux.
-A re-development focussed on the following goals:
+A re-development focused on the following goals:
  * Adhere to the existing requirements of userspace portion of OVS (such as
-   ovs- vswitchd), to minimize changes in the userspace workflow.
+   ovs-vswitchd), to minimize changes in the userspace workflow.
  * Fit well into the typical workflow of a Hyper-V extensible switch forwarding
    extension.
 
 The userspace portion of the OVS solution is mostly POSIX code, and not very
-Linux specific. Majority of the code has already been ported and committed to
-the openvswitch repo. Most of the daemons such as ovs-vswitchd or ovsdb-server
-can run on Windows now. One additional daemon that has been implemented is
-called ovs-wind. At a high level ovs-wind manages keeps the ovsdb used by
-userspace in sync with the kernel state. More details in the userspace section.
+Linux specific. The majority of the userspace code does not interface directly
+with the kernel datapath and was ported independently of the kernel datapath
+effort.
 
 As explained in the OVS porting design document [7], DPIF is the portion of
-userspace that interfaces with the kernel portion of the OVS. Each platform can
-have its own implementation of the DPIF provider whose interface is defined in
-dpif-provider.h [3]. For OVS on Hyper-V, we have an implementation of DPIF
-provider for Hyper-V. The communication interface between userspace and the
-kernel is a pseudo device and is different from that of the Linux’s DPIF
-provider which uses netlink. But, as long as the DPIF provider interface is the
-same, the callers should be agnostic of the underlying communication interface.
+userspace that interfaces with the kernel portion of the OVS. The interface
+that each DPIF provider has to implement is defined in dpif-provider.h [3].
+Though each platform is allowed to have its own implementation of the DPIF
+provider, it was found, via community feedback, that it is desirable to
+share code whenever possible. Thus, the DPIF provider for OVS on Hyper-V shares
+code with the DPIF provider on Linux. This interface is implemented in
+dpif-netlink.c, formerly dpif-linux.c.
+
+We will elaborate on the kernel-userspace interface in a dedicated section
+below. Here it suffices to say that the DPIF provider implementation for
+Windows is netlink-based and shares code with the Linux one.
 
 2.a) Kernel module (datapath)
 -----------------------------
@@ -178,8 +177,8 @@ This is consistent with using a single datapath in the kernel on Linux. All the
 physical adapters are connected as external adapters to the extensible switch.
 
 When the OVS switch extension registers itself as a filter driver, it also
-registers callbacks for the switch management and datapath functions. In other
-words, when a switch is created on the Hyper-V root partition (host), the
+registers callbacks for the switch/port management and datapath functions. In
+other words, when a switch is created on the Hyper-V root partition (host), the
 extension gets an activate callback upon which it can initialize the data
 structures necessary for OVS to function. Similarly, there are callbacks for
 when a port gets added to the Hyper-V switch, and an External Network adapter
@@ -190,7 +189,7 @@ packet is received on an external NIC.
 As shown in the figures, an extensible switch extension gets to see a packet
 sent by the VM (VIF) twice - once on the ingress path and once on the egress
 path. Forwarding decisions are to be made on the ingress path. Correspondingly,
-we’ll be hooking onto the following interfaces:
+we will be hooking onto the following interfaces:
  * Ingress send indication: intercept packets for performing flow based
    forwarding.This includes straight forwarding to output ports. Any packet
    modifications needed to be performed are done here either inline or by
@@ -203,11 +202,41 @@ we’ll be hooking onto the following interfaces:
 
 Interfacing with OVS userspace
 ------------------------------
-We’ve implemented a pseudo device interface for letting OVS userspace talk to
+We have implemented a pseudo device interface for letting OVS userspace talk to
 the OVS kernel module. This is equivalent to the typical character device
-interface on POSIX platforms. The pseudo device supports a whole bunch of
+interface on POSIX platforms where we can register custom functions for read,
+write and ioctl functionality. The pseudo device supports a whole bunch of
 ioctls that netdev and DPIF on OVS userspace make use of.
 
+Netlink message parser
+----------------------
+The communication between OVS userspace and OVS kernel datapath is in the form
+of Netlink messages [8]. More details about this are provided in section 2.c,
+Kernel-Userspace interface. In the kernel, a full-fledged netlink message
+parser has been implemented along the lines of the netlink message parser in
+OVS userspace. In fact, much of the code is ported code.
+
+Along the lines of 'struct ofpbuf' in OVS userspace, a managed buffer has been
+implemented in the kernel datapath to make it easier to parse and construct
+netlink messages.
+
+Netlink sockets
+---------------
+On Linux, OVS userspace utilizes netlink sockets to pass netlink messages back
+and forth. Since much of userspace code including DPIF provider in
+dpif-netlink.c (formerly dpif-linux.c) has been reused, pseudo-netlink sockets
+have been implemented in OVS userspace. Windows lacks native netlink socket
+support, and its socket family is not extensible either. Hence it is not
+possible to provide a native implementation of netlink sockets. We emulate
+netlink sockets in lib/netlink-socket.c and support all of the nl_* APIs to
+higher levels. The implementation opens a handle to the pseudo device for each
+netlink socket. Some more details on this topic are provided in the userspace
+section on netlink sockets.
+
+Typical netlink semantics of read message, write message, dump, and transaction
+have been implemented so that higher level layers are not affected by the
+netlink implementation not being native.
+
 Switch/Datapath management
 --------------------------
 As explained above, we hook onto the management callback functions in the NDIS
@@ -267,48 +296,83 @@ used.
 
 2.b) Userspace components
 -------------------------
-A new daemon has been added to userspace to manage the entities in OVSDB, and
-also to keep it in sync with the kernel state, and this include bridges,
-physical NICs, VIFs etc. For example, upon bootup, ovs-wind does a get on the
-kernel to get a list of the bridges, and the corresponding ports and populates
-OVSDB. If a new VIF gets added to the kernel switch because a user powered on a
-Virtual Machine, ovs-wind detects it, and adds a corresponding entry in the
-ovsdb. This implies that ovs-wind has a synchronous as well as an asynchronous
-interface to the OVS kernel driver.
-
+The userspace portion of the OVS solution is mostly POSIX code, and not very
+Linux specific. The majority of the userspace code does not interface directly
+with the kernel datapath and was ported independently of the kernel datapath
+effort.
+
+In this section, we cover the userspace components that interface with the
+kernel datapath.
+
+As explained earlier, OVS on Hyper-V shares the DPIF provider implementation
+with Linux. The DPIF provider on Linux uses netlink sockets and netlink
+messages. Netlink sockets and messages are extensively used on Linux to
+exchange information between userspace and kernel. In order to satisfy these
+dependencies, netlink socket (pseudo and non-native) and netlink messages
+are implemented on Hyper-V.
+
+The following are the major advantages of sharing DPIF provider code:
+1. Maintenance is simpler:
+   Any change made to the interface defined in dpif-provider.h need not be
+   propagated to multiple implementations. Also, developers familiar with the
+   Linux implementation of the DPIF provider can easily ramp up on the Hyper-V
+   implementation as well.
+2. Netlink messages provide inherent advantages:
+   Netlink messages are known for their extensibility. Each message is
+   versioned, so the provided data structures offer a mechanism to perform
+   version checking and forward/backward compatibility with the kernel
+   module.
+
+Netlink sockets
+---------------
+As explained in other sections, an emulation of netlink sockets has been
+implemented in lib/netlink-socket.c for Windows. The implementation creates a
+handle to the OVS pseudo device, and emulates netlink socket semantics of
+receive message, send message, dump, and transact. Most of the nl_* functions
+are supported.
+
+The fact that the implementation is non-native manifests in various ways.
+One example is that the PID for the netlink socket is not automatically
+assigned in userspace when a handle is created to the OVS pseudo device. There
+is an extra command (defined in OvsDpInterfaceExt.h) that is used to grab the
+PID generated in the kernel.
+
+DPIF provider
+-------------
+As has been mentioned in earlier sections, the netlink socket and netlink
+message based DPIF provider on Linux has been ported to Windows.
+Correspondingly, the file has been renamed from lib/dpif-linux.c to
+lib/dpif-netlink.c.
 
-2.c) Kernel-Userspace interface
--------------------------------
-DPIF-Windows
-------------
-DPIF-Windows is the Windows implementation of the interface defined in dpif-
-provider.h, and provides an interface into the OVS kernel driver. We implement
-most of the callbacks required by the DPIF provider. A quick summary of the
-functionality implemented is as follows:
- * dp_dump, dp_get: dump all datapath information or get information for a
-   particular datapath.  Currently we only support one datapath.
- * flow_dump, flow_put, flow_get, flow_flush: These functions retrieve all
-   flows in the kernel, add a flow to the kernel, get a specific flow and
-   delete all the flows in the kernel.
- * recv_set, recv, recv_wait, recv_purge: these poll packets for upcalls.
- * execute: This is used to send packets from userspace to the kernel. The
-   packets could be either flow miss packet punted from kernel earlier or
-   userspace generated packets.
- * vport_dump, vport_get, ext_info: These functions dump all ports in the
-   kernel, get a specific port in the kernel, or get extended information
-   about a port.
- * event_subscribe, wait, poll: These functions subscribe, wait and poll the
-   events that kernel posts.  A typical example is kernel notices a port has
-   gone up/down, and would like to notify the userspace.
+Most of the code is common. Some divergence is in the code to receive
+packets. The Linux implementation uses epoll() [9], which is not natively
+supported on Windows.
 
 Netdev-Windows
 --------------
-We have a Windows implementation of the the interface defined in lib/netdev-
-provider.h. The implementation provided functionality to get extended
-information about an interface. It is limited in functionality compared to the
-Linux implementation of the netdev provider and cannot be used to add any
-interfaces in the kernel such as a tap interface.
+We have a Windows implementation of the interface defined in
+lib/netdev-provider.h. The implementation provides functionality to get
+extended information about an interface. It is limited in functionality
+compared to the Linux implementation of the netdev provider and cannot be used
+to add any interfaces in the kernel such as a tap interface or to send/receive
+packets. The netdev-windows implementation uses the datapath interface
+extensions defined in:
+datapath-windows/include/OvsDpInterfaceExt.h
 
+2.c) Kernel-Userspace interface
+-------------------------------
+openvswitch.h and OvsDpInterfaceExt.h
+-------------------------------------
+Since the DPIF provider is shared with Linux, the kernel datapath provides the
+same interface as the Linux datapath. The interface is defined in
+datapath/linux/compat/include/linux/openvswitch.h. Derivatives of this
+interface file are created during OVS userspace compilation. The derivative for
+the kernel datapath on Hyper-V is provided in the following location:
+datapath-windows/include/OvsDpInterface.h
+
+That said, there are Windows specific extensions that are defined in the
+interface file:
+datapath-windows/include/OvsDpInterfaceExt.h
 
 2.d) Flow of a packet
 ---------------------
@@ -354,9 +418,9 @@ driver.
 
 Reference list:
 ===============
-1: Hyper-V Extensible Switch
+1. Hyper-V Extensible Switch
 http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx
-2: Hyper-V Extensible Switch Extensions
+2. Hyper-V Extensible Switch Extensions
 http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx
 3. DPIF Provider
 http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif-
@@ -369,3 +433,7 @@ http://msdn.microsoft.com/en-us/library/windows/desktop/aa366510(v=vs.85).aspx
 http://msdn.microsoft.com/en-us/library/windows/hardware/ff557015(v=vs.85).aspx
 7. How to Port Open vSwitch to New Software or Hardware
 http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
+8. Netlink
+http://en.wikipedia.org/wiki/Netlink
+9. epoll
+http://en.wikipedia.org/wiki/Epoll
-- 
1.7.4.1



