[ovs-dev] upstreaming datapath

Ben Pfaff blp at nicira.com
Tue Oct 20 16:50:50 UTC 2009


Stephen Hemminger <shemminger at vyatta.com> writes:

>> The main place where the openvswitch module is actively
>> incompatible with anything in the upstream kernel is the bridge
>> hook (br_handle_frame_hook).  My thought there is that this hook
>> should become per-net_device, so that the existing Linux bridge
>> and openvswitch can coexist in a single system (which is useful,
>> and yet OVS can't support it right now).  Does that make sense to
>> you?
>
> There is active discussion on netdev, of providing a better hook
> for incoming packet redirection. PF_RING needs it and the existing
> open coded stack of hooks is awkward.

It looks to me like that discussion ended with David Miller
refusing to apply the patch because it would make it too easy to
hook in non-GPL'd network stacks.  (And I don't see it in
net-next.)

> My plan is to replace this with a new hook chain similar to existing
> protocol hooks, and make bridge/bond/macvlan and OVS and PF_RING
> use it. 

That would work fine, of course, if davem will accept it.

>> We'd also need to decide on a sysfs interface for openvswitch.
>> Currently the code emulates the existing bridge's sysfs
>> interface, because we needed compatibility, but clearly it's not
>> completely suitable and we should design something better.
>
> I prefer the netlink API used by vlans, with some additions
> for the control interface. sysfs is handy as a side door interface
> for shell scripts and parameter tweaking.

Right, that's all I had in mind.

>> What kind of unified interface do you have in mind?  I can
>> imagine using the same netlink calls for, say, adding and
>> removing bridges and ports.  But both the existing bridge and the
>> openvswitch also have functionality that the other does not.  It
>> would not make sense to try to shoehorn both into exactly the
>> same interface.
>
> There is a netlink interface used for macvlan, vlan, gre and veth.
> It is missing support for bridge and bonding. The idea is to
> add types for adding, deleting and modifying slaves and parameters.

Makes sense.  This would fit in fine.

>> Initially (I think that this was so long ago that it is not in
>> our current Git tree), Open vSwitch used Netlink entirely for
>> communication with userspace (whereas now it uses character
>> devices).  But this proved not to work well for transactional
>> operations that are not idempotent, because responses to Netlink
>> messages can get lost.  For example, Open vSwitch has a datapath
>> operation to delete a flow and return its statistics.  When this
>> was implemented as a Netlink request and response, it was
>> possible for the response to get lost (because a kernel memory
>> allocation failed).  But re-sending the request would not work,
>> because the first command had deleted the flow.  And breaking it
>> into two separate commands (get flow stats, delete flow)
>> introduces a race where statistics on packets that arrive between
>> the commands are lost.  This is the main reason that we are not
>> using Netlink now.  I think there were other reasons, too, but
>> that is the one that comes to mind first.
>>
>
> Netlink will not drop responses to messages; the only case where
> messages can get lost is when it is used for monitoring. The normal
> usage of request/response (even for dumping large tables) is supposed
> to be guaranteed.

I know that this is the case for table dumps (NLM_F_DUMP) because
of the repeated callback design, but I don't think it is
guaranteed for anything else.  netlink_unicast(), which is used
as the basis for all netlink responses, can fail due to full
socket buffers or memory allocation failures.  When a process
that invokes ioctl provides a pointer to a preallocated userspace
buffer, you don't hit that particular problem (and avoid a copy
step too).
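
To make the failure mode concrete, here is a minimal userspace
sketch of the two styles of "delete a flow and return its
statistics".  It is only a model: the struct and function names are
hypothetical and are not the real datapath interface, and the
reply_ok flag simply stands in for whether queuing the reply
succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counters for one flow (not the real datapath structs). */
struct flow_stats {
    uint64_t n_packets;
    uint64_t n_bytes;
};

/* Hypothetical single-flow table, standing in for the flow table. */
struct stats_table {
    bool present;               /* Is the flow installed? */
    struct flow_stats stats;    /* Counters for that flow. */
};

/* Message-style delete: the flow is removed, then a reply has to be
 * built and queued.  'reply_ok' models whether that second step
 * succeeds.  Returns 0 on success, -1 if the flow does not exist, and
 * -2 if the flow was deleted but the reply, and with it the
 * statistics, was lost. */
int
delete_flow_message(struct stats_table *t, bool reply_ok,
                    struct flow_stats *reply)
{
    assert(t != NULL && reply != NULL);
    if (!t->present) {
        return -1;              /* A retry after a lost reply lands here. */
    }
    struct flow_stats removed = t->stats;
    t->present = false;
    if (!reply_ok) {
        return -2;              /* 'removed' is dropped on the floor. */
    }
    *reply = removed;
    return 0;
}

/* Buffer-style delete: the caller hands in the result buffer up
 * front, so nothing has to be allocated between removing the flow and
 * reporting its statistics. */
int
delete_flow_buffer(struct stats_table *t, struct flow_stats *out)
{
    assert(t != NULL && out != NULL);
    if (!t->present) {
        return -1;
    }
    *out = t->stats;
    t->present = false;
    return 0;
}
```

In the message-style version, a caller that sees the -2 case has no
way to recover: re-sending the request returns -1 because the flow
is already gone.  The buffer-style version has no such window in
this model.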

Maybe it's worth pointing out that while the existing Linux
bridge code is largely "hands off" (that is, you add devices as
ports and maybe set a few options), the Open vSwitch datapath
requires a lot of userspace interaction (every new flow has to be
set up in the kernel).  So there's more incentive to build an
efficient path to the kernel, at least for the operations that
are executed often.

I don't have benchmarks comparing Netlink vs. ioctl, though.

One nice thing about Netlink is that it is naturally extensible.

>> But the biggest reason that we have not already submitted OVS for
>> inclusion is this one: currently the interface is not flexible
>> and not extensible.  In particular, beyond the L2 Ethernet
>> header, it can only match IPv4 packets.  I have some thoughts on
>> how to make it more flexible and extensible, but I have not had
>> time to work any of it out in detail or to start writing code for
>> it.
>
> Wild idea would be to build off of nftables state machine engine.
> It is better not to have to build full protocol possibilities in the
> kernel, that is why Patrick is working on nftables as a long term
> replacement for iptables.

nftables looks really cool, but I'm concerned about its
performance.  The Open vSwitch datapath supports up to millions
of flows and its performance is as good as the existing bridge in
our tests.  That's because each packet just gets a flow extracted
in a simple way and then the result is looked up in a hash table.
I don't know enough about nftables to say whether it might be
able to perform as well as this.  If it's based on the idea of
executing statements linearly (it seems to be so) then at least
it would require a good bit of work.  (I'm not personally averse
to that, if it seems the best way to go.)
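
As a rough illustration of the "extract a flow, then look it up in a
hash table" path, here is a minimal userspace sketch.  The key
layout, the FNV-1a hash, the bucket count, and all of the names are
assumptions chosen for the example; they are not the actual datapath
code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_BUCKETS 1024          /* Power of 2, so masking picks a bucket. */

/* Hypothetical exact-match key.  The fields add up to 16 bytes with
 * no implicit padding, so hashing and memcmp() over the whole struct
 * are safe as long as 'pad' is kept zero. */
struct flow_key {
    uint32_t nw_src;            /* IPv4 source address. */
    uint32_t nw_dst;            /* IPv4 destination address. */
    uint16_t tp_src;            /* Transport source port. */
    uint16_t tp_dst;            /* Transport destination port. */
    uint16_t in_port;           /* Switch port the packet arrived on. */
    uint8_t nw_proto;           /* IP protocol number. */
    uint8_t pad;                /* Always zero. */
};

struct flow {
    struct flow_key key;
    uint64_t n_packets;
    struct flow *next;          /* Next flow in the same bucket. */
};

struct flow_hash_table {
    struct flow *buckets[N_BUCKETS];
};

/* 32-bit FNV-1a over the bytes of the key. */
uint32_t
flow_hash(const struct flow_key *key)
{
    const uint8_t *p = (const uint8_t *) key;
    uint32_t hash = 2166136261u;
    size_t i;

    for (i = 0; i < sizeof *key; i++) {
        hash ^= p[i];
        hash *= 16777619u;
    }
    return hash;
}

/* Pushes 'flow' onto the head of its bucket's chain. */
void
flow_insert(struct flow_hash_table *table, struct flow *flow)
{
    struct flow **bucket
        = &table->buckets[flow_hash(&flow->key) & (N_BUCKETS - 1)];
    flow->next = *bucket;
    *bucket = flow;
}

/* Returns the flow that exactly matches 'key', or NULL if there is
 * none.  The cost depends on the chain length in one bucket, not on
 * the total number of flows in the table. */
struct flow *
flow_lookup(const struct flow_hash_table *table, const struct flow_key *key)
{
    struct flow *flow = table->buckets[flow_hash(key) & (N_BUCKETS - 1)];

    for (; flow != NULL; flow = flow->next) {
        if (!memcmp(&flow->key, key, sizeof *key)) {
            return flow;
        }
    }
    return NULL;
}
```

The point of the sketch is the per-packet cost: one hash plus a short
chain walk, independent of how many flows are installed, in contrast
to evaluating a list of rules one after another.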

There is a tiny bit of performance testing in the paper at:
        http://openvswitch.org/papers/hotnets2009.pdf
The graph on page 5 is the main interesting bit.

I'm going to be off at HotNets tomorrow through Sunday, so my
email replies will probably become spottier than usual.



