[ovs-dev] The way SDN controllers deal with MTU
ihrachys at redhat.com
Mon Jun 20 12:16:56 UTC 2016
(For those of you who read openstack-dev@: you may notice some duplication between this email and the related thread: http://lists.openstack.org/pipermail/openstack-dev/2016-June/097189.html If that’s the case, sorry!)
tl;dr: many Open vSwitch based SDN controllers plug devices that are meant to have different MTUs into the same ‘integration’ bridge (usually called br-int), which sometimes makes the MTU settings for those devices ineffective. The Neutron team seeks guidance from Open vSwitch folks on how to proceed.
First, I’d like to note that when speaking about ‘Neutron’ below, I implicitly mean the ‘Neutron ML2/Open vSwitch reference implementation’. That said, I believe the same issues should affect other SDN solutions (OVN? dragonflow?) that are built on top of Open vSwitch and use a single integration bridge.
Now, let’s try to scope the problem. Neutron consistently uses a single bridge to plug all devices managed by a node. Those devices may belong to the same layer 2 domain ('network' in neutron-speak) or to different layer 2 domains. Those domains may be implemented with different encapsulation technologies, which, in the Neutron ML2 plugin case, results in different MTU values being calculated for different networks. All devices that belong to a single network are supposed to use the network MTU. That includes the virtualized interfaces inside VMs, as well as the devices on the data path from the VMs to the integration bridge. So, for a typical Neutron Open vSwitch setup, the following devices are meant to carry the network MTU:
VM interface - tap device - ‘hybrid’ Linux bridge* - VETH pair => plugged into br-int.
(* used for iptables based firewall)
Now, the Neutron (OpenStack Networking) and Nova (OpenStack Compute) components set the relevant MTUs on all of those devices (except the VM interface, which is usually configured by the guest OS itself, based on information provided through DHCP/RA responses or other means).
It all works as long as all devices we plug into br-int belong to networks with identical MTUs. But since Neutron allows for different MTUs, the assumption does not hold.
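To make the 'identical MTUs' assumption concrete, here is a minimal sketch of how per-network MTUs can end up different. The overhead values are my assumption: they are inferred from the VXLAN (1450) and GRE (1458) figures used later in this email, on top of an assumed 1500 byte underlay MTU. The function names are hypothetical and are not Neutron code.

```python
# Assumption: 1500-byte underlay MTU; overheads are inferred from the
# 1450 (VXLAN) and 1458 (GRE) figures quoted in this email.
PHYSICAL_MTU = 1500
ENCAP_OVERHEAD = {"vxlan": 50, "gre": 42}


def network_mtu(network_type, physical_mtu=PHYSICAL_MTU):
    """Return the MTU a network of the given type is meant to carry."""
    return physical_mtu - ENCAP_OVERHEAD[network_type]


def same_mtu_everywhere(network_types):
    """Return True if all networks plugged into one bridge share an MTU."""
    return len({network_mtu(t) for t in network_types}) <= 1


if __name__ == "__main__":
    for net_type in ("vxlan", "gre"):
        print(net_type, network_mtu(net_type))
    print(same_mtu_everywhere(["vxlan", "gre"]))
```

With only VXLAN networks on a node the check passes; as soon as a GRE network is added, it fails, which is the case this email is about.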
While Neutron indeed plugs devices that belong to different broadcast domains into the same switch, it does not intend to allow traffic that belongs to different domains to be switched between them. (All inter-domain communication is handled by virtual routers that are implemented as network namespaces.) Isolation is achieved through local VLAN tagging. Quoting:
"All VM VIFs are plugged into the integration bridge. VM VIFs on a given virtual network share a common “local” VLAN (i.e. not propagated externally). The VLAN id of this local VLAN is mapped to the physical networking details realizing that virtual network."
What it means is that while devices are plugged into the same bridge, due to the additional layer of isolation, Neutron effectively uses a single bridge as a set of switches, one per network participating in the bridge setup.
So, back to MTU. When I boot a VM on a VXLAN backed network, a tap device with MTU=1450 is plugged into the br-int bridge, which lowers the bridge MTU to 1450. When I then plug a device that belongs to a GRE network (MTU=1458) into that same bridge, the GRE network backed device also gets its MTU reduced to 1450, and no ‘ip link’ command can raise it back to the intended MTU=1458.
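The behaviour described above can be illustrated with a toy model. This is only a sketch of the observed 'least of all device MTUs' effect, not Open vSwitch code; the port names are made up for the example.

```python
def effective_mtus(requested):
    """Clamp every port to the smallest MTU requested on the bridge.

    `requested` maps a port name to the MTU its network asks for.
    The result maps each port name to the MTU it ends up with.
    """
    if not requested:
        return {}
    bridge_mtu = min(requested.values())
    return {port: bridge_mtu for port in requested}


# Hypothetical port names; the MTU values are the ones from this email.
ports = {"tap-vxlan": 1450}
ports["tap-gre"] = 1458  # the GRE backed device is plugged in second
result = effective_mtus(ports)

if __name__ == "__main__":
    print(result)
```

In this model the GRE backed port ends up with 1450 even though 1458 was requested, which matches what I see on a real node.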
Curiously, when I move the latter device into a network namespace and then try to set the MTU on that same device, it works. (Jiri Benc told me that a missing validation in the vswitchd code is what allows it.) We actually utilized that magic in a Neutron fix that makes router devices (which live in a namespace) get their intended MTU values: https://review.openstack.org/#/c/327651/ With that fix, we now first move the device into a namespace, and only then set its MTU.
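The ordering that the workaround relies on can be sketched as follows. This is not the code from the Neutron patch, only an illustration of 'move first, set MTU second' using standard iproute2 commands; the device and namespace names are placeholders.

```python
def mtu_commands(device, namespace, mtu):
    """Build the command sequence: move the device, then set its MTU.

    Returns a list of argv lists; nothing is executed here.
    """
    return [
        # Step 1: move the device into the target network namespace.
        ["ip", "link", "set", device, "netns", namespace],
        # Step 2: set the MTU from inside that namespace.
        ["ip", "netns", "exec", namespace,
         "ip", "link", "set", "dev", device, "mtu", str(mtu)],
    ]


if __name__ == "__main__":
    # Placeholder names, chosen only for the example.
    for argv in mtu_commands("qr-example", "qrouter-example", 1458):
        print(" ".join(argv))
```

Each argv list could be passed to subprocess.run() as root; reversing the two steps is what fails on a device that is still in the root namespace.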
There are several issues with the Neutron patch. First, it relies on a bug in Open vSwitch. Second, it does not solve the problem for other devices that are plugged into br-int and that don’t belong to separate namespaces (which are all VM VIFs in OpenStack).
One idea that Jiri Benc mentioned to me is to reimplement the Neutron bridge setup to use multiple bridges, one per network. That way, there would be no need to have devices with different MTUs on the same integration bridge. Isolation between domains would also be simplified, because we would no longer need to maintain any local VLAN tagging rules to isolate domains from each other; isolation would happen naturally, since every connection path between domains would pass through an L3 layer (a namespace).
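A rough sketch of what the per-network layout would look like, assuming a hypothetical 'br-<network>' naming scheme and made-up port and network names. It only groups ports; it does not create any bridges.

```python
from collections import defaultdict


def plan_bridges(ports):
    """Group ports into one bridge per network.

    `ports` is an iterable of (port_name, network_id, mtu) tuples.
    Returns {bridge_name: [(port_name, mtu), ...]}.
    """
    bridges = defaultdict(list)
    for name, network, mtu in ports:
        # Hypothetical naming scheme, used only for this illustration.
        bridges["br-" + network].append((name, mtu))
    return dict(bridges)


def bridge_mtu(members):
    """Smallest MTU among the ports of a single bridge."""
    return min(mtu for _, mtu in members)


# Made-up ports; the MTU values are the ones from this email.
plan = plan_bridges([
    ("tap-a", "net-vxlan", 1450),
    ("tap-b", "net-vxlan", 1450),
    ("tap-c", "net-gre", 1458),
])

if __name__ == "__main__":
    for bridge, members in plan.items():
        print(bridge, bridge_mtu(members), members)
```

Since every bridge then holds ports of a single network, the smallest MTU on each bridge is simply that network's MTU, and no port gets clamped below its intended value.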
If we were starting from scratch, it would probably be the best idea, with few drawbacks. Sadly, we are looking at a huge number of setups that rely on a single bridge for multiple domains, and as I said before, it’s not just Neutron. Migrating those existing workloads to a new, better bridge setup would be a huge pain, and I am not even sure whether it’s possible to replace the bridge setup without fully migrating the workloads to other nodes. That’s a huge engineering effort, and something that would need to happen in all affected SDN solutions.
One alternative could be for the kernel/vSwitch layer to allow relaxing the ‘least of all device MTUs’ rule for setups that explicitly ask for it. If such an option were available to SDN controllers, they could use it to keep their existing single bridge setup.
And that’s the end of the story. So, what do you think of the problem? Is the proposed alternative viable? If so, what’s the proper place for such configuration to exist: the kernel or OVS?
I would be glad to find a solution that is acceptable to both the Neutron and Open vSwitch communities, and that we can both support in the long run.