[ovs-dev] [PATCH RFC ovn] Add VXLAN support for non-VTEP datapath bindings

Ihar Hrachyshka ihrachys at redhat.com
Fri Mar 27 18:14:19 UTC 2020

On Wed, Mar 25, 2020 at 1:03 PM Ihar Hrachyshka <ihrachys at redhat.com> wrote:
> On Mon, Mar 23, 2020 at 7:47 PM Ben Pfaff <blp at ovn.org> wrote:
> >
> > On Mon, Mar 23, 2020 at 06:39:14PM -0400, Ihar Hrachyshka wrote:
> > > First, some questions as to implementation (or feasibility) of several
> > > todo items in my list for the patch.
> > >
> > > 1) I initially thought that, because VXLAN would have limited space
> > > for both networks and ports in its VNI, the encap type would not be
> > > able to support as many of both as Geneve / STT, and so we would need
> > > to enforce the limit programmatically somehow. But in OVN context, is
> > > it even doable? North DB resources may be created before any chassis
> > > are registered; once a chassis that is VXLAN only joins, it's too late
> > > to forbid the spilling resources from existence (though it may be a
> > > good time to detect this condition and perhaps fail to register the
> > > chassis / configure flow tables). How do we want to handle this case?
> > > Do we fail to start VXLAN configured ovn-controller when too many
> > > networks / ports per network created? Do we forbid creating too many
> > > resources when a chassis is registered that is VXLAN only? Both? Or do
> > > we leave it up to the deployment / CMS to control the chassis / north
> > > DB configuration?
> > >
> > > 2) Similar to the issue above, I originally planned to forbid using
> > > ACLs relying on ingress port when a VXLAN chassis is involved (because
> > > the VNI won't carry the information). I believe the approach should be
> > > similar to how we choose to handle the issue with the maximum number
> > > of resources, described above.
> > >
> > > I am new to OVN so maybe there are existing examples for such
> > > situations already that I could get inspiration from. Let me know what
> > > you think.
> >
> > I don't have good solutions for the above resource limit problems.  We
> > designed OVN so that this kind of resource limit wouldn't be a problem
> > in practice, so we didn't think through what would happen if the limits
> > suddenly became more stringent.
> >
> > I think that it falls upon the CMS by default.
> >
> For ACLs, I think it's fair to put the burden on CMS (just because it
> should be easy for them to follow the simple rule: "Don't use ingress
> matching ACLs in your OVN driver.")
> While having a guard against overflowing resource number limits in CMS
> may be helpful (for example, for immediate failure mode feedback to
> CMS user - compare to async notification about a CMS resource to OVSDB
> primitive conversion),
> I believe OVN should handle the case too. The risk of not doing it is
> - the limits are reached, and we start to send traffic that belongs to
> one network to another, because their lower 12 bits of datapath ID are
> the same.
> While CMS could guard against that, it may be less aware about chassis
> configuration than OVN. A dumb way to resolve this in CMS would be
> having a global configuration option set by deployment tool that
> configures OVN and that would know whether any VXLAN capable chassis
> are deployed in the cluster. A more proper way to solve it would be to
> make CMS aware of chassis configuration by maintaining a cache of
> Chassis table records and checking their encap types on each network /
> port created.
> The same could be done by OVN itself, and arguably OVN is the owner of
> the data source (encap records) and is in a better position to control
> it:
> 1. on network creation, if VXLAN is enabled on any chassis, count
> networks; if result >= limit, fail; same for ports per network;
> 2. on ovn-controller start, if VXLAN is enabled for the chassis,
> calculate networks / ports per network; if result >= limit, fail to
> start the service.
> Note that in most common scenario, all chassis have the same
> encapsulation types registered; there are multiple ovn-controller
> nodes; and resources are created after all chassis are registered in
> the database. So point (2) above is to handle a corner case that
> probably won't ever happen in real life. (1) is a hot path.
> Any specific objections to having this kind of guards in OVN itself?
> This may be in addition to CMS side guards (to avoid even trying to
> create CMS resources that are known to fail to sync to OVN).
> (A similar approach may be extended to ACLs allowed though it's not as
> pressing because there are no known CMS that rely on unsupported
> ACLs.)

The more I think about the issue, the more important it looks that OVN
be aware of VXLAN limitations and guard against overflowing the
number of resources. Here is why.

While a CMS could relatively easily control the overall number of
resources in the database (it should be aware of its own resource
records), it does not, in the general case, control the tunnel keys
selected for datapaths; OVN allocates the IDs on Datapath_Binding
creation. OVN selects datapath IDs sequentially, starting from 1 up to
the max value for the 24-bit ID, then wraps to the start. A problem
with this approach may occur when, after a significant number of
networks have been created and then deleted, the "next tunnel ID"
counter reaches the "edge" of the 12-bit space available for unique
VXLAN datapath identifiers. Once a new logical switch creation request
is submitted, OVN may then allocate a datapath ID whose lower 12 bits
are the same as those of another existing switch (the full 24-bit
datapath ID would be unique, but that won't translate into a unique ID
passed to a remote hypervisor through the VXLAN VNI, due to the
proposed 12/12-bit split scheme).

This is probably a bit convoluted, so to give an example: consider a
network A with datapath ID = 0b000000000000000000000001. When VXLAN is
enabled, we truncate the datapath ID to 12 bits before setting it in
the outgoing packet metadata. Then network B is created with datapath
ID = 0b000000000001000000000001. (Note the two bits set.) This unique
datapath ID will map to the same 12-bit value when it is set for the
outgoing packet, causing traffic from one network to flow into the
other.
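
The collision in this example can be checked with a few lines of
Python. This is only an illustrative sketch: the helper name
`vxlan_datapath_key` and the mask constant are hypothetical, and it
assumes the lower 12 bits of the datapath ID are what ends up in the
VNI, per the proposed 12/12-bit split.

```python
# Illustrative only: assumes the low 12 bits of the 24-bit datapath ID
# are what gets carried in the VXLAN VNI (the proposed 12/12-bit split).
VXLAN_DP_BITS = 12
VXLAN_DP_MASK = (1 << VXLAN_DP_BITS) - 1  # 0xFFF


def vxlan_datapath_key(datapath_id: int) -> int:
    """Hypothetical helper: truncate a datapath ID to its low 12 bits."""
    return datapath_id & VXLAN_DP_MASK


network_a = 0b000000000000000000000001
network_b = 0b000000000001000000000001

# The two IDs are distinct as 24-bit values...
assert network_a != network_b
# ...but compare equal once truncated for VXLAN.
collide = vxlan_datapath_key(network_a) == vxlan_datapath_key(network_b)
print(collide)
```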

Note that in this example, the number of switches in the database is
below the maximum number allowed for VXLAN (2^12). The only way a CMS
could guard against this scenario is by monitoring all tunnel keys
allocated to all datapaths and explicitly requesting tunnel keys when
creating new switches, doing it in a way that would not produce a
12-bit clash. (There is already the `requested-tnl-key` option for

It is not a good idea to offload tunnel key management onto CMS (or at
least it's not a good idea to assume that all CMS implement this
correctly, considering that the risk of not doing so has serious
tenant privacy and connectivity implications). My belief is that OVN
should detect when VXLAN is enabled in the cluster, in which case the
datapath ID range to allocate to new switches would shrink from 24
bits to 12 bits (2^24 -> 2^12). This would
involve additional work on the OVN side; specifically, ovn-northd
would need to, on switch and port creation, detect VXLAN mode by
fetching (probably subscribing to and caching) all chassis encaps and
checking if any have VXLAN enabled, and if so, adjust the maximum
allowed value for datapath IDs to 2^12.
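
The detection and range adjustment described above could look roughly
like the sketch below. It is not OVN code: `vxlan_enabled`,
`allocate_dp_key`, and the constants are hypothetical names, chassis
encaps are modelled as plain sets of strings, and usable keys are
assumed to run from 1 to 2^N - 1.

```python
from typing import Iterable, Optional, Set

MAX_DP_KEY = (1 << 24) - 1        # regular 24-bit datapath key space
MAX_DP_KEY_VXLAN = (1 << 12) - 1  # reduced space when VXLAN is in use


def vxlan_enabled(chassis_encaps: Iterable[Set[str]]) -> bool:
    """True if any chassis lists 'vxlan' among its encap types."""
    return any("vxlan" in encaps for encaps in chassis_encaps)


def allocate_dp_key(used: Set[int], hint: int, max_key: int) -> Optional[int]:
    """Scan sequentially after `hint`, wrapping at max_key; None if full."""
    for offset in range(max_key):
        candidate = (hint + offset) % max_key + 1  # stays within 1..max_key
        if candidate not in used:
            return candidate
    return None


# Hypothetical cluster state: one chassis has VXLAN enabled.
encaps = [{"geneve"}, {"geneve", "vxlan"}]
max_key = MAX_DP_KEY_VXLAN if vxlan_enabled(encaps) else MAX_DP_KEY

used = {1, 2, 4095}
key = allocate_dp_key(used, hint=4095, max_key=max_key)
print(key)
```

With the reduced range, the allocator wraps at the 12-bit boundary and
reports exhaustion (returns None) instead of handing out a key that
would clash after truncation.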

Another issue that I initially hadn't considered, related to the
available space for port IDs: I assumed 2^12 port IDs would be
available per network in the proposed solution, but I missed that OVN
allocates a separate sub-range for multicast groups that occupies half
of the total range for port IDs. (The reserved multicast space is IDs
32768 through 65535.) Perhaps having 2^11 unique port IDs is still
OK, but since we have already reduced the available limits pretty
significantly, this is something to keep in mind.
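
To make the arithmetic explicit, here is a small Python check. It
assumes (this is not decided anywhere above) that a 12-bit VXLAN port
space would keep the same half/half split between regular ports and
multicast groups as the existing 16-bit space; the variable names are
illustrative.

```python
# Existing port key space: multicast groups use the upper half.
FULL_BITS = 16
MCAST_MIN, MCAST_MAX = 32768, 65535
assert MCAST_MIN == 1 << (FULL_BITS - 1)
assert MCAST_MAX == (1 << FULL_BITS) - 1

# Assumption: the 12-bit VXLAN port space keeps the same split.
VXLAN_PORT_BITS = 12
vxlan_mcast_min = 1 << (VXLAN_PORT_BITS - 1)  # first multicast ID
vxlan_mcast_max = (1 << VXLAN_PORT_BITS) - 1  # last multicast ID
vxlan_port_ids = vxlan_mcast_min              # values below the multicast range

print(vxlan_port_ids, vxlan_mcast_min, vxlan_mcast_max)
```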

Let me know what you think.
