[ovs-dev] [PATCH] VxLAN-gpe implementation

Alexander Duyck alexander.duyck at gmail.com
Wed Jun 8 21:21:24 UTC 2016


On Wed, Jun 8, 2016 at 12:46 PM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
> On 08.06.2016 17:38, Alexander Duyck wrote:
>> On Wed, Jun 8, 2016 at 7:48 AM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
>>> On 08.06.2016 14:51, Jiri Benc wrote:
>>>> On Mon, 6 Jun 2016 14:22:58 -0700, Jesse Gross wrote:
>>>>> On Sat, Jun 4, 2016 at 6:39 AM, Yi Yang <yi.y.yang at intel.com> wrote:
>>>>> [...]
>>>>>>  datapath/vport-netdev.c                           |   3 +-
>>>>>>  datapath/vport-vxlan.c                            |  17 ++-
>>>>>
>>>>> These changes aren't upstream yet. Please do that before backporting them here.
>>>>>
>>>>> However, the changes to vport-vxlan.c are modifying compatibility code
>>>>> that shouldn't be extended further. Instead, just use the existing
>>>>> VXLAN netlink interfaces that have already been created to enable
>>>>> these features.
>>>>>
>>>>> There are also a number of other patches to the OVS kernel module/VXLAN
>>>>> that have not been backported. Pravin started doing this work but it
>>>>> hasn't been applied yet. In general, I think it makes sense to
>>>>> backport patches in order so that the diffs of the patches match those
>>>>> of upstream.
>>>>>
>>>>> Finally, I have a question about receive side offloading with
>>>>> VXLAN-gpe. This is primarily an upstream issue but is present in the
>>>>> code being backported here as well. The VXLAN code sets up receive
>>>>> offloads for all ports regardless of whether they are classic VXLAN or
>>>>> L2/L3 GPE and expects NICs to parse the packets. I don't think this is
>>>>> safe because there are a number of NICs out there that predate the
>>>>> existence of GPE and therefore won't do this parsing correctly. I
>>>>> think that it is necessary to disable receive offloading for
>>>>> non-Ethernet VXLAN-GPE unless the offloading interface is extended.
>>>>
>>>> Coincidentally, I was talking about this with Hannes a few days ago.
>>>> I'm adding him to CC.
>>>>
>>>> I guess you're referring to ndo_add_vxlan_port, right? I agree that
>>>> this interface needs changes, especially considering that we know
>>>> whether the UDP port belongs to VXLAN or VXLAN-GPE. But from my
>>>> understanding of how drivers use this callback, the worst thing that
>>>> could happen is suboptimal generation of rx hashes and thus steering
>>>> the packets to a different receive queue than in the optimal case.
>>>> Surely something to fix, but it seems it won't cause much functional
>>>> trouble with the current code?
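
(For reference, the callback under discussion looks roughly like this
in kernels of this era -- the driver is told only an address family
and a UDP port, with no tunnel type and no local endpoint address:

    /* members of struct net_device_ops (roughly) */
    void (*ndo_add_vxlan_port)(struct net_device *dev,
                               sa_family_t sa_family, __be16 port);
    void (*ndo_del_vxlan_port)(struct net_device *dev,
                               sa_family_t sa_family, __be16 port);

so a NIC cannot currently distinguish a VXLAN port from a VXLAN-GPE
port at registration time.)
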
>>>
>>> I am not sure if we must upgrade the interface. Can't drivers always
>>> configure vxlan-gpe and remain backwards compatible?
>>>
>>> Non-vxlan-gpe-capable hardware would have to abort checksum offloading
>>> as soon as it can't interpret the vxlan header anyway, so the packets
>>> end up on the slow path and nothing bad should happen.
>>>
>>> Possibly some hardware will verify inner checksums even though it
>>> cannot understand the vxlan header completely. In this case we will
>>> probably drop the packet in the driver. Anyway, I would be in favor
>>> of simply presenting one knob, namely vxlan-offloading, to the user,
>>> instead of a knob for each version of vxlan.
>>>
>>> Unfortunately I couldn't get a definitive answer from the specs
>>> regarding the checksumming details.
>>>
>>> Bye,
>>> Hannes
>>
>> This is starting to sound like the same conversation we had on netdev
>> when the ndo_add_geneve_port was added.  One easy fix for guaranteeing
>> that we can perform the checksum offload is to just enable the outer
>> UDP checksum.  Then we can still perform GRO and use the outer UDP
>> source port for generating a hash.  If possible we should make this
>> the default for all new UDP-based tunnels going forward, simply
>> because it allows for backwards compatibility with existing offloads.
>
> Yes, but this time we are in vxlan-versioning-world only. :)

Right.  It leads me back to my original thought, which was that we
should be providing the UDP tunnel type via something like an
enumerated type so drivers could just opt in when they see a tunnel
type they recognize.  Having to add a function for each new tunnel
type is just silly.  Then support for VXLAN could be kept separate
from VXLAN-GPE without having to add a whole new set of functions.
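
(A hypothetical sketch of what such an interface could look like --
the names below are made up for illustration and are not an existing
kernel API:

    enum udp_tunnel_type {
            UDP_TUNNEL_TYPE_VXLAN,
            UDP_TUNNEL_TYPE_GENEVE,
            UDP_TUNNEL_TYPE_VXLAN_GPE,
    };

    struct udp_tunnel_info {
            enum udp_tunnel_type type;
            sa_family_t sa_family;
            __be16 port;
    };

    /* Would replace ndo_add_vxlan_port/ndo_add_geneve_port in
     * struct net_device_ops; a driver simply ignores tunnel types
     * it does not recognize instead of needing a new callback per
     * protocol. */
    void (*ndo_udp_tunnel_add)(struct net_device *dev,
                               struct udp_tunnel_info *ti);
    void (*ndo_udp_tunnel_del)(struct net_device *dev,
                               struct udp_tunnel_info *ti);

With that, a driver that predates VXLAN-GPE would never match
UDP_TUNNEL_TYPE_VXLAN_GPE and so would simply leave those ports
alone.)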

> Hmm, it is a good question why we didn't do that already. Do you
> remember the reasons?

Nope.  Although back when we last had this discussion LCO (local
checksum offload) and GSO partial didn't exist.  Now that they do, we
should probably just take advantage of them, since then we can get
offloads automatically for almost every new tunnel protocol, assuming
the NICs aren't relying on parsing the frames in hardware/firmware for
such an offload.
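
(The arithmetic behind LCO is easy to demonstrate.  A minimal
userspace sketch in plain C -- not kernel code -- showing the property
it relies on: once a packet's Internet-checksum field is filled in,
the covered bytes sum to 0xffff, i.e. they are checksum-neutral, so an
outer UDP checksum can be computed from the outer headers alone
without touching the inner payload:

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    /* RFC 1071 Internet checksum over big-endian 16-bit words. */
    static uint16_t csum16(const uint8_t *buf, size_t len)
    {
            uint32_t sum = 0;
            for (size_t i = 0; i < len; i++)
                    sum += (i & 1) ? buf[i] : (uint32_t)buf[i] << 8;
            while (sum >> 16)
                    sum = (sum & 0xffff) + (sum >> 16);
            return (uint16_t)~sum;
    }

    int main(void)
    {
            /* Pretend inner packet; bytes 2-3 are its checksum field. */
            uint8_t pkt[32] = { 0 };
            for (size_t i = 4; i < sizeof(pkt); i++)
                    pkt[i] = (uint8_t)(i * 7);

            uint16_t c = csum16(pkt, sizeof(pkt)); /* field still zero */
            pkt[2] = (uint8_t)(c >> 8);
            pkt[3] = (uint8_t)(c & 0xff);

            /* With the checksum in place the one's-complement sum of
             * the whole packet is 0xffff, so these bytes contribute
             * nothing to an outer checksum that covers them. */
            printf("sum over inner packet: 0x%04x\n",
                   (uint16_t)~csum16(pkt, sizeof(pkt)));
            return 0;
    }

This is why enabling the outer UDP checksum is nearly free on transmit
once the inner checksum is being offloaded.)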

Also, I know there are concerns about regressions on the older tunnel
protocols such as VXLAN and GENEVE, since there is already support out
there in hardware and enabling the outer checksum could hurt
performance when talking to an older kernel.  For newer protocols such
as VXLAN-GPE, though, we can probably get away with just changing the
default early.  It is just a matter of finding a good place to do
that.  The problem there is that the way we specify checksums for OVS
right now is flag-based, if I recall correctly, and I haven't found a
good way to deal with it without possibly introducing regressions on
older kernels for the VXLAN and GENEVE tunnels.

>> Really I wonder if we shouldn't just start pulling out all the support
>> for the ndo_add_vxlan/geneve_port code anyway, or at least start
>> forcing it to include more data such as an endpoint IP/IPv6 address
>
> As far as I know, Intel cards don't support adding UDP tuples, but
> only simple ports, to the list of udp protocol offloads.

Right.  That is one of the ones I am kind of grumbling about.  Really,
it is a poor design, prone to easily triggering false positives.

> On the other hand, I think the current way we do it is quite
> okayish. At the same time as we establish the udp offload, we
> unconditionally add the socket to the wildcard socket table (local_ip ==
> 0). At least no other socket can bind to the same address (maybe
> IP_FREEBIND etc. are an exception here).

Right.  It is the "ish" part that bugs me.  I don't like us doing
things like leaking information across PF/VF boundaries between
namespaces.  With a little poking and prodding it becomes pretty easy
for a VF to figure out how the PF is configured in terms of what ports
are being offloaded.  Worse yet, hardware may have exploitable bugs
that can be targeted based on which protocols are being offloaded.

>> and tunnel type.  There end up being a number of cases where the
>> offload could assert itself where it isn't actually wanted, such as
>> a mixed environment where the PF is setting up the offloads via
>> ndo_add_vxlan_port and then applying them to VFs or other PFs on the
>> same device which aren't operating in the same network namespace.
>> The worst case ends up being that you have UDP checksum offloads
>> stripped for flows that would otherwise have been reporting a valid
>> checksum, because the interfaces think they have tunnels with an
>> invalid inner checksum, or no L4 checksum at all.
>
> Okay, if the network card mixes up offloads across VF interfaces we
> certainly can't trust it. Is this the case, or did I misunderstand
> your example?

I think you got the idea.  One of my gripes is that if I use an
i40e-based NIC, for instance, any offloads I configure on the PF
affect the VFs.  That is fine when they are all in the same namespace.
If I move a VF into a guest or a separate namespace, though, that
shouldn't happen, but it does, because many of these offloads are per
device.  The same problem occurs with fm10k, only it is worse in that
case because the port configuration is global to the switch, if I
recall correctly.

> Otherwise regarding the vxlan offloading config on the cards, I think we
> are fine right now. Even during a namespace change of a device we
> correctly flush the ports from the firmware.

Right.  If these were single-interface cards I wouldn't really have an
issue; however, the fact is most of them aren't.  They almost all
support multiple ports and SR-IOV, and in many cases the firmware
makes these offload ports global to the device.  What we really should
have is something like an N-tuple-type filter that uses a combination
of destination MAC address, IP, destination UDP port, and maybe even
the physical port on the device to identify where we want the offloads
and where we don't (see the sketch below).  Right now it seems like
most of the vendors are playing fast and loose with the offloads,
applying them to both Tx and Rx, which makes a real mess of things.
On top of that many only support one port, which makes things even
worse because then we have a mix of traffic where some is offloaded
and some isn't.
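
(A hypothetical sketch of such a filter key -- made-up names, for
illustration only:

    /* Hypothetical n-tuple key for scoping tunnel offloads; nothing
     * like this exists today, it only illustrates the idea. */
    struct tunnel_offload_filter {
            u8              dst_mac[ETH_ALEN]; /* local endpoint MAC */
            union {
                    __be32          dst_ip4;   /* local endpoint IPv4 */
                    struct in6_addr dst_ip6;   /* local endpoint IPv6 */
            };
            sa_family_t     sa_family;
            __be16          dst_port;          /* tunnel UDP port */
            u32             phys_port;         /* physical port on device */
    };

A driver would then apply the Rx/Tx offloads only to frames matching
the whole tuple, instead of keying on the UDP port alone and applying
it globally across PFs and VFs.)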

- Alex


