[ovs-dev] [PATCH] VxLAN-gpe implementation

Hannes Frederic Sowa hannes at redhat.com
Wed Jun 8 22:20:16 UTC 2016


On 08.06.2016 23:21, Alexander Duyck wrote:
> On Wed, Jun 8, 2016 at 12:46 PM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
>> On 08.06.2016 17:38, Alexander Duyck wrote:
>>> On Wed, Jun 8, 2016 at 7:48 AM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
>>>> On 08.06.2016 14:51, Jiri Benc wrote:
>>>>> On Mon, 6 Jun 2016 14:22:58 -0700, Jesse Gross wrote:
>>>>>> On Sat, Jun 4, 2016 at 6:39 AM, Yi Yang <yi.y.yang at intel.com> wrote:
>>>>>> [...]
>>>>>>>  datapath/vport-netdev.c                           |   3 +-
>>>>>>>  datapath/vport-vxlan.c                            |  17 ++-
>>>>>>
>>>>>> These changes aren't upstream yet. Please get them upstream first before backporting them here.
>>>>>>
>>>>>> However, the changes to vport-vxlan.c are modifying compatibility code
>>>>>> that shouldn't be extended further. Instead, just use the existing
>>>>>> VXLAN netlink interfaces that have already been created to enable
>>>>>> these features.
>>>>>>
>>>>>> There are also a number of other patches to the OVS kernel module/VXLAN
>>>>>> that have not been backported. Pravin started doing this work but it
>>>>>> hasn't been applied yet. In general, I think it makes sense to
>>>>>> backport patches in order so that the diffs of the patches match those
>>>>>> of upstream.
>>>>>>
>>>>>> Finally, I have a question about receive side offloading with
>>>>>> VXLAN-gpe. This is primarily an upstream issue but is present in the
>>>>>> code being backported here as well. The VXLAN code sets up receive
>>>>>> offloads for all ports regardless of whether they are classic VXLAN or
>>>>>> L2/L3 GPE and expects NICs to parse the packets. I don't think this is
>>>>>> safe because there are a number of NICs out there that predate the
>>>>>> existence of GPE and therefore won't do this parsing correctly. I
>>>>>> think that it is necessary to disable receive offloading for
>>>>>> non-Ethernet VXLAN-GPE unless the offloading interface is extended.
>>>>>
>>>>> Coincidentally, I was talking about this with Hannes a few days ago.
>>>>> I'm adding him to CC.
>>>>>
>>>>> I guess you're referring to ndo_add_vxlan_port, right? I agree that
>>>>> this interface needs changes, especially considering that we know
>>>>> whether the UDP port belongs to VXLAN or VXLAN-GPE. But from my
>>>>> understanding of how drivers use this callback, the worst thing that
>>>>> could happen is suboptimal generation of rx hashes and thus steering
>>>>> the packets to a different receive queue than in the optimal case.
>>>>> Surely something to fix, but it seems it won't cause much
>>>>> functional trouble with the current code?
>>>>
>>>> I am not sure we need to upgrade the interface. Can't drivers
>>>> always configure vxlan-gpe ports and remain backwards compatible?
>>>>
>>>> Non vxlan-gpe capable hardware would have to abort checksum
>>>> offloading as soon as it can't interpret the vxlan header anyway, so
>>>> the packets end up on the slow path and nothing bad should happen.
>>>>
>>>> Possibly some hardware will verify inner checksums even though it
>>>> cannot fully understand the vxlan header. In this case we would
>>>> probably drop the packet in the driver. Anyway, I would be in favor
>>>> of simply presenting one knob, namely vxlan-offloading, to the user,
>>>> instead of a knob for each version of vxlan.
>>>>
>>>> Unfortunately I couldn't get a definitive answer from the specs
>>>> regarding the checksumming details.
>>>>
>>>> Bye,
>>>> Hannes
>>>
>>> This is starting to sound like the same conversation we had on netdev
>>> when the ndo_add_geneve_port was added.  One easy fix for guaranteeing
>>> that we can perform the checksum offload is to just enable the outer
>>> UDP checksum.  Then we can still perform GRO and use the outer UDP
>>> source port for generating a hash.  If possible we should make this
>>> the default for all new UDP based tunnels going forward simply because
>>> it allows for backwards-compatibility with existing offloads.
>>
>> Yes, but this time we are in vxlan-versioning-world only. :)
> 
> Right.  It leads me back to the original thought I had, which was that
> we should be providing the UDP tunnel type via something like an
> enumerated type so drivers could just opt in if they see a tunnel type
> they recognize.  Having to add a function for each new tunnel type is
> just silly.  Then we could have support for VXLAN be separate from
> VXLAN-GPE without having to add a whole new set of functions.

Partially agreed. I hope we can soon provide a patch that allows us to
query the offloading state of networking cards, a la ethtool. My idea
was to make this depend on whether the ndo_op for vxlan or geneve is
non-NULL.
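
For reference, the two callbacks in question only carry an address
family and a port today (from include/linux/netdevice.h, roughly):

    /* one pair of ndo ops per tunnel protocol; note that the vxlan op
     * has no way to distinguish classic VXLAN from VXLAN-GPE */
    void (*ndo_add_vxlan_port)(struct net_device *dev,
                               sa_family_t sa_family,
                               __be16 port);
    void (*ndo_add_geneve_port)(struct net_device *dev,
                                sa_family_t sa_family,
                                __be16 port);

(plus the matching ndo_del_* pair), so checking the ops for NULL is
about the only generic signal we can derive from them.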

Right now we still have the problem that whether the card actually
supports udp offloading for vxlan or geneve does not depend on whether
the ndo op is available; it can only be checked inside the specific
handler. This should be factored out, and I don't see a reason why we
shouldn't convert to an enum style.
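
Just to make the idea concrete, something like this (all names made
up, purely a sketch):

    enum udp_tunnel_type {
            UDP_TUNNEL_TYPE_VXLAN,
            UDP_TUNNEL_TYPE_VXLAN_GPE,
            UDP_TUNNEL_TYPE_GENEVE,
    };

    struct udp_tunnel_info {
            enum udp_tunnel_type type;
            sa_family_t          sa_family;
            __be16               port;
    };

    /* a single pair of ops instead of one pair per protocol; drivers
     * simply ignore tunnel types they do not recognize */
    void (*ndo_udp_tunnel_add)(struct net_device *dev,
                               struct udp_tunnel_info *ti);
    void (*ndo_udp_tunnel_del)(struct net_device *dev,
                               struct udp_tunnel_info *ti);

New tunnel types would then only extend the enum instead of touching
every driver.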

>> Hmm, it is a good question why we didn't do that already. Do you
>> remember the reasons?
> 
> Nope.  Although back when we last had this discussion, LCO and GSO
> partial didn't exist.  Now that they do, we should probably just take
> advantage of them, since then we get offloads automatically for
> almost every new tunnel protocol, assuming the NICs aren't relying on
> parsing the frames in hardware/firmware for such an offload.

I fear that most NICs are actually relying on parsing the frames in
hardware/firmware. Isn't the only kind of modern networking card with
full CHECKSUM_PARTIAL support the e1000(e) in my laptop? :)
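
That said, with LCO the transmit side should not need any parsing at
all anymore: once the outer UDP checksum is enabled, it can be derived
arithmetically from the inner CHECKSUM_PARTIAL setup, without another
pass over the payload. Roughly the CHECKSUM_PARTIAL branch of
udp_set_csum() in net/ipv4/udp.c, if I read it correctly:

    /* the inner checksum field already holds its pseudo-header
     * checksum for the hardware offload; lco_csum() turns that into a
     * checksum of the outer UDP payload, so only the outer
     * pseudo-header is left to add */
    uh->check = 0;
    uh->check = udp_v4_check(len, saddr, daddr, lco_csum(skb));
    if (uh->check == 0)
            uh->check = CSUM_MANGLED_0;

The NIC then only ever sees one generic csum_start/csum_offset request
for the inner checksum, no vxlan parsing involved.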

> Also I know there are concerns about regressions on the older tunnel
> protocols such as VXLAN and GENEVE since there is already support out
> there in hardware and it could hurt performance to enable the outer
> checksum if you are talking with an older kernel, but for newer
> protocols such as VXLAN-GPE we can probably get away with just
> changing the default early.  It is just a matter of finding a good
> place to do that.  The problem there is that the way we specify
> checksums for OVS right now is flag based, if I recall, and I haven't
> found a good way to deal with it without possibly introducing
> regressions on older kernels for the VXLAN and GENEVE tunnels.

We might be able to feed information from the offload back into the
tunnel config, maybe. On Intel cards I don't see a reason why the outer
checksum should hurt. Especially since, once the outer checksum has
been validated, we should be able to ignore the rest? If the opposite
side is hurt by that, hmm, this could be handled by the OAM part of
vxlan-gpe. ;)

>>> Really I wonder if we shouldn't just start pulling out all the support
>>> for the ndo_add_vxlan/geneve_port code anyway, or at least start
>>> forcing it to include more data such as an endpoint IP/IPv6 address
>>
>> As far as I know Intel cards don't support adding full UDP tuples,
>> only plain destination ports, to the list of udp protocol offloads.
> 
> Right.  That is one of the ones I am kind of grumbling about.  Really
> it is a poor design, prone to easily triggering false positives.

Totally agreed, but I fear we have to deal with the hardware at hand. :/

>> On the other hand, I think the current way we do it is quite
>> okayish. At the same time as we establish the udp offload, we
>> unconditionally add the socket to the wildcard socket table (local_ip ==
>> 0). At least no other socket can then bind to that address/port (maybe
>> IP_FREEBIND etc. are an exception here).
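
(To make "establish the udp offload" above concrete: this happens when
the tunnel driver creates its kernel UDP socket. A rough sketch of the
pattern from drivers/net/vxlan.c, using the udp_tunnel helpers:

    struct udp_port_cfg udp_conf;
    struct socket *sock;
    int err;

    memset(&udp_conf, 0, sizeof(udp_conf));
    udp_conf.family = AF_INET;
    /* local_ip stays zeroed -> wildcard bind; the port is taken and
     * no other socket can bind it */
    udp_conf.local_udp_port = htons(4789);

    err = udp_sock_create(net, &udp_conf, &sock);

Only after that is the port announced to the NIC via the ndo op.)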
> 
> Right.  It is the "ish" part that bugs me.  I don't like us doing
> things like leaking stuff into namespaces between PF/VF boundaries.
> With a little poking and prodding it becomes pretty easy for a VF to
> figure out how the PF is configured in terms of what ports are being
> offloaded.  Worse yet if hardware has exploits that can be targeted
> based on offloading certain protocols.

I fully agree, stateless offloads definitely win over all the
approaches most cards implement so far. I would be very happy if we
could do away with those protocol-specific tunneling offload handlers.

>>> and tunnel type.  There end up being a number of cases where the
>>> offload could end up asserting itself where it isn't actually wanted
>>> such as a mixed environment where the PF is setting up the offloads
>>> via ndo_add_vxlan_port and then trying to apply that to VFs or other
>>> PFs on the same device which aren't operating in the same network
>>> namespace.  The worst case ends up being that you have UDP checksum
>>> offloads stripped for flows that would have otherwise been reporting
>>> valid checksum because the interfaces think they have tunnels with an
>>> invalid inner checksum, or no L4 checksum at all.
>>
>> Okay, if the networking card mixes up offloads across vf interfaces we
>> certainly can't trust them. Is this the case, or did I misunderstand
>> your example?
> 
> I think you got the idea.  One of my gripes is if I use an i40e based
> NIC for instance any offloads I configure on the PF affect the VFs.
> That is fine when they are all in the same namespace.  If I move a VF
> into a guest or a separate namespace though that shouldn't happen and
> it does because many of these offloads are per device.  The same
> problem occurs with fm10k, only it is worse in that case because the
> port configuration is global to the switch if I recall correctly.

$#!%$?! :)

That means it also affects VMs by design? This is absolutely
unacceptable. On fm10k, for example, if I configure a vxlan-offloaded
udp port on a vf, it will propagate to the pf? Then this can't be used
in production, I fear.

>> Otherwise, regarding the vxlan offloading config on the cards, I think
>> we are fine right now. Even during a namespace change of a device we
>> correctly flush the ports from the firmware.
> 
> Right.  If these were single interface cards I wouldn't really have an
> issue, however the fact is most of them aren't.  They almost all
> support multiple ports and SR-IOV and in many cases the firmware makes
> these offload ports global to the device.  What we really should have
> is something like a N-tuple type filter that uses a combination of
> destination MAC address, IP, destination UDP port, and maybe even
> physical port on the device that could be used to identify where we
> want the offloads and where we don't.  Right now it seems like most of
> the vendors are playing fast and loose with the offloads and applying
> them to Tx and Rx which makes a real mess of things.  On top of that
> many only support one port which makes things even worse because then
> we have a mix of traffic where some is offloaded and some isn't.

True. If I recall correctly, only i40e supports 16 offloaded ports;
most others support just one. Could you imagine networking cards that
offload the standardized vxlan-gpe ports by default?
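
Purely hypothetical, but even a small filter tuple along the lines you
describe would be a big improvement over a bare port number (every
name below is made up):

    struct udp_tunnel_offload_filter {
            u8                   phys_port;      /* physical port on the NIC */
            u8                   dmac[ETH_ALEN]; /* destination MAC */
            union {
                    __be32           dst_ip4;
                    struct in6_addr  dst_ip6;
            };
            __be16               dst_port;       /* UDP destination port */
            enum udp_tunnel_type type;           /* from the sketch above */
    };

That would at least let us scope offloads to the function and
namespace that actually asked for them.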

Okay, with outer checksums we can probably circumvent most of the
problems with those protocol-specific offloads.

The remaining problem regarding offloads would be that, without the
special offloading rule, a vxlan stream is by default processed on a
single core: we tell network cards not to hash udp ports into the
rxhash, which hurts a lot in the case of vxlan, where flow
identification relies on the source port when no offloading is
available.
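
For reference, this source-port trick is deliberate on the transmit
side: vxlan spreads the inner flow hash into the outer UDP source
port, roughly like vxlan_xmit_one() in drivers/net/vxlan.c does:

    /* put the inner flow's entropy into the outer source port, so
     * receivers can spread flows across queues/cores without parsing
     * the inner headers */
    src_port = udp_flow_src_port(dev_net(dev), skb,
                                 vxlan->cfg.port_min,
                                 vxlan->cfg.port_max, true);

If the receiver then refuses to feed udp ports into the rxhash, all of
that entropy is lost again.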

Bye,
Hannes



