[ovs-dev] [PATCH] VxLAN-gpe implementation

Alexander Duyck alexander.duyck at gmail.com
Thu Jun 9 02:33:23 UTC 2016


On Wed, Jun 8, 2016 at 3:20 PM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
> On 08.06.2016 23:21, Alexander Duyck wrote:
>> On Wed, Jun 8, 2016 at 12:46 PM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
>>> On 08.06.2016 17:38, Alexander Duyck wrote:
>>>> On Wed, Jun 8, 2016 at 7:48 AM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
>>>>> On 08.06.2016 14:51, Jiri Benc wrote:
>>>>>> On Mon, 6 Jun 2016 14:22:58 -0700, Jesse Gross wrote:
>>>>>>> On Sat, Jun 4, 2016 at 6:39 AM, Yi Yang <yi.y.yang at intel.com> wrote:
>>>>>>> [...]
>>>>>>>>  datapath/vport-netdev.c                           |   3 +-
>>>>>>>>  datapath/vport-vxlan.c                            |  17 ++-
>>>>>>>
>>>>>>> These changes aren't upstream yet. Please do that before backporting them here.
>>>>>>>
>>>>>>> However, the changes to vport-vxlan.c are modifying compatibility code
>>>>>>> that shouldn't be extended further. Instead, just use the existing
>>>>>>> VXLAN netlink interfaces that have already been created to enable
>>>>>>> these features.
>>>>>>>
>>>>>>> There are also a number of other patches to the OVS kernel module/VXLAN
>>>>>>> that have not been backported. Pravin started doing this work but it
>>>>>>> hasn't been applied yet. In general, I think it makes sense to
>>>>>>> backport patches in order so that the diffs of the patches match those
>>>>>>> of upstream.
>>>>>>>
>>>>>>> Finally, I have a question about receive side offloading with
>>>>>>> VXLAN-gpe. This is primarily an upstream issue but is present in the
>>>>>>> code being backported here as well. The VXLAN code sets up receive
>>>>>>> offloads for all ports regardless of whether they are classic VXLAN or
>>>>>>> L2/L3 GPE and expects NICs to parse the packets. I don't think this is
>>>>>>> safe because there are a number of NICs out there that predate the
>>>>>>> existence of GPE and therefore won't do this parsing correctly. I
>>>>>>> think that it is necessary to disable receive offloading for
>>>>>>> non-Ethernet VXLAN-GPE unless the offloading interface is extended.
>>>>>>
>>>>>> Coincidentally, I was talking about this with Hannes a few days ago.
>>>>>> I'm adding him to CC.
>>>>>>
>>>>>> I guess you're referring to ndo_add_vxlan_port, right? I agree that
>>>>>> this interface needs changes, especially considering that we know
>>>>>> whether the UDP port belongs to VXLAN or VXLAN-GPE. But from my
>>>>>> understanding of how drivers use this callback, the worst thing that
>>>>>> could happen is suboptimal generation of rx hashes and thus steering
>>>>>> the packets to a different receive queue than in the optimal case.
>>>>>> Surely something to fix, but it seems it won't cause much functional
>>>>>> trouble with the current code?
>>>>>
>>>>> I am not sure if we must upgrade the interface. Can't drivers always
>>>>> configure vxlan-gpe and remain backwards compatible?
>>>>>
>>>>> Non-vxlan-gpe-capable hardware would have to abort checksum offloading
>>>>> as soon as it can't interpret the vxlan header anyway, so the packets
>>>>> end up on the slow path and nothing bad should happen.
>>>>>
>>>>> Possibly some hardware will verify inner checksums even though it
>>>>> could not understand the vxlan header completely. In this case we will
>>>>> probably drop the packet in the driver. Anyway, I would be in favor of
>>>>> simply presenting one knob, namely vxlan-offloading, to the user,
>>>>> instead of a knob for each version of vxlan.
>>>>>
>>>>> Unfortunately I couldn't get a definitive answer from the specs
>>>>> regarding the checksumming details.
>>>>>
>>>>> Bye,
>>>>> Hannes
>>>>
>>>> This is starting to sound like the same conversation we had on netdev
>>>> when the ndo_add_geneve_port was added.  One easy fix for guaranteeing
>>>> that we can perform the checksum offload is to just enable the outer
>>>> UDP checksum.  Then we can still perform GRO and use the outer UDP
>>>> source port for generating a hash.  If possible we should make this
>>>> the default for all new UDP based tunnels going forward simply because
>>>> it allows for backwards-compatibility with existing offloads.
>>>
>>> Yes, but this time we are in vxlan-versioning-world only. :)
>>
>> Right.  It leads me back into the original thought I had which was we
>> should be providing the UDP tunnel type via something like an
>> enumerated type so drivers could just opt in if they see a tunnel type
>> they recognize.  Having to add a function for each new tunnel type is
>> just silly.  Then we could have support for VXLAN be separate from
>> VXLAN-GPE without having to add a whole new set of functions.
>
> Partially agreed. I hope we can soon provide a patch that allows us to
> query the state of offloading of networking cards, a la ethtool. My idea
> was to make this depend on whether the ndo op for vxlan or geneve is
> non-NULL.
>
> Right now we still have the problem that the check whether the card
> actually supports udp offloading for vxlan or geneve does not depend on
> whether the ndo op is available but is done in the specific handler. This
> should be factored out, and I don't see a reason why we shouldn't convert
> to enum style.

Right, so we could probably have a bitmap like netdev->features that we
can use to identify the tunnel types supported; we may even want to
split it up for Rx and Tx.  For example, on the Intel NICs I can
offload everything for Tx, but on the i40e I can only offload VXLAN and
GENEVE while the rest of the tunnel types are currently unsupported.
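Just to sketch what I mean (all of the names below are made up for
illustration, nothing like this exists upstream today):

/* Hypothetical enum-based interface: one generic hook plus per-type
 * capability bits instead of a new ndo_add_*_port callback for every
 * protocol. */
enum udp_tunnel_offload_type {
	UDP_TUNNEL_OFFLOAD_VXLAN,
	UDP_TUNNEL_OFFLOAD_GENEVE,
	UDP_TUNNEL_OFFLOAD_VXLAN_GPE,
};

struct udp_tunnel_offload_info {
	enum udp_tunnel_offload_type type;
	sa_family_t sa_family;
	__be16 port;
};

/* Capability bits a driver could advertise, split for Rx and Tx much
 * like netdev->features, so e.g. i40e could claim VXLAN and GENEVE for
 * Rx but everything for Tx. */
#define UDP_TUNNEL_OFFLOAD_RX_VXLAN	BIT(0)
#define UDP_TUNNEL_OFFLOAD_RX_GENEVE	BIT(1)
#define UDP_TUNNEL_OFFLOAD_RX_VXLAN_GPE	BIT(2)

/* In net_device_ops, a single callback would replace
 * ndo_add_vxlan_port/ndo_add_geneve_port; drivers simply ignore types
 * whose capability bit they did not set. */
	void (*ndo_udp_tunnel_offload_add)(struct net_device *dev,
					   struct udp_tunnel_offload_info *ti);

That way adding a new tunnel type is just a new enum value and a new
capability bit rather than yet another ndo.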

>>> Hmm, it is a good question why we didn't do that already. Do you
>>> remember the reasons?
>>
>> Nope.  Although back when we last had this discussion LCO and GSO
>> partial didn't exist.  Now that they do, we should probably just take
>> advantage of what we have, since then we can get offloads automatically
>> for almost every new tunnel protocol, assuming the NICs aren't relying
>> on parsing the frames in hardware/firmware for such an offload.
>
> I fear that most NICs are actually relying on parsing the frames in
> hardware/firmware. Isn't the only kind of modern networking card with
> full CHECKSUM_PARTIAL support the e1000(e) in my laptop? :)

Actually for most of the Intel NICs we can support CHECKSUM_PARTIAL
because the way the hardware works is that it just needs to know where
the inner TCP or UDP header starts and whether it is TCP or UDP, and
then we can perform the checksum.  That was an update I pushed a few
revisions ago.  So in the case of those NICs they can offload pretty
much everything, including FOU and GUE, for Tx.  The only time they do
parsing is for the incoming Rx frames.
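To make that concrete, the Tx side boils down to something like this
(a minimal sketch, not taken from any real driver; struct fake_tx_desc
is invented for illustration, the skb fields are the real ones):

/* With CHECKSUM_PARTIAL the stack already tells us where the (inner)
 * L4 header starts and where to write the result, so the Tx path needs
 * no knowledge of the tunnel format at all. */
struct fake_tx_desc {
	u16 csum_start;		/* offset of inner L4 header in the frame */
	u16 csum_offset;	/* where to place the computed checksum */
};

static void fill_tx_csum_desc(const struct sk_buff *skb,
			      struct fake_tx_desc *desc)
{
	if (skb->ip_summed != CHECKSUM_PARTIAL)
		return;

	desc->csum_start  = skb_checksum_start_offset(skb);
	desc->csum_offset = desc->csum_start + skb->csum_offset;
}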

>> Also I know there are concerns about regressions on the older tunnel
>> protocols such as VXLAN and GENEVE since there is already support out
>> there in hardware and it could hurt performance to enable the outer
>> checksum if you are talking with an older kernel, but for newer
>> protocols such as VXLAN-GPE we can probably get away with just
>> changing the default early.  It is just a matter of finding a good
>> place to do that.  The problem there is that the way we specify
>> checksums for OVS right now is flag-based, if I recall, and I haven't
>> found a good way to deal with it without possibly introducing
>> regressions on older kernels for the VXLAN and GENEVE tunnels.
>
> We might be able to feed info from the offload back to the tunnel
> config, maybe. On Intel cards I don't see a reason why the outer
> checksum should hurt. Especially, if the outer checksum has been
> validated, we should be able to ignore the rest? If the opposite side is
> hurt by that, hmm, this could be handled by the OAM part of vxlan-gpe. ;)

Well, right now, since vxlan-gpe isn't really supported by any hardware,
we could probably get away with enabling checksums so that we get at
least some offload support via validation of the outer checksum.  The
issue in terms of offloads has more to do with talking to older
kernels that don't support segmentation offloads for tunnel types that
include checksums in the outer header.
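For reference, on the kernel side a tunnel socket can already ask for
outer UDP checksums when it is created; roughly like this using the
existing udp_tunnel helpers (making this the default for a VXLAN-GPE
port is just the idea being discussed here, not current behaviour):

#include <net/udp_tunnel.h>

static struct socket *create_gpe_sock(struct net *net, __be16 port)
{
	struct udp_port_cfg udp_conf;
	struct socket *sock;
	int err;

	memset(&udp_conf, 0, sizeof(udp_conf));
	udp_conf.family = AF_INET;
	udp_conf.local_udp_port = port;
	/* Transmit outer UDP checksums so NICs that only understand the
	 * outer header still have something they can validate on Rx. */
	udp_conf.use_udp_checksums = true;

	err = udp_sock_create(net, &udp_conf, &sock);
	if (err < 0)
		return ERR_PTR(err);
	return sock;
}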

>>>> Really I wonder if we shouldn't just start pulling out all the support
>>>> for the ndo_add_vxlan/geneve_port code anyway, or at least start
>>>> forcing it to include more data such as an endpoint IP/IPv6 address
>>>
>>> As far as I know Intel cards don't support adding UDP tuples but only
>>> simple ports to the list of udp protocol offloads.
>>
>> Right.  That is one of the ones I am kind of grumbling about.  Really
>> it is a poor design that is prone to easily triggering false positives.
>
> Totally agreed, but I fear we have to deal with the hardware at hand. :/

Agreed, but what are we really getting by adding the port value?  In
the case of the Intel hardware it isn't much, as I can take advantage
of checksum offload if I just turn on the outer checksums.  Then I don't
even need to enable the Rx offloads based on port number, and I don't
pollute the VFs by offloading traffic passing over them that may or may
not be meant to be offloaded.

>>> On the other side, I think the current way we do it is quite
>>> okayish. At the same time as we establish the udp offload we
>>> unconditionally add the socket to the wildcard socket table (local_ip ==
>>> 0). At least no other socket can bind to such a socket (maybe
>>> IP_FREEBIND etc. are an exception here).
>>
>> Right.  It is the "ish" part that bugs me.  I don't like us doing
>> things like leaking stuff into namespaces across PF/VF boundaries.
>> With a little poking and prodding it becomes pretty easy for a VF to
>> figure out how the PF is configured in terms of what ports are being
>> offloaded.  Worse yet, the hardware may have exploits that can be
>> targeted based on offloading certain protocols.
>
> I fully agree, stateless offloads definitely win over all the approaches
> that most cards have implemented so far. I would be very happy if we
> could get away from those specific tunneling offload handlers.
>
>>>> and tunnel type.  There end up being a number of cases where the
>>>> offload could end up asserting itself where it isn't actually wanted
>>>> such as a mixed environment where the PF is setting up the offloads
>>>> via ndo_add_vxlan_port and then trying to apply that to VFs or other
>>>> PFs on the same device which aren't operating in the same network
>>>> namespace.  The worst case ends up being that you have UDP checksum
>>>> offloads stripped for flows that would have otherwise been reporting
>>>> valid checksum because the interfaces think they have tunnels with an
>>>> invalid inner checksum, or no L4 checksum at all.
>>>
>>> Okay, if the networking card mixes up offloads across vf interfaces we
>>> certainly can't trust them. Is this the case or did I misunderstand your
>>> example?
>>
>> I think you got the idea.  One of my gripes is that if I use an
>> i40e-based NIC, for instance, any offloads I configure on the PF affect
>> the VFs.  That is fine when they are all in the same namespace.  If I
>> move a VF into a guest or a separate namespace, though, that shouldn't
>> happen, but it does because many of these offloads are per device.  The
>> same problem occurs with fm10k, only it is worse in that case because
>> the port configuration is global to the switch, if I recall correctly.
>
> $#!%$?! :)
>
> That means it also affects VMs by design? This is absolutely
> unacceptable. For fm10k, e.g., if I configure a vxlan offloaded udp port
> on a vf, will it propagate to the pf? This can't be used in production
> then, I fear.

I don't think the VF can request a VXLAN offload port, only the PF
can.  If I recall correctly I think I see the same behavior for the
mlx4 and mlx5 drivers as well.  Basically, enabling the offloads on the
PF ends up enabling them for the VFs, since they treat the VF's traffic
as though it is a part of the PF.

>>> Otherwise regarding the vxlan offloading config on the cards, I think we
>>> are fine right now. Even during a namespace change of a device we
>>> correctly flush the ports from the firmware.
>>
>> Right.  If these were single interface cards I wouldn't really have an
>> issue, however the fact is most of them aren't.  They almost all
>> support multiple ports and SR-IOV and in many cases the firmware makes
>> these offload ports global to the device.  What we really should have
>> is something like an N-tuple type filter that uses a combination of
>> destination MAC address, IP, destination UDP port, and maybe even
>> physical port on the device that could be used to identify where we
>> want the offloads and where we don't.  Right now it seems like most of
>> the vendors are playing fast and loose with the offloads and applying
>> them to Tx and Rx which makes a real mess of things.  On top of that
>> many only support one port which makes things even worse because then
>> we have a mix of traffic where some is offloaded and some isn't.
>
> True. If I recall correctly, only i40e has 16 offloaded ports, most
> others just one. Can you imagine networking cards that offload the
> standardized vxlan-gpe ports by default?
>
> Okay, with outer checksums we probably can circumvent most of the
> problems with those protocol specific offloads.
>
> The remaining problem regarding offloads would be that, by default,
> without the special offloading rule the vxlan stream will only be
> processed on a single core, as we tell network cards not to hash the
> udp ports into rxhash. That hurts a lot in the case of vxlan, where we
> base flow identification on the source port when no offloading is
> available.

Most NICs offer the option of hashing on UDP ports.  In the case of
the Intel NICs I know you can turn on UDP port hashing by using
ethtool and setting UDP hashing to be enabled via "ethtool -N <iface>
rx-flow-hash udp4 sdfn".  You can do the same thing using "udp6" for
IPv6 based tunnels.  That is usually enough to cover all the bases,
and the fact is not too many people are passing fragmented UDP
traffic; as long as that is the case, enabling UDP hashing isn't too
big of a deal.
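For reference, the same thing the ethtool command does can be requested
programmatically through the ETHTOOL_SRXFH ioctl; a small userspace
sketch (error handling kept minimal):

/* What "ethtool -N <iface> rx-flow-hash udp4 sdfn" requests: include
 * IP src/dst and UDP src/dst port in the Rx hash. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int enable_udp4_rx_hashing(const char *ifname)
{
	struct ethtool_rxnfc nfc = {
		.cmd = ETHTOOL_SRXFH,
		.flow_type = UDP_V4_FLOW,
		/* s, d, f, n */
		.data = RXH_IP_SRC | RXH_IP_DST | RXH_L4_B_0_1 | RXH_L4_B_2_3,
	};
	struct ifreq ifr;
	int fd, ret;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&nfc;

	ret = ioctl(fd, SIOCETHTOOL, &ifr);
	if (ret < 0)
		perror("ETHTOOL_SRXFH");
	close(fd);
	return ret;
}

Doing the same with UDP_V6_FLOW covers the "udp6" case.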

- Alex


