[ovs-dev] [PATCH] VxLAN-gpe implementation

Hannes Frederic Sowa hannes at redhat.com
Thu Jun 9 21:06:46 UTC 2016


On 09.06.2016 22:35, Alexander Duyck wrote:
> On Thu, Jun 9, 2016 at 12:23 PM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
>> On 09.06.2016 18:14, Alexander Duyck wrote:
>>> On Thu, Jun 9, 2016 at 3:57 AM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
>>>> On 09.06.2016 04:33, Alexander Duyck wrote:
>>>>> On Wed, Jun 8, 2016 at 3:20 PM, Hannes Frederic Sowa <hannes at redhat.com> wrote:
>>>>>> The remaining problem regarding offloads would be that, by default,
>>>>>> we end up in a situation where, without the special offloading rule,
>>>>>> the vxlan stream is only processed on a single core, because we tell
>>>>>> network cards not to hash the UDP ports into rxhash. That hurts a lot
>>>>>> in the case of vxlan, where we bias flow identification on the source
>>>>>> port when no offloading is available.
>>>>>
>>>>> Most NICs offer the option of hashing on UDP ports.  In the case of
>>>>> the Intel NICs I know you can turn on UDP port hashing by using
>>>>> ethtool and setting UDP hashing to be enabled via "ethtool -N <iface>
>>>>> udp4 sdfn".  You can do the same thing using "udp6" for IPv6 based
>>>>> tunnels.  That is usually enough to cover all the bases, and the fact
>>>>> is that not many people pass fragmented UDP traffic; as long as that
>>>>> is the case, enabling UDP hashing isn't too big of a deal.
>>>>
>>>> True, though I am wondering how safe it is, given the reordering
>>>> effects it has on UDP and thus on other non-vxlan management protocols
>>>> on the hypervisors.
>>>>
>>>> Back when UDP port hashing was disabled by default, the message from
>>>> upstream was pretty clear, and I don't think anything should change
>>>> here for the default case.
>>>>
>>>> Are the port hashing features also global or tweakable per VF?
>>>
>>> That one depends on the device.  I think in the case of some of the
>>> newer NICs the VFs support separate RSS tables.  The ones that have
>>> shared RSS tables typically share how they compute the hashes.  So for
>>> example with igb and ixgbe you get a shared hash computation where the
>>> PF will impact the VFs.  One easy fix for the reordering though is to
>>> simply disable RSS on the VFs which in many cases will likely occur
>>> anyway unless the guest has multiple VCPUs.
>>
>> Sounds like a bad limitation. I assume multiple VCPUs are commonly used
>> in VMs (I certainly use them myself).
> 
> Right.  So do I.  However many VFs are still greatly limited in the
> number of queues they can support.

Ok, interesting, thanks!
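
For reference, my understanding is that the "ethtool -N <iface> udp4 sdfn"
knob mentioned above maps onto the ETHTOOL_SRXFH ioctl. A minimal,
untested userspace sketch (error handling and the choice of interface
are of course up to the admin):

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Rough equivalent of "ethtool -N <iface> udp4 sdfn": include IP
 * src/dst and both UDP port halves in the RSS hash for IPv4/UDP. */
static int enable_udp4_rss_hash(const char *iface)
{
    struct ethtool_rxnfc nfc;
    struct ifreq ifr;
    int fd, ret;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&nfc, 0, sizeof(nfc));
    nfc.cmd = ETHTOOL_SRXFH;
    nfc.flow_type = UDP_V4_FLOW;
    /* s, d, f, n: IP source, IP dest, L4 bytes 0-1 (source port),
     * L4 bytes 2-3 (destination port) */
    nfc.data = RXH_IP_SRC | RXH_IP_DST | RXH_L4_B_0_1 | RXH_L4_B_2_3;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, iface, IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&nfc;

    ret = ioctl(fd, SIOCETHTOOL, &ifr);
    close(fd);
    return ret;
}

The same call with UDP_V6_FLOW should cover the "udp6" case for IPv6
based tunnels.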

> Also in terms of impact on the VFs having the UDP hashing enabled for
> RSS is only really an issue if you have a mix of fragmented and
> non-fragmented traffic for the same flow.
> 
>> Hypothetically, for IPv4 vxlan in a datacenter, couldn't we randomize
>> some of the IPv4 address bits (e.g. the lower 2 bytes) and isolate that
>> range properly, using it purely as the transport for vxlan? But that is
>> becoming ugly...
> 
> Mangling the address would probably be even worse.

If the data plane only uses 10.0.0.0/16, and this subnet is strictly
blocked from leaving the local ethernet segment and is only used to
transport vxlan traffic, I don't see a reason why we couldn't use the
lower 2 bytes as a flow label. ;)

Not that I like this approach at all... ;)

>> We break ICMP already with UDP source port randomization.
> 
> I hadn't thought about that before.  Is that also the reason why we
> don't have any PMTU discovery for UDP tunnels?

Yes, the packet that caused the packet-too-big error is reflected back
to the IP source inside an ICMP message, with the headers of the
original packet as payload. The UDP code in Linux checks the payload of
the ICMP packet to find the ports and looks up the socket that sent the
original packet. If no socket is found, the ICMP packet-too-big error
is dropped.

This prevents e.g. path MTU spoofing attacks. One example:

An attacker forges path MTU errors towards DNS servers, which then
start to fragment their responses along the way. The DNS header
contains the nonce used to validate that a packet is the correct answer
to the question. The rest of the entropy stems from the random source
port the DNS resolver should have used. If you can force fragmentation
along the way, you only need to guess the 16-bit IP identification
field instead of the 32 bits of nonce + source port to spoof a fragment
and get it appended to a DNS response, thus enabling something like the
Kaminsky attack.
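
(To put numbers on that: the IP ID is 16 bits, so 2^16 = 65536 guesses,
versus 2^32, roughly 4.3 billion, combinations of nonce plus source
port without fragmentation.)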

Also note that other operating systems, like Solaris, FreeBSD, and I
think also Windows (at least some versions ago; I think I tested
Vista), didn't do path MTU discovery for UDP at all!

>>> In the case of ixgbe it just occurred to me that there is also an
>>> option of applying flow director rules and it would be possible to
>>> just add a rule for each CPU so that you split the UDP source port
>>> space up based on something like the lower 4 bits assuming 16 queues
>>> for instance.
>>
>> Deploying that will also be terrible, since it depends on the hardware
>> in use.
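
Just to illustrate the idea with the lower 4 bits and 16 queues: such
rules can be inserted through the same ethtool ioctl plumbing as in the
RSS sketch above, via ETHTOOL_SRXCLSRLINS. Untested, and it assumes the
driver accepts a partial source-port mask (as far as I understand, a
set mask bit in struct ethtool_rx_flow_spec means "compare this bit"):

#include <arpa/inet.h>   /* htons; plus the includes from the sketch above */

/* One rule per queue: steer flows by the low 4 bits of the UDP source
 * port.  Reuses the socket/ifreq setup from the sketch further up. */
static int add_src_port_rules(int fd, struct ifreq *ifr, int queues)
{
    struct ethtool_rxnfc nfc;
    int i;

    for (i = 0; i < queues; i++) {
        memset(&nfc, 0, sizeof(nfc));
        nfc.cmd = ETHTOOL_SRXCLSRLINS;
        nfc.fs.flow_type = UDP_V4_FLOW;
        nfc.fs.h_u.udp_ip4_spec.psrc = htons(i);      /* port value   */
        nfc.fs.m_u.udp_ip4_spec.psrc = htons(0x000f); /* low 4 bits   */
        nfc.fs.ring_cookie = i;                       /* target queue */
        nfc.fs.location = i;                          /* rule slot    */

        ifr->ifr_data = (char *)&nfc;
        if (ioctl(fd, SIOCETHTOOL, ifr) < 0)
            return -1;
    }
    return 0;
}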
> 
> Right.  I never said it was going to be completely pretty.  Still it
> is no worse than the kind of stuff we already have going on since we
> are applying many of these offloads per device and not isolating the
> VFs from the PF.
> 
> I find all of this to be much more palatable than stuff like remote
> checksum offload and the like in order to try and make this work.
> With the current igb, ixgbe, or i40e driver and an outer checksum
> being present I can offload just about any of the UDP tunnel types
> supported by the kernel including FOU, GUE, and VXLAN-GPE and get full
> hardware offloads for segmentation and Rx checksum.

Can the hypervisor fully control all VFs, even if they are already
attached to VMs? Then the drivers could basically always do the right
thing, since the hypervisor knows whether a resource is shared or not.

I agree; I would like to have local code handle that automatically
rather than introduce more bytes into protocol headers. Bloating up
network headers is not the right way to go.

Bye,
Hannes



