[ovs-dev] WFP and tunneling and packet fragments

Samuel Ghinet sghinet at cloudbasesolutions.com
Tue Aug 5 16:49:40 UTC 2014


Eitan,

What I'm trying to say is:
1. This "assumption" cannot be relied on in a production environment.
2. You have the functionality to do TCP segmentation, but I don't think you have the functionality to do IPv4 fragmentation, nor to originate ICMPv4 / ICMPv6 errors.
3. I've just added a task. I believe we both agree that it is a task to do :)

Sam
________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Tuesday, August 05, 2014 7:17 PM
To: Samuel Ghinet
Cc: dev at openvswitch.org
Subject: RE: WFP and tunneling and packet fragments

Sam,
The basic assumption is that the combined length of the inner packet and the encapsulation header is smaller than the host NIC MTU. We have the infrastructure in the driver to split the packet, and we can go ahead and implement your suggestion once we add more common cases (multiple buffers in NBL, macro flows, etc.).
Thanks,
Eitan
-----Original Message-----
From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
Sent: Tuesday, August 05, 2014 7:52 AM
To: Eitan Eliahu
Cc: dev at openvswitch.org
Subject: RE: WFP and tunneling and packet fragments

Hi Eitan,

The approach is different in that it means either "fragmenting" the packets so that after we encapsulate them they are of acceptable size (i.e. <= MTU of the external NIC), or convincing the VM to lower its MTU for the destination (path MTU discovery) to our given value. It does not change the MTU stored in the VM's NIC. So in my case it doesn't matter how big the MTU is set in the VM. Or am I misunderstanding what you are saying?

MSS setting is an optimization.

Sam
________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Tuesday, August 05, 2014 5:12 PM
To: Samuel Ghinet
Cc: dev at openvswitch.org
Subject: RE: WFP and tunneling and packet fragments

Hi Sam,
I'm wondering how this case is different from the case when IPsec is being used.
I am sure that at the least we can control the MTU of the host NIC and increase it to accommodate the encapsulation overhead (but still, nothing prevents the VM MTU from being increased even further). The MSS setting cannot be used for non-TCP packets.
We currently do take care of the case of TCP packets which are larger than the MSS size (OvsTcpSegmentNBL).
Thanks,
Eitan



-----Original Message-----
From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
Sent: Tuesday, August 05, 2014 6:23 AM
To: Eitan Eliahu
Cc: dev at openvswitch.org
Subject: RE: WFP and tunneling and packet fragments

Hello Eitan,

I personally do not find that a viable solution: I mean, I don't think we can request the clients (i.e. those using the OSes from within the VMs) to change the MTU in each of their OSes. Unless there is a method to automate this from within the hypervisor, and to not allow the user of the VM to mess things up from their OS, I don't think it's a good solution at all.

The approach I had taken for our project was to do IPv4 fragmentation in code.
The issue is actually a bit more complex, and I dealt with it in the following situations (except for one issue I present at the bottom):
a) IPv4 packet too big if we add the encap bytes, and the DF flag is not set for the IPv4 (payload) packet => fragment the buffer, then encapsulate each fragment
b) IPv4 packet too big if we add the encap bytes, but the DF flag is set for the IPv4 (payload) packet => originate an ICMPv4 error "packet too big" to the VM, specifying an MTU value that takes the required encap bytes into account
c) IPv6 packet too big if we add the encap bytes => originate an ICMPv6 error packet to the VM, specifying an MTU value that takes the required encap bytes into account
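
Cases a) to c) can be written down as one small decision function. This is only a sketch of the logic described above, not code from either driver: the names `Action` and `decide` are made up for illustration, and the 60-byte overhead is simply the figure used in the example further down.

```python
from enum import Enum

ENCAP_OVERHEAD = 60  # assumed value, taken from the example in this mail


class Action(Enum):
    ENCAPSULATE = "encapsulate as-is"
    FRAGMENT_THEN_ENCAP = "fragment, then encapsulate each fragment"
    ICMP4_ERROR = "originate ICMPv4 error to the VM"
    ICMP6_ERROR = "originate ICMPv6 error to the VM"


def decide(ip_version, packet_len, df_set, external_mtu, encap=ENCAP_OVERHEAD):
    """Return (action, mtu_to_advertise) for an inner (payload) packet.

    mtu_to_advertise is None unless an ICMP error is originated.
    """
    if packet_len + encap <= external_mtu:
        return Action.ENCAPSULATE, None

    # The MTU reported to the VM must leave room for the encap bytes.
    advertised = external_mtu - encap

    if ip_version == 6:
        # Case c): ICMPv6 Packet Too Big (type 2).
        return Action.ICMP6_ERROR, advertised
    if df_set:
        # Case b): ICMPv4 Destination Unreachable, code 4
        # (fragmentation needed and DF set).
        return Action.ICMP4_ERROR, advertised
    # Case a): IPv4 without DF can be fragmented by us.
    return Action.FRAGMENT_THEN_ENCAP, None
```

For example, `decide(4, 1500, True, 1500)` gives the ICMPv4 case with an advertised MTU of 1440.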

Also, there is one more situation that I had handled: if the payload IP packet had DF set, and
VM1 on Hypervisor 1: MTU = 1500
VM2 on Hypervisor 2: MTU = 1480
encap bytes = 60

The problem that had happened was that:
- Hypervisor1 took a packet of 1440 bytes and added 60 bytes => packet size = 1500
- an ICMPv4 error was coming back from VM2, from Hypervisor2, to Hypervisor1 (max MTU = 1480)
- packet decapsulation: it says "packet too big, max mtu = 1480"
- send ICMPv4 error to VM1 (switch forwarding)
- VM1 retransmits, but uses the packet size = 1440 (1440 < 1480, so it thinks it's ok)
- The packet comes again to Hypervisor1 to be encapsulated, and after encapsulation it has THE SAME SIZE as before, when it was a problem: 1500
- Basically the ICMPv4 error reporting solved nothing, and the packet did not reach the destination in the end.

I had dealt with this by intercepting ICMPv4 errors coming from GRE (GRE only, I didn't get to do it for VXLAN), so that when one said "max MTU = 1480", I would change the packet and make VM1 receive "max MTU = 1420" (1480 - 60). This solved the problem.
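
The arithmetic of that fix, as a minimal sketch. The function names are hypothetical and the numbers are the ones from the example above; the real work of parsing and rewriting the ICMP packet is not shown.

```python
def adjust_reported_mtu(reported_mtu, encap_overhead):
    """MTU to show the VM: what the path allows minus the tunnel overhead."""
    adjusted = reported_mtu - encap_overhead
    if adjusted <= 0:
        raise ValueError("encapsulation overhead exceeds the reported MTU")
    return adjusted


def encapsulated_size(inner_len, encap_overhead):
    """Size of the packet on the wire after the encap bytes are added."""
    return inner_len + encap_overhead


ENCAP = 60
PATH_MTU = 1480

# Without the rewrite VM1 keeps sending 1440 bytes, which does not fit.
assert encapsulated_size(1440, ENCAP) > PATH_MTU

# With the rewrite VM1 is told 1420, and 1420 + 60 fits exactly.
vm_mtu = adjust_reported_mtu(PATH_MTU, ENCAP)
assert encapsulated_size(vm_mtu, ENCAP) <= PATH_MTU
```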

Now, the problem that I did not handle, or better said, did not handle properly :-P The mistake I had made was that I did fragmentation first and encapsulation after, when I should have done it the other way around. I also had not handled received fragmented encapsulated packets. For simple cases it had worked ok because the packets were being encapsulated after fragmentation, so packets did not normally reach the other side as fragments in my test cases :). However, if we do encapsulation first and fragmentation after, and deal with received fragmented encapsulated packets (e.g. via WFP), we should be set.
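
To make "encapsulation first, fragmentation after" concrete, here is a rough sketch of how the outer IPv4 payload would be cut into fragments. `plan_fragments` is a made-up helper, and it assumes a 20-byte outer IPv4 header with no options. The only hard rule it encodes is the standard one that every fragment except the last carries a multiple of 8 data bytes.

```python
def plan_fragments(payload_len, mtu, ip_header_len=20):
    """Return a list of (offset_bytes, data_len, more_fragments) tuples.

    payload_len is the number of bytes that follow the outer IPv4 header.
    """
    # Largest data size per fragment, rounded down to a multiple of 8.
    max_data = (mtu - ip_header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any fragment data")

    fragments = []
    offset = 0
    while offset < payload_len:
        data_len = min(max_data, payload_len - offset)
        more = offset + data_len < payload_len  # MF flag
        fragments.append((offset, data_len, more))
        offset += data_len
    return fragments
```

If the 60 encap bytes are assumed to include the 20-byte outer IPv4 header, a 1500-byte inner packet becomes 1560 bytes on the wire, i.e. 1540 bytes of outer payload, and `plan_fragments(1540, 1500)` yields two fragments of 1480 and 60 data bytes.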

Also, one more thing I had taken into account for my project was the Maximum Segment Size for TCP when used in tunneling: Basically, why do the fragmentation and all if we can tell the TCP of the other side what max tcp segment size to use?
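
A sketch of the MSS arithmetic, under assumptions the mail does not spell out: the inner packet is IPv4 with a 20-byte header and a 20-byte TCP header (no options), and the MSS is lowered by rewriting the value offered in the SYN. `clamped_mss` is a hypothetical name.

```python
IPV4_HEADER_MIN = 20  # IPv4 header without options
TCP_HEADER_MIN = 20   # TCP header without options


def clamped_mss(external_mtu, encap_overhead, offered_mss):
    """Largest MSS that still fits in the external MTU once encapsulated."""
    ceiling = external_mtu - encap_overhead - IPV4_HEADER_MIN - TCP_HEADER_MIN
    if ceiling <= 0:
        raise ValueError("no room left for TCP payload")
    # Never raise the MSS the endpoint offered, only lower it.
    return min(offered_mss, ceiling)
```

With an external MTU of 1500 and 60 encap bytes, an offered MSS of 1460 would be lowered to 1400. As Eitan notes in his reply, this only helps TCP traffic.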

Sam

________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Tuesday, August 05, 2014 3:25 PM
To: Samuel Ghinet
Cc: dev at openvswitch.org
Subject: RE: WFP and tunneling and packet fragments

Yes.
Eitan

-----Original Message-----
From: Samuel Ghinet [mailto:sghinet at cloudbasesolutions.com]
Sent: Tuesday, August 05, 2014 5:12 AM
To: Eitan Eliahu
Cc: dev at openvswitch.org
Subject: RE: WFP and tunneling and packet fragments

Thanks Eitan!

Regarding point 2: do you mean to set the MTU from within the VM?
As I remember, I had found no PowerShell cmdlet that changes the MTU of a vNIC.

Sam
________________________________________
From: Eitan Eliahu [eliahue at vmware.com]
Sent: Monday, August 04, 2014 5:17 AM
To: Samuel Ghinet
Subject: RE: WFP and tunneling and packet fragments

Sam,
Here are some answers for your comments:
[1] WFP is used for Rx only and as you mentioned for fragmented packets only.
[2] Setting the VM MTU to accommodate the tunnel header is the correct configuration.
[3] We need to match the external packet in the flow table as other VXLAN packets could be received. (The external port is set to promiscuous mode by the VM switch). (There might be other reasons as well).
Thanks,
Eitan

-----Original Message-----
From: dev [mailto:dev-bounces at openvswitch.org] On Behalf Of Samuel Ghinet
Sent: Sunday, August 03, 2014 11:53 AM
To: dev at openvswitch.org
Subject: [ovs-dev] WFP and tunneling and packet fragments

Hello guys,

I have studied in a bit more detail the part of your code that deals with tunneling and WFP.

A summary of the flow, as I understand it:

ON RECEIVE (from external):
A. If there's a VXLAN encapsulated packet coming from outside, one that is NOT fragmented, the flow is like this:
1. Extract packet info (i.e. flow key)
2. find flow
3. if flow found => out to port X (or to multiple ports)
3.1. else => send to userspace: a flow will be created to handle the vxlan encapsulated packets that are NOT fragmented (but we'll later need to make a new flow for vxlan encapsulated packets that are fragmented)

For the case we have a flow, we output to a port X (which should be the manag os nic). After being received by the manag os, the WFP will come in and call the registered callout / callback. This will decapsulate the vxlan packet, find a flow for it, and then execute the actions on the decapsulated packet (e.g. output to port Y).

The problem I find here is that the search for a flow is done twice.

B. If there's a VXLAN encapsulated packet coming from outside, one that IS fragmented, the flow is similar:
1. Extract packet info (i.e. flow key)
2. find flow
3. if flow found => out to port X (or to multiple ports)
3.1. else => send to userspace: a flow will be created to handle the vxlan encapsulated packets that are fragmented (but we'll later need to make a new flow for vxlan encapsulated packets that are not fragmented)

For the case we have a flow, we output to a port X (which should be the manag os nic). After being received by the manag os, the WFP will come in, reassemble the fragmented vxlan packets, and then call the registered callout / callback. This will decapsulate the vxlan packet, find a flow for it, and then execute the actions on the decapsulated packet (e.g. output to port Y).

Again we have two searches for flow for the same packet.

ON SEND (to external / VXLAN):
There are three situations, as I see them:
1. the packet is small, and thus not an LSO either => encapsulate and output, all is perfect

2. the packet is LSO. The only case in which I found this in my tests (as I remember) was when the packet was coming from the manag os. If LSO is enabled in a VM, then, when reaching the switch, the packet is already fragmented and no longer has LSO (as NBL info).
Regarding LSO packets coming from manag os: As I remember, packets can be LSO here.
However, I believe there is no practical case in which we need to do "if in port = manag os => out to VXLAN".
I mean, tunneling is used for outputting packets from VMs only, as I understand.

3. The packet is not an LSO, but packet size + encap additional bytes > MTU (e.g. packet size = 1500).
Here we have two cases:
3.1. The packet is coming from manag os: In this case, if we use netsh to lower the MTU below 1500 (i.e. taking into account the max encap additional bytes), then, when a packet needs to be encapsulated, the MTU in the driver will be 1500, and the packet will be, say, 1420 instead of 1500. So it will work ok.
3.2. The packet is coming from a VM: In this case, as I had tested, lowering the MTU in the manag os below 1500 did not solve the problem, as the packets coming from that VM had size = 1500, so after being encapsulated they were too big.
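
The three send-side situations, reduced to a small classifier. This is only an illustration of the list above: `classify_send` and its return strings are made up, and the LSO flag stands in for whatever the NBL info would carry.

```python
def classify_send(packet_len, is_lso, mtu, encap_overhead):
    """Classify a packet that is about to be sent to the tunnel port."""
    if is_lso:
        # Situation 2: large send offload, needs segmentation.
        return "lso"
    if packet_len + encap_overhead <= mtu:
        # Situation 1: small enough, encapsulate and output.
        return "fits"
    # Situation 3: not LSO, but too big once the encap bytes are added.
    return "too_big"


# 3.1: manag os with a lowered MTU sends 1420, which fits after +60.
assert classify_send(1420, False, 1500, 60) == "fits"
# 3.2: a VM still sends 1500, which does not.
assert classify_send(1500, False, 1500, 60) == "too_big"
```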

I understand there is no WFP part for the sending of packets (to external), and I actually believe there would be no place for WFP on send to external, since WFP callouts are called at a higher level in the driver stack than our driver.

So I've got several questions:
1. For receive (from external):
1.1. if we detect that the packet is an encapsulated packet (e.g. VXLAN) and also fragmented, would it not be better to match the flow regardless of the fragment type?
1.2. Could there be any method to avoid the double flow lookup for the received encapsulated packets? A way to do this, I'm thinking, would be to defer the flow lookup when the packet is encapsulated, and simply output it to the manag os port (that's where it must go anyway), with the flow lookup done only in the WFP callback.
But I'm not sure... do only manag os packets reach the WFP callback, or packets from VMs as well?

2. For send (to external / VXLAN):
2.1. Do you deal with non-LSO 1500-byte packets that arrive from VMs and must be sent to VXLAN?
2.2. I personally believe there is no practical scenario for sending packets coming from the manag os to the VXLAN port. If you believe otherwise, please let me know.
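
Question 1.2 in toy form: a model that only counts lookups, to show what would be saved by deferring. Everything here (`FlowTable`, the two receive functions, the string keys) is hypothetical and says nothing about how the driver is actually structured.

```python
class FlowTable:
    """Toy flow table that counts how many lookups were made."""

    def __init__(self):
        self.flows = {}
        self.lookups = 0

    def lookup(self, key):
        self.lookups += 1
        return self.flows.get(key)


def receive_as_described(table, outer_key, inner_key):
    # First lookup, on the still-encapsulated packet.
    table.lookup(outer_key)
    # ... output to manag os port, WFP callout decapsulates ...
    # Second lookup, on the decapsulated packet.
    return table.lookup(inner_key)


def receive_deferred(table, inner_key):
    # Encapsulated packet goes straight to the manag os port (no lookup);
    # the only lookup happens in the WFP callback after decapsulation.
    return table.lookup(inner_key)
```

In this model `receive_as_described` costs two lookups per packet and `receive_deferred` costs one. It leaves out the case where the outer packet has to be matched for other reasons.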

Thanks!
Sam