[ovs-dev] [PATCH] openvswitch: perform refragmentation for packets which pass through conntrack

Joe Stringer joe at cilium.io
Mon Mar 22 00:24:59 UTC 2021

Hey Aaron, long time no chat :)

On Fri, Mar 19, 2021 at 1:43 PM Aaron Conole <aconole at redhat.com> wrote:
> When a user instructs a flow pipeline to perform connection tracking,
> there is an implicit L3 operation that occurs - namely the IP fragments
> are reassembled and then processed as a single unit.  After this, new
> fragments are generated and then transmitted, with the hint that they
> should be fragmented along the max rx unit boundary.  In general, this
> behavior works well to forward packets along when the MTUs are congruent
> across the datapath.
> However, if using a protocol such as UDP on a network with mismatching
> MTUs, it is possible that the refragmentation will still produce an
> invalid fragment, and that fragmented packet will not be delivered.
> Such a case shouldn't happen because the user explicitly requested a
> layer 3+4 function (conntrack), and that function generates new fragments,
> so we should perform the needed actions in that case (namely, refragment
> IPv4 along a correct boundary, or send a packet too big in the IPv6 case).
> Additionally, introduce a test suite for openvswitch with a test case
> that ensures this MTU behavior, with the expectation that new tests are
> added when needed.
> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
> Signed-off-by: Aaron Conole <aconole at redhat.com>
> ---
> NOTE: checkpatch reports a whitespace error with the openvswitch.sh
>       script - this is due to using tab as the IFS value.  This part
>       of the script was copied from
>       tools/testing/selftests/net/pmtu.sh so I think should be
>       permissible.
>  net/openvswitch/actions.c                  |   2 +-
>  tools/testing/selftests/net/.gitignore     |   1 +
>  tools/testing/selftests/net/Makefile       |   1 +
>  tools/testing/selftests/net/openvswitch.sh | 394 +++++++++++++++++++++
>  4 files changed, 397 insertions(+), 1 deletion(-)
>  create mode 100755 tools/testing/selftests/net/openvswitch.sh
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 92a0b67b2728..d858ea580e43 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -890,7 +890,7 @@ static void do_output(struct datapath *dp, struct sk_buff *skb, int out_port,
>                 if (likely(!mru ||
>                            (skb->len <= mru + vport->dev->hard_header_len))) {
>                         ovs_vport_send(vport, skb, ovs_key_mac_proto(key));
> -               } else if (mru <= vport->dev->mtu) {
> +               } else if (mru) {
>                         struct net *net = read_pnet(&dp->net);
>                         ovs_fragment(net, vport, skb, mru, key);

I thought about this for a while. For a bit of context, my
recollection is that in the initial design, there was an attempt to
minimize the set of assumptions around L3 behaviour and despite
performing this pseudo-L3 action of connection tracking, attempt a
"bump-in-the-wire" approach where OVS is serving as an L2 switch and
if you wanted L3 features, you need to build them on top or explicitly
define that you're looking for L3 semantics. In this case, you're
interpreting that the combination of the conntrack action + an output
action implies that L3 routing is being performed. Hence, OVS should
act like a router and either refragment or generate ICMP PTB in the
case where MTU differs. According to the flow table, the rest of the
routing functionality (MAC handling for instance) may or may not have
been performed at this point, but we basically leave that up to the
SDN controller to implement the right behaviour. In relation to this
particular check, the idea was to retain the original geometry of the
packet such that it's as though there were no functionality performed
in the middle at all. OVS happened to do connection tracking (which
implicitly involved queueing fragments), but if you treat it as an
opaque box, you have ports connected and OVS is simply performing
forwarding between the ports.

One of the related implications is the contrast between what happens
in this case if you have a conntrack action injected or not when
outputting to another port. If you didn't put a connection tracking
action into the flows when redirecting here, then there would be no
defragmentation or refragmentation. In that case, OVS is just
attempting to forward to another device and if the MTU check fails,
then bad luck, packets will be dropped. Now, with the interpretation
in this patch, it seems like we're trying to say that, well, actually,
if the controller injects a connection tracking action, then the
controller implicitly switches OVS into a sort of half-L3 mode for
this particular flow. This makes the behaviour a bit inconsistent.

Another thought that occurs here is that if you have three interfaces
attached to the switch, say one with MTU 1500 and two with MTU 1450,
and the OVS flows are configured to conntrack and clone the packets
from the higher-MTU interface to the lower-MTU interfaces. If you
receive larger IP fragments on the first interface and attempt to
forward on to the other interfaces, should all interfaces generate an
ICMPv6 PTB? That doesn't seem quite right, especially if one of those
ports is used for mirroring the traffic for operational reasons while
the other path is part of the actual routing path for the traffic.
You'd end up with duplicate PTB messages for the same outbound
request. If I read right, this would also not be able to be controlled
by the OVS controller because when we call into ip6_fragment() and hit
the MTU-handling path, it will automatically take over and generate
the ICMP response out the source interface, without any reference to
the OVS flow table. This seems like it's further breaking the model
where instead of OVS being a purely programmable L2-like flow
match+actions pipeline, now depending on the specific actions you
inject (in particular combinations), you get some bits of the L3
functionality. But for full L3 functionality, the controller still
needs to handle the rest through the correct set of actions in the

Looking at the tree, it seems like this problem can be solved in
userspace without further kernel changes by using
OVS_ACTION_ATTR_CHECK_PKT_LEN, see commit 4d5ec89fc8d1 ("net:
openvswitch: Add a new action check_pkt_len"). It even explicitly says
"The main use case for adding this action is to solve the packet drops
because of MTU mismatch in OVN virtual networking solution.". Have you
tried using this approach?

> diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
> index 61ae899cfc17..d4d7487833be 100644
> --- a/tools/testing/selftests/net/.gitignore
> +++ b/tools/testing/selftests/net/.gitignore
> @@ -30,3 +30,4 @@ hwtstamp_config
>  rxtimestamp
>  timestamping
>  txtimestamp
> +test_mismatched_mtu_with_conntrack
> diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
> index 25f198bec0b2..dc9b556f86fd 100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile

Neat to see some bootstrapping of in-tree OVS testing. I'd probably
put it in a separate commit but maybe that's just personal preference.

I didn't look *too* closely at the tests but just one nit below:

> +       # test a udp connection
> +       info "send udp data"
> +       ip netns exec server sh -c 'cat ${ovs_dir}/fifo | nc -l -vv -u 8888 >${ovs_dir}/fifo 2>${ovs_dir}/s1-nc.log & echo $! > ${ovs_dir}/server.pid'

There are multiple netcat implementations with different arguments
(BSD and nmap.org and maybe also Debian versions). Might be nice to
point out which netcat you're relying on here or try to detect & fail
out/skip on the wrong one. For reference, the equivalent OVS test code
detection is here:


More information about the dev mailing list