[ovs-dev] ovn ping from VM to external gateway IP failed.

Numan Siddique nusiddiq at redhat.com
Tue Jan 3 04:59:59 UTC 2017


On Tue, Jan 3, 2017 at 2:06 AM, Mickey Spiegel <mickeys.dev at gmail.com>
wrote:

>
> On Mon, Jan 2, 2017 at 3:46 AM, Numan Siddique <nusiddiq at redhat.com>
> wrote:
>
>>
>>
>> On Mon, Jan 2, 2017 at 2:07 AM, Mickey Spiegel <mickeys.dev at gmail.com>
>> wrote:
>>
>>>
>>> On Sun, Jan 1, 2017 at 10:31 AM, Numan Siddique <nusiddiq at redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Sun, Jan 1, 2017 at 6:39 AM, Mickey Spiegel <mickeys.dev at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On Sat, Dec 31, 2016 at 1:19 AM, Mickey Spiegel <mickeys.dev at gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On Fri, Dec 30, 2016 at 11:37 AM, Mickey Spiegel <mickeys.dev at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 30, 2016 at 7:46 AM, Numan Siddique <nusiddiq at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Fri, Dec 30, 2016 at 5:36 PM, Dong Jun <dongj at dtdream.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>
>>>>>>> <snip>
>>>>>>>
>>>>>>>
>>>>>>>> Hi Dong Jun, I am also facing the same issue on my setup.
>>>>>>>> These are the findings of my investigation so far.
>>>>>>>>
>>>>>>>> This issue appears after commit
>>>>>>>> https://github.com/openvswitch/ovs/commit/f1a8bd06d58f2c5312622fbaeacbc6ce7576e347
>>>>>>>> which removes the use of patch ports and uses the clone action
>>>>>>>> instead.
>>>>>>>>
>>>>>>>> When I revert to the commit just before it, SNAT/DNAT works as
>>>>>>>> expected.
>>>>>>>> In my case, the gateway router is hosted on node 1, and I am trying
>>>>>>>> to reach a VM (192.168.0.5) hosted on node 2 using its associated
>>>>>>>> external IP (10.2.7.105). I can see that node 1 sends the packet to
>>>>>>>> node 2 through the Geneve tunnel, but the flows on node 2 drop it.
>>>>>>>>
>>>>>>>> Below is the tcpdump output for the packet in the failing case:
>>>>>>>>
>>>>>>>> **************************
>>>>>>>> 19:39:44.709907 IP 182.16.0.16.60069 > 182.16.0.15.geneve: Geneve, Flags [none], vni 0x1: IP nusiddiq.blr.redhat.com > 192.168.0.5: ICMP echo request, id 13240, seq 1, length 64
>>>>>>>> ***************************
>>>>>>>>
>>>>>>>> Below is the tcpdump output with an ovn-controller built without
>>>>>>>> the above commit (the working case):
>>>>>>>>
>>>>>>>> **************************
>>>>>>>> 19:41:56.783570 IP 182.16.0.12.29778 > 182.16.0.15.geneve: Geneve, Flags [C], vni 0x1, options [8 bytes]: IP nusiddiq.blr.redhat.com > 192.168.0.5: ICMP echo request, id 13308, seq 1, length 64
>>>>>>>> 19:41:56.784270 IP 182.16.0.15.14539 > 182.16.0.12.geneve: Geneve, Flags [C], vni 0xf, options [8 bytes]: IP 192.168.0.5 > nusiddiq.blr.redhat.com: ICMP echo reply, id 13308, seq 1, length 64
>>>>>>>> **************************
>>>>>>>>
>>>>>>>> The options data is 00030005.
>>>>>>>>
>>>>>>>> From the captures, I can see that the packet from node 1 is missing
>>>>>>>> the Geneve option field that carries the logical inport and outport
>>>>>>>> keys.
>>>>>>>>
>>>>>>>
>>>>>>> I am facing the same issue running my distributed NAT patch set.
>>>>>>> Between the UNSNAT recirculation and the output to the tunnel, a
>>>>>>> megaflow is installed that is missing the Geneve option fields.
>>>>>>>
>>>>>>> I verified that the table=32 OpenFlow rule has the Geneve option
>>>>>>> fields. ofproto/trace shows geneve in the "Datapath actions" at the
>>>>>>> end, so there is no problem with whatever path ofproto/trace uses.
>>>>>>>
>>>>>>
>>>>>> Throwing some logs in, I see that flow->metadata.present.map is 0
>>>>>> rather than 1 coming into tun_metadata_to_geneve_nlattr() in
>>>>>> lib/tun-metadata.c when the problem occurs. That is why the Geneve
>>>>>> option fields are missing.
>>>>>>
>>>>>> I have not yet figured out why flow->metadata.present.map is 0. It
>>>>>> should be set when tun_metadata_write() is called by actions that set
>>>>>> tunnel metadata values. I have not checked that yet.
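A toy model of why a zero presence map drops the options. It assumes, as a simplification, that the encoder only emits options whose bit is set in a presence bitmap, and that writing a value is supposed to set that bit. The toy_tun_metadata_* names are hypothetical; this is not the actual lib/tun-metadata.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OPTS 64

/* Hypothetical, simplified tunnel metadata: a presence bitmap plus
 * fixed-size option slots.  Not the real OVS structure. */
struct toy_tun_metadata {
    uint64_t present_map;      /* Bit i set => option slot i is valid. */
    uint32_t opts[MAX_OPTS];   /* Option values. */
};

/* Write path: storing a value must also mark it present. */
static void
toy_tun_metadata_write(struct toy_tun_metadata *md, unsigned idx,
                       uint32_t value)
{
    assert(idx < MAX_OPTS);
    md->opts[idx] = value;
    md->present_map |= UINT64_C(1) << idx;
}

/* Encode path: only options whose presence bit is set are emitted.
 * Returns the number of option values copied into 'out'. */
static size_t
toy_tun_metadata_encode(const struct toy_tun_metadata *md,
                        uint32_t out[MAX_OPTS])
{
    size_t n = 0;
    for (unsigned i = 0; i < MAX_OPTS; i++) {
        if (md->present_map & (UINT64_C(1) << i)) {
            out[n++] = md->opts[i];
        }
    }
    return n;
}
```

If the value is written but the presence bit is lost somewhere between the write and the encode step, the encoder emits nothing even though the value is still stored, which matches the missing options in the failing capture.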
>>>>>>
>>>>>
>>>>> I just posted a fix. I did not try it with the gateway router or with
>>>>> OpenStack, but with this bug fix all distributed NAT manual test cases
>>>>> now pass.
>>>>>
>>>>>
>>>> Thanks for the fix. I just tested it. It works when I try to reach the
>>>> VM using its floating IP, but not when I ping www.google.com from the
>>>> VM (the SNAT use case).
>>>>
>>>
>>> With distributed NAT, most of my debugging and tests were using SNAT.
>>> The bug fix that I posted fixed the problem that was causing ICMP echo
>>> replies to be dropped. The openflow path for distributed SNAT is similar to
>>> that for SNAT on gateway routers, but there are still some differences,
>>> notably one router instead of two routers and no "join" switch. Also I did
>>> not try it with DNS.
>>>
>>> Are you able to debug further, to see whether a missing geneve options
>>> field is still the culprit?
>>> It is possible that removal of patch ports within br-int uncovered other
>>> issues.
>>>
>>
>>
>> With some testing, I could see the following on the node where the
>> gateway is hosted:
>>  - The reply packet reaches the gateway router pipeline -> the otls
>> switch pipeline (via clone) -> the router pipeline -> the peer port
>> of the switch.
>> The packet gets dropped at table 22:
>>
>>  table=22, n_packets=275, n_bytes=26686, priority=65535,ct_state=+inv+trk,metadata=0x1
>> actions=drop
>>
>> Not sure why it is happening. I will try to debug further.
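A minimal sketch of how a masked ct_state match such as +inv+trk is evaluated. The bit values are assumed to follow the OVS datapath's OVS_CS_F_* flags, and the struct and function names are hypothetical. A "+flag" term sets that bit in both the value and the mask, so the drop rule above fires for any packet whose ct_state has both the invalid and tracked bits set, whatever its other bits are.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed ct_state bit values (OVS_CS_F_* in the datapath interface). */
#define CS_NEW          0x01
#define CS_ESTABLISHED  0x02
#define CS_RELATED      0x04
#define CS_REPLY_DIR    0x08
#define CS_INVALID      0x10
#define CS_TRACKED      0x20

/* Hypothetical OpenFlow-style masked match: the packet matches when the
 * bits selected by 'mask' equal 'value'. */
struct ct_state_match {
    uint32_t value;
    uint32_t mask;
};

static bool
ct_state_matches(uint32_t ct_state, struct ct_state_match m)
{
    /* Value bits outside the mask would make the match meaningless. */
    assert((m.value & ~m.mask) == 0);
    return (ct_state & m.mask) == m.value;
}
```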
>>
>
> I added stateful ACLs, but I am unable to reproduce this. Nothing hits
> the invalid ct_state flow, trying switch -> router -> switch, across
> localnet at the end, and with various distributed NAT flavors including
> DNAT and SNAT. The pings always succeed.
>
> As I suggested on IRC, I think that conntrack state should be cleared
> when crossing an OVN patch port. Specifically, in
> ovn/controller/physical.c,
> inside the clone, it should clear ct_state (MFF_CT_STATE, be16),
> ct_mark (MFF_CT_MARK, be32), and ct_label (MFF_CT_LABEL, be128).
>

Thanks for the suggestion. I couldn't clear the ct fields in the clone
action because they are not writable fields.
Instead, I tried the patch below, and it worked.

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 44fe3d1..7a4b782 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4332,7 +4332,11 @@ static void
 compose_clone_action(struct xlate_ctx *ctx, const struct ofpact_nest *oc)
 {
     struct flow old_flow = ctx->xin->flow;
+    bool old_conntrack = ctx->conntracked;
+    ctx->conntracked = false;
+    clear_conntrack(&ctx->xin->flow);
     do_xlate_actions(oc->actions, ofpact_nest_get_action_len(oc), ctx);
+    ctx->conntracked = old_conntrack;
     ctx->xin->flow = old_flow;
 }
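The save/clear/restore pattern in the patch above can be illustrated with a self-contained model. Everything here is hypothetical (toy_* names, a three-field flow); it only mirrors the shape of compose_clone_action() after the patch: the nested actions start with no conntrack state, and the caller's flow and conntracked flag are restored afterward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-ins for struct flow and
 * struct xlate_ctx; the real OVS structures have many more fields. */
struct toy_flow {
    uint32_t ct_state;
    uint32_t ct_mark;
    uint8_t  ct_label[16];   /* 128-bit label. */
};

struct toy_ctx {
    struct toy_flow flow;
    bool conntracked;
};

typedef void toy_actions_fn(struct toy_ctx *ctx, void *aux);

/* Resets the conntrack fields (a simplified stand-in for clear_conntrack()). */
static void
toy_clear_conntrack(struct toy_flow *flow)
{
    flow->ct_state = 0;
    flow->ct_mark = 0;
    memset(flow->ct_label, 0, sizeof flow->ct_label);
}

/* Runs 'nested' with conntrack state cleared, then restores the caller's
 * flow and conntracked flag, as the patched clone translation does. */
static void
toy_compose_clone(struct toy_ctx *ctx, toy_actions_fn *nested, void *aux)
{
    assert(nested != NULL);
    struct toy_flow old_flow = ctx->flow;
    bool old_conntracked = ctx->conntracked;

    ctx->conntracked = false;
    toy_clear_conntrack(&ctx->flow);
    nested(ctx, aux);

    ctx->conntracked = old_conntracked;
    ctx->flow = old_flow;
}

/* Example nested action: records the ct_state it sees, then changes the
 * state as if it had run ct() itself. */
static void
toy_observe_ct_state(struct toy_ctx *ctx, void *aux)
{
    *(uint32_t *) aux = ctx->flow.ct_state;
    ctx->flow.ct_state = 0x21;   /* Arbitrary state set inside the clone. */
    ctx->conntracked = true;
}
```

Saving conntracked separately is needed because it lives in the translation context rather than in the flow, so restoring old_flow alone would not undo a change made inside the clone.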

Thanks
Numan


>
> Mickey
>
>
>>
>> Numan
>>
>>
>>
>>> I primarily used ovs-dpctl dump-flows to see installed megaflows, ovs-appctl
>>> ofproto/trace (with recirc_id), and ovs-ofctl dump-flows for initial
>>> debugging. In particular I could see that the installed megaflows were
>>> lacking the geneve options field in the actions.
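When scanning ovs-dpctl dump-flows output for this symptom, a rough text check like the one below can help. It assumes the dump prints actions after an "actions:" prefix and spells tunnel setting as set(tunnel(...)) with options as geneve(...); the helper name is hypothetical, and the flow lines used to exercise it are illustrative, not captured from this setup.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if a datapath flow line sets a tunnel in its actions but
 * no geneve(...) options appear after it.  Rough heuristic: it only
 * looks at text following the first set(tunnel(. */
static bool
megaflow_lacks_geneve_opts(const char *line)
{
    assert(line != NULL);
    const char *actions = strstr(line, "actions:");
    if (!actions) {
        return false;               /* Not a flow line. */
    }
    const char *tunnel = strstr(actions, "set(tunnel(");
    if (!tunnel) {
        return false;               /* Does not output to a tunnel. */
    }
    return strstr(tunnel, "geneve(") == NULL;
}
```

A flow with several tunnel outputs would need real parsing of the action list rather than substring checks.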
>>>
>>> Mickey
>>>
>>>
>>>> Numan
>>>>
>>>>
>>>>> Mickey
>>>>>
>>>>>
>>>>>> Mickey
>>>>>>
>>>>>>
>>>>>>> Mickey
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Numan
>>>>>>>>
>>>>>>>>
>>>>>>>> > _______________________________________________
>>>>>>>> > dev mailing list
>>>>>>>> > dev at openvswitch.org
>>>>>>>> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

