[ovs-discuss] Possible new issue with flow modification

Tony van der Peet tony.vanderpeet at gmail.com
Thu Sep 17 04:26:14 UTC 2015


Jarno

Yes, I can verify that your patch fixes my specially constructed test
case. Thanks
very much for this, I look forward to my next update from upstream
when I can throw
away my temporary fix.

Tony

On Wed, Sep 16, 2015 at 12:00 PM, Jarno Rajahalme <jrajahalme at nicira.com> wrote:
> Tony,
>
> Sorry that it took some time to get to this, and thanks for the detailed
> report, with it it was straightforward to add a test case verifying what is
> happening here.
>
> As confusing as it may be, the datapath flow dumps that show no vlan matches
> actually exact match the non-presence of vlans. This is due to how the
> netlink flow key interface has been designed, and relates to the fact that
> vlans are expressed with a specific ethertype, the vlan field, and the rest
> of the match headers in encapsulated netlink attributes. To enforce this,
> the Linux kernel module sets the vlan mask to all ones by default, which
> will then be overwritten only if there is an actual match on vlans. OVS
> userspace compensates for this when translating a netlink flow mask to OVS
> internal struct flow representation by also defaulting the vlan_tci field to
> all ones (0xffff). Hence, the existing code in revalidate_ukey() function is
> correct and does not need a fix.
>
> However, the userspace datapath, that is used for testing (I believe also
> with oftest), did not do anything special with the vlan_tci field, and was
> happy to wildcard it. Since the netlink format is still used when
> revalidating and dumping the userspace datapath flows, the same compensation
> described above was done also with userspace datapath flows, which indeed
> caused the existing megaflow to be modified and then matched regardless of
> the presence of vlan tags in incoming packets. I have sent out a patch that
> changes the userspace datapath to also force exact matching the vlan_tci
> field and adds a test case verifying that this works as intended.
>
> You can find the patch here:
> http://openvswitch.org/pipermail/dev/2015-September/060083.html
>
> It would be superb it you could verify that this fixes your test case
> problems.
>
> Regards,
>
>   Jarno
>
>
> On Aug 18, 2015, at 6:49 PM, Tony van der Peet <tony.vanderpeet at gmail.com>
> wrote:
>
> Ethan
>
> Thanks for the response. I have made some further progress, and in
> fact have discovered a slightly crude fix that allows my tests to
> pass. In ofproto-dpif-upcall.c:revalidate_ukey I have done this:
>
>    /* Since the kernel is free to ignore wildcarded bits in the mask, we
> can't
>     * directly check that the masks are the same.  Instead we check that the
>     * mask in the kernel is more specific i.e. less wildcarded, than what
>     * we've calculated here.  This guarantees we don't catch any packets we
>     * shouldn't with the megaflow. */
> // start of "fix"
>    /* To avoid all sorts of bother about special meanings of VLAN tags (VLAN
>     * present, VLAN absent, etc) let's just require the VLAN masks to be
> equal.
>     */
>    if (dp_mask.vlan_tci != wc.masks.vlan_tci) {
>        goto exit;
>    }
> // end of "fix"
>    dp64 = (uint64_t *) &dp_mask;
>    xout64 = (uint64_t *) &wc.masks;
>    for (i = 0; i < FLOW_U64S; i++) {
>        if ((dp64[i] | xout64[i]) != dp64[i]) {
>            goto exit;
>        }
>    }
>
> Obviously, "goto exit" cannot be wrong, it's just affecting
> performance. Whether it's the best possible fix is another question!
>
> This is the test I wrote using the oftest framework:
>
> class FlowInternalModify(MatchTest):
>    """
>    Match on ipv4 protocol field (UDP), then change to match on VLAN absence.
>    """
>    def runTest(self):
>        # Start out with no flows
>        delete_all_flows(self.controller)
>        do_barrier(self.controller)
>        time.sleep(2)
>
>        match = ofp.match([
>            ofp.oxm.eth_type(0x0800),
>            ofp.oxm.ip_proto(17),
>        ])
>
>        matching = {
>            "udp": simple_udp_packet(),
>        }
>
>        nonmatching = {
>            #"tcp": simple_tcp_packet(),
>            #"icmp": simple_icmp_packet(),
>        }
>
>        self.verify_match(match, matching, nonmatching)
>        time.sleep(3)
>
>        # Interlude, delete all flows
>        delete_all_flows(self.controller)
>        do_barrier(self.controller)
>        time.sleep(2)
>
>        match = ofp.match([
>            ofp.oxm.vlan_vid(ofp.OFPVID_NONE),
>        ])
>
>        matching = {
>            "no vlan tag": simple_udp_packet()
>        }
>
>        nonmatching = {
>            "vid=2 pcp=3": simple_udp_packet(dl_vlan_enable=True,
> vlan_vid=2, vlan_pcp=3),
>            "vid=0 pcp=7": simple_udp_packet(dl_vlan_enable=True,
> vlan_vid=0, vlan_pcp=7),
>            "vid=2 pcp=0": simple_udp_packet(dl_vlan_enable=True,
> vlan_vid=2, vlan_pcp=0),
>        }
>
>        self.verify_match(match, matching, nonmatching)
>
> Here's a run of this test without the "fix", with rules and flows
> being shown alternately while the test runs (some editing has been
> carried out).
>
> awplus#sh open flow
>
>
> awplus#sh open rule
> table_id=254, duration=57s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=57s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=57s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=57s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
>
> Test just started, deleted all rules and flows are aged out
>
>
> awplus#sh open rule
> duration=0s, n_packets=0, n_bytes=0, priority=1000,udp,actions=output:2
> duration=0s, n_packets=0, n_bytes=0, priority=1,actions=CONTROLLER:65535
> table_id=254, duration=61s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=61s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=61s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=61s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
> awplus#sh open flow
> recirc_id(0),in_port(1),eth_type(0x0800),ipv4(proto=17,frag=no),
> packets:0, bytes:0, used:never, actions:2
>
> First part of test, rule and flow are OK and in sync
>
>
> awplus#sh open rule
> table_id=254, duration=65s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=65s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=65s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=65s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
> awplus#sh open flow
> recirc_id(0),in_port(1),eth_type(0x0800),ipv4(proto=17,frag=no),
> packets:1, bytes:104, used:2.768s, actions:drop
>
> Rule has been deleted but flow hangs around, but with its action changed to
> drop, as expected.
>
>
> awplus#sh open rule
> duration=1s, n_packets=0, n_bytes=0,
> priority=1000,vlan_tci=0x0000/0x1fff,actions=output:2
> duration=1s, n_packets=0, n_bytes=0, priority=1,actions=CONTROLLER:65535
> table_id=254, duration=70s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=70s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=70s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=70s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
> awplus#sh open flow
> recirc_id(0),in_port(1),eth_type(0x0800),ipv4(proto=17,frag=no),
> packets:4, bytes:416, used:0.219s, actions:2
>
> New rule has been put in place, but whoops! old flow is still there and has
> inherited the new rule's
> action. This is not so good because the flow will take UDP packets
> regardless of the presence of
> a VLAN tag.
>
>
> awplus#sh open rule
> duration=11s, n_packets=3, n_bytes=312,
> priority=1000,vlan_tci=0x0000/0x1fff,actions=output:2
> duration=11s, n_packets=0, n_bytes=0, priority=1,actions=CONTROLLER:65535
> table_id=254, duration=80s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=80s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=80s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=80s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
> awplus#sh open flow
>
> Test over and flow has aged out. Test does not get rid of rule.
>
>
>
> WIth the above mentioned fix in place, here's what happens:
>
> awplus#sh open flow
>
>
> awplus#sh open rule
> table_id=254, duration=21s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=21s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=21s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=21s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
>
> Before test runs, switch has just rebooted, so nothing there.
>
>
> awplus#sh open rule
> duration=0s, n_packets=0, n_bytes=0, priority=1000,udp,actions=output:2
> duration=0s, n_packets=0, n_bytes=0, priority=1,actions=CONTROLLER:65535
> table_id=254, duration=31s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=31s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=31s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=31s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
> awplus#sh open flow
> recirc_id(0),in_port(1),eth_type(0x0800),ipv4(proto=17,frag=no),
> packets:0, bytes:0, used:never, actions:2
>
> First part of test, same as before, flow and rule are in sync
>
>
> awplus#sh open rule
> table_id=254, duration=35s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=35s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=35s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=35s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
> awplus#sh open flow
>
>
> Rule deleted in the interlude, and my new "fix" causes the flow to be
> deleted.
>
>
> awplus#sh open rule
> duration=0s, n_packets=0, n_bytes=0,
> priority=1000,vlan_tci=0x0000/0x1fff,actions=output:2
> duration=0s, n_packets=0, n_bytes=0, priority=1,actions=CONTROLLER:65535
> table_id=254, duration=38s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=38s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=38s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=38s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
> awplus#sh open flow
> recirc_id(0),in_port(1),eth_type(0x0800),ipv4(frag=no), packets:0,
> bytes:0, used:never, actions:2
> recirc_id(0),in_port(1),eth_type(0x8100),vlan(vid=0,pcp=7/0x0),encap(eth_type(0x0800),ipv4(frag=no)),
> packets:1, bytes:100, used:
> 0.368s, actions:userspace(pid=0,slow_path(controller))
> recirc_id(0),in_port(1),eth_type(0x8100),vlan(vid=2),encap(eth_type(0x0800),ipv4(frag=no)),
> packets:3, bytes:300, used:0.344s, ac
> tions:userspace(pid=0,slow_path(controller))
>
> The second part of the test is running and things are looking more like they
> do
> when the second part is run as a test on its own.
>
>
> awplus#sh open rule
> duration=11s, n_packets=2, n_bytes=204,
> priority=1000,vlan_tci=0x0000/0x1fff,actions=output:2
> duration=11s, n_packets=6, n_bytes=600, priority=1,actions=CONTROLLER:65535
> table_id=254, duration=50s, n_packets=0, n_bytes=0,
> priority=2,recirc_id=0,actions=drop
> table_id=254, duration=50s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x1,actions=controller(reason=no_match)
> table_id=254, duration=50s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x2,actions=drop
> table_id=254, duration=50s, n_packets=0, n_bytes=0,
> priority=0,reg0=0x3,actions=drop
> awplus#sh open flow
>
> And eventually the flow times out.
>
>
> So that's it!
>
> To answer your question, you can see that the flow hangs around, with
> its action changed, but it is no longer really representative of the
> new rule since it covers all UDP packets regardless of whether there
> is a VLAN tag in the packet or not. The non-matching cases are meant
> to be sent to the controller, but the flow sends them to port 2
> instead.
>
> By the way, your apology was totally unnecessary! You and the team do
> such a great job on this code and quality is so high that I moved to
> tip of master a long time ago and have never regretted it. The number
> of issues I find in the code is tiny. Thanks for all your efforts.
>
> Tony
>
>
> On Wed, Aug 19, 2015 at 12:07 PM, Ethan Jackson <ethan at nicira.com> wrote:
>
> Hi Tony,
>
> Sorry you're running into this, it does sound like a problem this
> patch could cause.  I suppose what I'm having trouble working out here
> is whether this is an actual bug in the patch, or if you're test
> framework is making assumptions about how the code works which are no
> longer valid.  Any additional information you have on your setup would
> be especially helpful, especially if you can whittle it down into a
> somewhat minimal series of steps which reproduces the problem.
>
> Some comments inline:
>
> In a nutshell, flows created for earlier tests are being kept alive
> and having their actions modified, but then the modification as a
> result of a later test causes that later test to fail.
>
>
> Correct, as a result of this patch, we try to modify flows in the
> datapath in place, rather than deleting them when their underlying
> openflow rules changes.
>
>
> when the rule becomes:
>
> duration=2s, n_packets=0, n_bytes=0,
> priority=1000,vlan_tci=0x0000/0x1fff,actions=output:2
>
>
> So what you're saying is, you add a rule that matches on VLAN tag, but
> the datapath rule corresponding to it never updates?  So all your
> traffic forwards based on the old rule not the new one?
>
> Ethan
>
>
> Disregarding the flow's age, any UDP packet sent, VLAN or not, will
> result in output:2, but this is not the correct action for all packets
> (those with a VLAN tag).
>
> My guess is that this is fairly deep and beyond my limited experience
> with OpenVSwitch. However, I will continue to work this issue, but
> thought you might like to know. And any advice much appreciated.
>
> Tony
> _______________________________________________
> discuss mailing list
> discuss at openvswitch.org
> http://openvswitch.org/mailman/listinfo/discuss
>
>
>



More information about the discuss mailing list