[ovs-dev] [PATCH] ovn: Fix the test failures in travis CI.

Numan Siddique nusiddiq at redhat.com
Fri Jul 12 17:45:40 UTC 2019


On Fri, Jul 12, 2019 at 10:29 PM Ben Pfaff <blp at ovn.org> wrote:

> On Fri, Jul 12, 2019 at 07:09:42PM +0300, Ilya Maximets wrote:
> > On 12.07.2019 19:06, Ilya Maximets wrote:
> > > On 11.07.2019 19:00, nusiddiq at redhat.com wrote:
> > >> From: Numan Siddique <nusiddiq at redhat.com>
> > >>
> > >> After the commit [1], below test cases are failing repeatedly in travis CI.
> > >>
> > >> 2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597)
> > >> 2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844)
> > >> 2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580)
> > >> 2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624)
> > >>
> > >> With the commit [1], ovn-controller sends GARPs for the IPs of the distributed
> > >> router ports. The failing tests did not handle the situation if multiple GARPs
> > >> are sent. The failures are mostly timing related. This patch fixes these issues.
> > >>
> > >> [1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
> > >>
> > >> Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
> > >> CC: Ilya Maximets <i.maximets at samsung.com>
> > >> Signed-off-by: Numan Siddique <nusiddiq at redhat.com>
> > >> ---
> > >
> > > Hi.
> > > Thanks for working on this!
> > >
> > > I can confirm that this patch fixes frequent TravisCI failures.
> > > There are still some occasional failures of ovn tests, but they were
> > > always there. (OVN tests have some timing issues.)
> > >
> > > Tested-by: Ilya Maximets <i.maximets at samsung.com>
> > >
> > > However, I see that some failures were resolved by just removing the
> > > checks from tests. This somewhat decreases the test coverage.
> > > So, it'll be good to have a review from someone more familiar with
> > > these tests than me.
> > >
> > > Ben, what do you think about this patch?
> >
> > Oh. You just applied it. So, I assume, it's OK for you. =)
>
> If there's a better way to do it, I'm all for it.
>

There is an issue in ovn-northd after this commit [1]. The commit uses
the function sbrec_port_binding_update_nat_addresses_addvalue() to update
the Port_Binding.nat_addresses column.

Once this code is hit, ovn-northd wakes up from poll_block() continuously,
hogging the CPU.
I think this is why we are seeing these CI test issues.

From the ovn-northd logs, I can see the transaction messages below being
sent continuously.

*********
..........
2019-07-12T17:26:13.837Z|74511|poll_loop|DBG|wakeup due to [POLLIN] on fd 11 (<->/home/nusiddiq/workspace_cpp/openvswitch/ovs/tutorial/sandbox/sb1.ovsdb) at ../lib/stream-fd.c:157 (75% CPU usage)
2019-07-12T17:26:13.837Z|74512|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18628
2019-07-12T17:26:13.837Z|74513|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74514|jsonrpc|DBG|unix:sb1.ovsdb: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13 172.168.0.100 is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18629
2019-07-12T17:26:13.837Z|74515|poll_loop|DBG|wakeup due to [POLLIN] on fd 11 (<->/home/nusiddiq/workspace_cpp/openvswitch/ovs/tutorial/sandbox/sb1.ovsdb) at ../lib/stream-fd.c:157 (75% CPU usage)
2019-07-12T17:26:13.837Z|74516|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18629
2019-07-12T17:26:13.837Z|74517|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74518|jsonrpc|DBG|unix:sb1.ovsdb: send request, method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13 172.168.0.100 is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18630
2019-07-12T17:26:13.837Z|74519|poll_loop|DBG|wakeup due to [POLLIN] on fd 11 (<->/home/nusiddiq/workspace_cpp/openvswitch/ovs/tutorial/sandbox/sb1.ovsdb) at ../lib/stream-fd.c:157 (75% CPU usage)
2019-07-12T17:26:13.837Z|74520|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18630
****************
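
One rough way to put a number on this loop is to count how many
"transact" requests ovn-northd sends per second. The snippet below is only
a sketch: the function name count_transacts_per_second is made up for
illustration, and it assumes each log entry sits on a single line in the
log file itself and starts with the timestamp format shown above.

```shell
# Hypothetical helper (not part of OVS/OVN): print "<second> <count>" for
# every second in which jsonrpc "transact" requests were sent.
# Usage: count_transacts_per_second LOGFILE
count_transacts_per_second() {
    # Keep only the outgoing transact requests, cut each timestamp down
    # to one-second resolution (first 19 characters, for example
    # 2019-07-12T17:26:13), then count the occurrences of each second.
    grep -F 'send request, method="transact"' "$1" \
        | cut -c1-19 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

For example, "count_transacts_per_second sandbox/ovn-northd.log" (the log
path is an assumption and depends on the setup). A count that stays high
every second while the databases are idle would be consistent with the
busy loop described above.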

We are seeing frequent timing-related failures in the OpenStack CI tests
for networking-ovn.

It looks to me like there is an issue with these variants
(*update*_addvalue) of the IDL functions.
I am submitting a patch for ovn-northd to stop using this function, as we
need a fix to unblock the OpenStack CI gate.
But I think the actual issue is in the IDL client code.

The issue can be reproduced with the commands below in the sandbox with OVN
enabled.

*************
ovs-appctl -t ovn-northd vlog/set dbg

ovn-nbctl lr-add lr0
ovn-nbctl ls-add public
ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
ovn-nbctl lsp-add public public-lr0
ovn-nbctl lsp-set-type public-lr0 router
ovn-nbctl lsp-set-addresses public-lr0 router
ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public
ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20

ovn-nbctl lsp-add public ln-public
ovn-nbctl lsp-set-type ln-public localnet
ovn-nbctl lsp-set-addresses ln-public unknown
ovn-nbctl lsp-set-options ln-public network_name=public
ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20

ovn-nbctl lr-add lr0
ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
*************

Thanks
Numan





[1] -
https://github.com/openvswitch/ovs/commit/ed198fb3b92e2a0b1f594c22280803bfc2f66029#diff-2c35162acf6ad144624954fdc4c3d9f4

