[ovs-discuss] [openvswitch 2.9.2] testsuite: 8 978 failed on Fedora Rawhide after fa9a62453ea4

Tue Jul 3 18:32:51 UTC 2018

On Tue, Jul 03, 2018 at 12:32:26PM +0200, Timothy Redaelli wrote:
> Hi,
> I'm debugging failures in "8: vsctl-bashcomp - argument completion"
> and "978: ofproto - ofport_request" on Fedora Rawhide that prevent me
> from releasing OVS 2.9.2. The two tests fail for the same root cause,
> and the failures are present on current ovs master too.
> 
> After a bisect I found that the problematic commit is fa9a62453ea4
> ("ovsdb: Introduce experimental support for clustered databases.").
> 
> I can only see the problem on Fedora Rawhide, since it has some
> debugging kernel config options enabled that make the problem more
> visible. If I rebuild the same kernel without the debugging options,
> the problem is not easily reproducible.
> 
> After further analysis I found that some ovs-vsctl commands (for
> example "ovs-vsctl get-manager", which is used by
> "ovs-vsctl-bashcomp.bash") sometimes generate a "Connection reset by
> peer" warning in ovsdb-server.log. With kernel-debug, this becomes:
> "2018-07-03T09:53:36.401Z|00038|jsonrpc|WARN|Dropped 23 log messages in
> last 11 seconds (most recently, 1 seconds ago) due to excessive rate"
> That message makes the test fail, since `check_logs` (ofproto-macros.at)
> doesn't ignore the "Dropped X log messages" log message.
> 
> In check_logs I can read the following comments:
> # We most notably ignore 'Broken pipe' warnings.  These often and
> # intermittently appear in ovsdb-server.log, because *ctl commands
> # (e.g. ovs-vsctl, ovn-nbctl) exit right after committing a change to the
> # database.  However, in reaction, some daemon may immediately update the
> # database, and this later update may cause database sending update back to
> # *ctl command if *ctl has not exited yet.  If *ctl command exits before
> # the database calls send, the send fails with 'Broken pipe'.  Also removes
> # all "connection reset" warning logs for similar reasons (either EPIPE or
> # ECONNRESET can be returned on a send depending on whether the peer had
> # unconsumed data when it closed the socket).
> 
> So I don't know what a good approach would be, since after the
> fa9a62453ea4 commit this happens not "often" but almost "always" (on
> kernel-debug).
> 
> Do you have any ideas?

Thanks for the report.

Does this fix the problem?
https://patchwork.ozlabs.org/patch/938851/
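
For readers following along, the kind of log filtering that the quoted
check_logs comment describes can be sketched roughly as below. This is
only an illustration in Python, not the actual ofproto-macros.at code
(which is a shell/Autotest macro), and not necessarily what the linked
patch does. The pattern list, the severity regex, and the function name
`unexpected_lines` are all assumptions made for the example; in
particular, ignoring the "Dropped N log messages" summary line is just
one possible approach to the problem reported above.

```python
import re

# Hypothetical ignore list, loosely modelled on the check_logs comment:
# 'Broken pipe' and 'connection reset' warnings are expected noise.
IGNORED = [
    re.compile(r"Broken pipe"),
    re.compile(r"connection reset", re.IGNORECASE),
    # Assumption for this sketch: also ignore the rate limiter's summary line.
    re.compile(r"Dropped \d+ log messages in last \d+ seconds"),
]

# Lines at WARN severity or worse, using the '|'-separated log format
# seen in the quoted ovsdb-server.log message.
SEVERE = re.compile(r"\|(WARN|ERR|EMER)\|")


def unexpected_lines(log_text):
    """Return WARN-or-worse log lines that match none of the ignored patterns."""
    result = []
    for line in log_text.splitlines():
        if not SEVERE.search(line):
            continue  # INFO/DBG lines never fail the check
        if any(pattern.search(line) for pattern in IGNORED):
            continue  # expected noise
        result.append(line)
    return result
```

A test harness built this way would fail only when `unexpected_lines`
returns a non-empty list for a daemon's log.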