[ovs-discuss] OpenFlow port number leak causing OVN GW data-plane down

Han Zhou zhouhan at gmail.com
Thu Nov 8 07:01:20 UTC 2018

Hello folks,

I am writing to share a problem and its fix, and also to ask a question.
This week I found a problem that took the OVN GW data plane down. After
onboarding a hypervisor to an existing OVN deployment where a lot of
hypervisors and VMs had been running well, the BFD status of all tunnels on
all GW nodes suddenly went down, as shown by "ovs-vsctl show" (and of
course, all VMs lost connectivity).

Checking the ovs-vswitchd logs, I found the same messages on all GW nodes,
showing that ofp port number 65535 was used when creating the new tunnel
interface to the new hypervisor, e.g.:

2018-11-06T01:29:10.042Z|142103|dpif(ovs-vswitchd)|WARN|system at ovs-system:
failed to add ovn-aded97-0 as port: Device or resource busy
2018-11-06T01:29:10.045Z|142104|bridge(ovs-vswitchd)|INFO|bridge br-int:
added interface ovn-aded97-0 on port 65535
2018-11-06T01:29:11.479Z|142108|ofproto(ovs-vswitchd)|WARN|br-int: cannot
configure bfd on nonexistent port 65535
2018-11-06T01:29:11.479Z|142109|ofproto(ovs-vswitchd)|WARN|br-int: cannot
configure LLDP on nonexistent port 65535
2018-11-06T01:29:11.479Z|142110|ofproto(ovs-vswitchd)|WARN|br-int: cannot
configure datapath on nonexistent port 65535
2018-11-06T01:29:18.783Z|142117|bfd(ovs-vswitchd)|INFO|ovn-aded97-0: BFD
state change: admin_down->down "No Diagnostic"->"No Diagnostic".
2018-11-06T01:29:18.785Z|00061|bfd(monitor82)|INFO|Interface ovn-aded97-0
remote mult value 0 changed to 3
2018-11-06T01:29:18.785Z|00062|bfd(monitor82)|INFO|ovn-aded97-0: New remote
2018-11-06T01:29:18.773Z|142111|bridge(ovs-vswitchd)|INFO|bridge br-int:
deleted interface ovn-aded97-0 on port 65535
2018-11-06T01:29:18.779Z|142115|dpif(ovs-vswitchd)|WARN|system at ovs-system:
failed to add ovn-aded97-0 as port: Device or resource busy
2018-11-06T01:29:18.782Z|142116|bridge(ovs-vswitchd)|INFO|bridge br-int:
added interface ovn-aded97-0 on port 65535
2018-11-06T01:29:18.785Z|00064|bfd(monitor82)|WARN|ovn-aded97-0: Incorrect

After debugging the OVS code, here is the reason why 65535 is used as the
port number. The auto-generated port number range is 1 - 32768. If all the
numbers are used, the function alloc_ofp_port() returns OFPP_NONE, which is
65535. But the caller doesn't check whether the returned port is valid, and
just continues using this invalid number.
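
To make the failure mode concrete, here is a minimal sketch of an allocator
that returns a sentinel on exhaustion, together with a caller that checks for
it. This is a toy model, not the OVS code: the PortAllocator class and
add_port_checked() are hypothetical names, and only the 1 - 32768 range and
the OFPP_NONE value (65535) are taken from the description above.

```python
OFPP_NONE = 0xFFFF  # 65535, the "no port" sentinel described above


class PortAllocator:
    """Toy model of an auto-assigning port-number pool (not OVS code)."""

    def __init__(self, max_auto=32768):
        self.max_auto = max_auto  # top of the automatic range
        self.in_use = set()

    def alloc(self):
        # Scan the automatic range 1..max_auto for a free number.
        for port in range(1, self.max_auto + 1):
            if port not in self.in_use:
                self.in_use.add(port)
                return port
        return OFPP_NONE  # pool exhausted: sentinel, not a usable port

    def free(self, port):
        self.in_use.discard(port)


def add_port_checked(allocator, name):
    """Caller that rejects the sentinel instead of using it as a port."""
    port = allocator.alloc()
    if port == OFPP_NONE:
        raise RuntimeError(f"no free OpenFlow port number for {name}")
    return port
```

A caller that skips the comparison against OFPP_NONE would carry 65535
forward as if it were a real port, which matches the "added interface ... on
port 65535" lines in the logs.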

The setup doesn't have that many hypervisors and tunnels. The port numbers
were exhausted because of a port number leak in corner cases. In
particular, when the OVN SB has redundant chassis (with the same IP),
ovn-controller creates redundant tunnel interfaces. ovs-vswitchd fails to
add the redundant port to ofproto, but every time it tries to add the port,
it allocates a new port number without freeing it afterwards. In this
environment other events cause frequent ovsdb changes, and on every ovsdb
change ovs-vswitchd tries to add the redundant port again and leaks another
port number. Over a long period, ovs-vswitchd reaches a state where no
valid number is available, which triggers the problem above: 65535 is used
as the tunnel port number.
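
The leak can be illustrated with a small simulation. This is only a sketch
under assumptions: alloc() and retry_failed_add() are hypothetical helpers,
the pool is shrunk to a handful of numbers for readability, and "free on
failure" is one plausible way to stop the leak, not necessarily what the
submitted patches do.

```python
OFPP_NONE = 0xFFFF  # 65535, sentinel returned when no number is free


def alloc(in_use, max_auto):
    """Return the lowest free number in 1..max_auto, or OFPP_NONE."""
    for port in range(1, max_auto + 1):
        if port not in in_use:
            in_use.add(port)
            return port
    return OFPP_NONE


def retry_failed_add(in_use, max_auto, attempts, free_on_failure):
    """Simulate repeated attempts to add a port whose add always fails.

    Each attempt allocates a number, then the add is assumed to fail
    (as with "Device or resource busy"). Returns the number handed out
    on the last attempt.
    """
    last = None
    for _ in range(attempts):
        port = alloc(in_use, max_auto)
        last = port
        if free_on_failure and port != OFPP_NONE:
            in_use.discard(port)  # give the number back to the pool
    return last


leaky = set()
print(retry_failed_add(leaky, 8, 9, free_on_failure=False))  # 65535

fixed = set()
print(retry_failed_add(fixed, 8, 9, free_on_failure=True))   # 1
```

With a pool of 8 numbers, the leaking variant runs dry on the ninth attempt
and hands out 65535, while the variant that frees on failure keeps reusing
number 1.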

The recovery was pretty simple - just restart OVS on all GW nodes.

For these problems, I submitted two fixes:

(In addition, I am working on avoiding redundant entries in the OVN SB
chassis table.)

Now to my question. The time when all the GW BFD statuses went down matches
perfectly with the time when port number 65535 was first used. However, I
still don't understand why using port number 65535 would cause BFD status
down on all tunnels (to other GWs and all hypervisors). Could someone help
explain this, so that we can be confident there are no other potential
problems?
