[ovs-discuss] ovn-controller is taking 100% CPU all the time in one deployment

Thu Aug 29 18:39:50 UTC 2019

Hello Everyone,

In one of the OVN deployments, we are seeing 100% CPU usage by
ovn-controllers all the time.

After investigating, we found the following:

 - ovn-controller is taking more than 20 seconds to complete a full loop
(mainly in the lflow_run() function).

 - The physical switch is sending GARPs every 10 seconds.

 - ovn-bridge-mappings is configured, and these GARP packets reach
br-int via the patch port.

 - We have a flow in the router pipeline that applies the put_arp action
if the packet is an ARP packet.

 - The ovn-controller pinctrl thread receives these GARPs, stores the learnt
MAC-IP bindings in the 'put_mac_bindings' hmap, and notifies the
ovn-controller main thread by incrementing the seq no.

 - In the ovn-controller main thread, after lflow_run() finishes,
pinctrl_wait() is called. This function calls poll_immediate_wake() because
the 'put_mac_bindings' hmap is not empty.

 - This causes the ovn-controller poll_block() to not sleep at all, and this
repeats all the time, resulting in 100% CPU usage.
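
To make the timing argument above concrete, here is a small toy model in
Python. It is not ovn-controller code: count_sleeps() and its parameters are
hypothetical names, and it assumes the main thread drains the pending
bindings at the start of every iteration. It only illustrates that once one
loop iteration takes longer than the GARP interval, something is always
pending by the time the wait step runs, so the loop never sleeps.

```python
def count_sleeps(loop_seconds, garp_interval, iterations):
    """Count main-loop iterations that would actually sleep in poll_block().

    Toy model, not ovn-controller code. Assumption: the main thread drains
    the pending bindings at the start of every iteration.
    """
    now = 0
    next_garp = garp_interval      # arrival time of the next GARP
    sleeps = 0
    for _ in range(iterations):
        pending = False            # drained at the start of the iteration
        now += loop_seconds        # time spent in the loop body (lflow_run() etc.)
        while next_garp <= now:    # GARPs that arrived in the meantime
            pending = True         # the pinctrl thread queued a binding
            next_garp += garp_interval
        if pending:
            continue               # pinctrl_wait() -> poll_immediate_wake(), no sleep
        sleeps += 1                # nothing pending: poll_block() sleeps ...
        now = next_garp            # ... until the next GARP wakes the loop
        next_garp += garp_interval
    return sleeps
```

With the numbers from this deployment, count_sleeps(20, 10, 100) reports no
sleeping iterations at all, while a loop that finishes in 1 second,
count_sleeps(1, 10, 100), sleeps in every iteration.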

The deployment runs OVS/OVN 2.9. We have backported the pinctrl_thread
patch.

Some time back I reported an issue about lflow_run() taking a lot of time:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360414.html

I think we need to improve the logical flow processing sooner or later.

But to fix this issue urgently, we are thinking of the following approach:

 - pinctrl_thread will locally cache the mac_binding entries (just like it
caches the DNS entries). (Please note that pinctrl_thread cannot access the
SB DB IDL.)

 - Upon receiving any ARP packet (via the put_arp action), pinctrl_thread
will check the local mac_binding cache and will wake up the main
ovn-controller thread only if a mac_binding update is required.

This approach will solve the issue because the MACs advertised by the
physical switches do not change, so there is no need to wake up the
ovn-controller main thread.
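
The proposed check can be sketched roughly as follows. This is a Python
model only, not the actual patch: MacBindingCache and learn() are
hypothetical names, and keying the cache on (logical port, IP) is an
assumption made for the sketch.

```python
class MacBindingCache:
    """Hypothetical local cache of learnt bindings: (logical port, IP) -> MAC."""

    def __init__(self):
        self._entries = {}

    def learn(self, logical_port, ip, mac):
        """Record a binding seen via put_arp.

        Returns True only if the binding is new or its MAC changed, i.e.
        only when the main thread would need to be woken up.
        """
        key = (logical_port, ip)
        mac = mac.lower()              # compare MACs case-insensitively
        if self._entries.get(key) == mac:
            return False               # already known: no wake-up needed
        self._entries[key] = mac
        return True                    # new or changed: wake the main thread
```

A repeated GARP with an unchanged MAC then costs only a dictionary lookup in
the pinctrl thread:

```python
cache = MacBindingCache()
cache.learn("lrp0", "10.0.0.1", "aa:bb:cc:dd:ee:01")   # True: first time seen
cache.learn("lrp0", "10.0.0.1", "aa:bb:cc:dd:ee:01")   # False: unchanged
cache.learn("lrp0", "10.0.0.1", "aa:bb:cc:dd:ee:02")   # True: MAC changed
```

The sketch does not cover how the cache would be kept in sync when entries
are removed from the SB DB.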

In the present master/2.12, these GARPs will not cause this 100% CPU loop
issue because incremental processing will not recompute flows.

Even though the above approach is not really required for master/2.12, I
think it is still OK to have it, as it does no harm.

I would like to know your comments and any concerns.

Thanks
Numan