<div dir="ltr"><div>Hi Han,</div><div><br></div><div>I am thinking of this approach to solve this problem. I still need to test it.</div><div>If you have any comments or concerns, do let me know.</div><div><br></div><div><br></div><div>**************************************************</div><div>diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c<br>index 9a2222282..a83b56362 100644<br>--- a/northd/ovn-northd.c<br>+++ b/northd/ovn-northd.c<br>@@ -6552,6 +6552,41 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,<br> <br> }<br> <br>+ /* Handle GARP reply packets received on a distributed router gateway<br>+ * port. GARP reply broadcast packets could be sent by external<br>+ * switches. We don't want them to be handled by all the<br>+ * ovn-controllers that receive them. So add a priority-92 flow to<br>+ * apply the put_arp action on the redirect chassis and drop them on<br>+ * other chassis.<br>+ * Note that we are already adding a priority-90 logical flow in the<br>+ * table S_ROUTER_IN_IP_INPUT to apply the put_arp action if<br>+ * arp.op == 2.<br>+ */<br>+ if (op->od->l3dgw_port && op == op->od->l3dgw_port<br>+ && op->od->l3redirect_port) {<br>+ for (int i = 0; i < op->lrp_networks.n_ipv4_addrs; i++) {<br>+ ds_clear(&match);<br>+ ds_put_format(&match,<br>+ "inport == %s && is_chassis_resident(%s) && "<br>+ "eth.bcast && arp.op == 2 && arp.spa == %s/%u",<br>+ op->json_key, op->od->l3redirect_port->json_key,<br>+ op->lrp_networks.ipv4_addrs[i].network_s,<br>+ op->lrp_networks.ipv4_addrs[i].plen);<br>+ ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_INPUT, 92,<br>+ ds_cstr(&match),<br>+ "put_arp(inport, arp.spa, arp.sha);");<br>+ ds_clear(&match);<br>+ ds_put_format(&match,<br>+ "inport == %s && !is_chassis_resident(%s) && "<br>+ "eth.bcast && arp.op == 2 && arp.spa == %s/%u",<br>+ op->json_key, op->od->l3redirect_port->json_key,<br>+ op->lrp_networks.ipv4_addrs[i].network_s,<br>+ op->lrp_networks.ipv4_addrs[i].plen);<br>+ ovn_lflow_add(lflows, op->od, 
S_ROUTER_IN_IP_INPUT, 92,<br>+ ds_cstr(&match), "drop;");<br>+ }<br>+ }<br>+<br> /* A set to hold all load-balancer vips that need ARP responses. */<br> struct sset all_ips = SSET_INITIALIZER(&all_ips);<br> int addr_family;<br></div><div>*************************************************</div><div><br></div><div>If a physical switch sends GARP request packets, we have existing logical flows</div><div>which handle them only on the gateway chassis.</div><div><br></div><div>But if the physical switch sends GARP reply packets, then these packets</div><div>are handled by ovn-controllers where bridge mappings are configured.</div><div>I think it's good enough if only the gateway chassis handles these packets.</div><div><br></div><div>In the deployment where we are seeing this issue, the physical switch sends GARP reply</div><div>packets.</div><div><br></div><div>Thanks</div><div>Numan</div><div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Aug 30, 2019 at 11:50 PM Han Zhou <<a href="mailto:zhouhan@gmail.com">zhouhan@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, Aug 30, 2019 at 6:46 AM Mark Michelson <<a href="mailto:mmichels@redhat.com" target="_blank">mmichels@redhat.com</a>> wrote:<br>
><br>
> On 8/30/19 5:39 AM, Daniel Alvarez Sanchez wrote:<br>
> > On Thu, Aug 29, 2019 at 10:01 PM Mark Michelson <<a href="mailto:mmichels@redhat.com" target="_blank">mmichels@redhat.com</a>><br>
wrote:<br>
> >><br>
> >> On 8/29/19 2:39 PM, Numan Siddique wrote:<br>
> >>> Hello Everyone,<br>
> >>><br>
> >>> In one of the OVN deployments, we are seeing 100% CPU usage by<br>
> >>> ovn-controllers all the time.<br>
> >>><br>
> >>> After investigations we found the below<br>
> >>><br>
> >>> - ovn-controller is taking more than 20 seconds to complete a full<br>
loop<br>
> >>> (mainly in lflow_run() function)<br>
> >>><br>
> >>> - The physical switch is sending GARPs periodically every 10<br>
seconds.<br>
> >>><br>
> >>> - There is ovn-bridge-mappings configured and these GARP packets<br>
> >>> reach br-int via the patch port.<br>
> >>><br>
> >>> - We have a flow in router pipeline which applies the action -<br>
put_arp<br>
> >>> if it is arp packet.<br>
> >>><br>
> >>> - ovn-controller pinctrl thread receives these garps, stores the<br>
> >>> learnt mac-ips in the 'put_mac_bindings' hmap and notifies the<br>
> >>> ovn-controller main thread by incrementing the seq no.<br>
> >>><br>
> >>> - In the ovn-controller main thread, after lflow_run() finishes,<br>
> >>> pinctrl_wait() is called. This function calls - poll_immediate_wake()<br>
as<br>
> >>> 'put_mac_bindings' hmap is not empty.<br>
> >>><br>
> >>> - This causes the ovn-controller poll_block() to not sleep at all and<br>
> >>> this repeats all the time resulting in 100% cpu usage.<br>
> >>><br>
> >>> The deployment has OVS/OVN 2.9. We have back ported the<br>
pinctrl_thread<br>
> >>> patch.<br>
> >>><br>
> >>> Some time back I had reported an issue about lflow_run() taking a lot of<br>
> >>> time -<br>
<a href="https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360414.html" rel="noreferrer" target="_blank">https://mail.openvswitch.org/pipermail/ovs-dev/2019-July/360414.html</a><br>
> >>><br>
> >>> I think we need to improve the logical processing sooner or later.<br>
> >><br>
> >> I agree that this is very important. I know that logical flow<br>
processing<br>
> >> is the biggest bottleneck for ovn-controller, but 20 seconds is just<br>
> >> ridiculous. In your scale testing, you found that lflow_run() was<br>
taking<br>
> >> 10 seconds to complete.<br>
> > I support this statement 100% (20 seconds is just ridiculous). To be<br>
> > precise, in this deployment we see over 23 seconds for the main loop<br>
> > to process and I've seen even 30 seconds some times. I've been talking<br>
> > to Numan these days about this issue and I support profiling this<br>
> > actual deployment so that we can figure out how incremental processing<br>
> > would help.<br>
> ><br>
> >><br>
> >> I'm curious if there are any factors in this particular deployment's<br>
> >> configuration that might contribute to this. For instance, does this<br>
> >> deployment have a glut of ACLs? Are they not using port groups?<br>
> > They're not using port groups because it's 2.9 and it is not there.<br>
> > However, I don't think port groups would make a big difference in<br>
> > terms of ovn-controller computation. I might be wrong but Port Groups<br>
> > help reduce the number of ACLs in the NB database while the # of<br>
> > Logical Flows would still remain the same. We'll try to get the<br>
> > contents of the NB database and figure out what's killing it.<br>
> ><br>
><br>
> You're right that port groups won't reduce the number of logical flows.<br>
<br>
I think port-group reduces number of logical flows significantly, and also<br>
reduces OVS flows when conjunctive matches are effective.<br>
Please see my calculation here:<br>
<a href="https://www.slideshare.net/hanzhou1978/large-scale-overlay-networks-with-ovn-problems-and-solutions/30" rel="noreferrer" target="_blank">https://www.slideshare.net/hanzhou1978/large-scale-overlay-networks-with-ovn-problems-and-solutions/30</a><br>
<br>
> However, it can reduce the computation in ovn-controller. The reason is<br>
> that the logical flows generated by ACLs that use port groups may result<br>
> in conjunctive matches being used. If you want a bit more information,<br>
> see the "Port groups" section of this blog post I wrote:<br>
><br>
><br>
<a href="https://developers.redhat.com/blog/2019/01/02/performance-improvements-in-ovn-past-and-future/" rel="noreferrer" target="_blank">https://developers.redhat.com/blog/2019/01/02/performance-improvements-in-ovn-past-and-future/</a><br>
><br>
> The TL;DR is that with port groups, I saw the number of OpenFlow flows<br>
> generated by ovn-controller drop by 3 orders of magnitude. And that<br>
> meant that flow processing was 99% faster for large networks.<br>
><br>
> You may not see the same sort of improvement for this deployment, mainly<br>
> because my test case was tailored to illustrate how port groups help.<br>
> There may be other factors in this deployment that complicate flow<br>
> processing.<br>
><br>
> >><br>
> >> This particular deployment's configuration may give us a good scenario<br>
> >> for our testing to improve lflow processing time.<br>
> > Absolutely!<br>
> >><br>
> >>><br>
> >>> But to fix this issue urgently, we are thinking of the below approach.<br>
> >>><br>
> >>> - pinctrl_thread will locally cache the mac_binding entries (just<br>
like<br>
> >>> it caches the dns entries). (Please note pinctrl_thread can not access<br>
> >>> the SB DB IDL).<br>
> >><br>
> >>><br>
> >>> - Upon receiving any arp packet (via the put_arp action),<br>
pinctrl_thread<br>
> >>> will check the local mac_binding cache and will wake up the main<br>
> >>> ovn-controller thread only if a mac_binding update is required.<br>
> >>><br>
> >>> This approach will solve the issue since the MAC sent by the physical<br>
> >>> switches will not change. So there is no need to wake up<br>
ovn-controller<br>
> >>> main thread.<br>
> >><br>
> >> I think this can work well. We have a lot of what's needed already in<br>
> >> pinctrl at this point. We have the hash table of mac bindings already.<br>
> >> Currently, we flush this table after we write the data to the<br>
southbound<br>
> >> database. Instead, we would keep the bindings in memory. We would need<br>
> >> to ensure that the in-memory MAC bindings eventually get deleted if<br>
they<br>
> >> become stale.<br>
> >><br>
> >>><br>
> >>> In the present master/2.12 these GARPs will not cause this 100% cpu<br>
loop<br>
> >>> issue because incremental processing will not recompute flows.<br>
> >><br>
> >> Another mitigating factor for master is something I'm currently working<br>
> >> on. I've got the beginnings of a patch series going where I am<br>
> >> separating pinctrl into a separate process from ovn-controller:<br>
> >> <a href="https://github.com/putnopvut/ovn/tree/pinctrl_process" rel="noreferrer" target="_blank">https://github.com/putnopvut/ovn/tree/pinctrl_process</a><br>
> >><br>
> >> It's in the early stages right now, so please don't judge :)<br>
> >><br>
> >> Separating pinctrl to its own process means that it cannot directly<br>
> >> cause ovn-controller to wake up like it currently might.<br>
> >><br>
> >>><br>
> >>> Even though the above approach is not really required for<br>
master/2.12, I<br>
> >>> think it is still Ok to have this as there is no harm.<br>
> >>><br>
> >>> I would like to know your comments and any concerns if any.<br>
> >><br>
> >> Hm, I don't really understand why we'd want to put this in master/2.12<br>
> >> if the problem doesn't exist there. The main concern I have is with<br>
> >> regards to cache lifetime. I don't want to introduce potential memory<br>
> >> growth concerns into a branch if it's not necessary.<br>
> >><br>
> >> Is there a way for us to get this included in 2.9-2.11 without having<br>
to<br>
> >> put it in master or 2.12? It's hard to classify this as a bug fix,<br>
> >> really, but it does prevent unwanted behavior in real-world setups.<br>
> >> Could we get an opinion from committers on this?<br>
> >><br>
> >>><br>
> >>> Thanks<br>
> >>> Numan<br>
> >>><br>
> >>><br>
> >>> _______________________________________________<br>
> >>> discuss mailing list<br>
> >>> <a href="mailto:discuss@openvswitch.org" target="_blank">discuss@openvswitch.org</a><br>
> >>> <a href="https://mail.openvswitch.org/mailman/listinfo/ovs-discuss" rel="noreferrer" target="_blank">https://mail.openvswitch.org/mailman/listinfo/ovs-discuss</a><br>
> >>><br>
> >><br>
><br>
> _______________________________________________<br>
> dev mailing list<br>
> <a href="mailto:dev@openvswitch.org" target="_blank">dev@openvswitch.org</a><br>
> <a href="https://mail.openvswitch.org/mailman/listinfo/ovs-dev" rel="noreferrer" target="_blank">https://mail.openvswitch.org/mailman/listinfo/ovs-dev</a><br>
</blockquote></div></div>