[ovs-discuss] ovn-k8s scale: how to make new ovn-controller process keep the previous OpenFlow flows in br-int

Han Zhou zhouhan at gmail.com
Fri Aug 7 18:46:08 UTC 2020


On Thu, Aug 6, 2020 at 10:22 AM Han Zhou <zhouhan at gmail.com> wrote:

>
>
> On Thu, Aug 6, 2020 at 9:15 AM Numan Siddique <numans at ovn.org> wrote:
>
>>
>>
>> On Thu, Aug 6, 2020 at 9:25 PM Venugopal Iyer <venugopali at nvidia.com>
>> wrote:
>>
>>> Hi, Han:
>>>
>>>
>>>
>>> A comment inline:
>>>
>>>
>>>
>>> From: ovn-kubernetes at googlegroups.com <ovn-kubernetes at googlegroups.com>
>>> On Behalf Of Han Zhou
>>> Sent: Wednesday, August 5, 2020 3:36 PM
>>> To: Winson Wang <windson.wang at gmail.com>
>>> Cc: ovs-discuss at openvswitch.org; ovn-kubernetes at googlegroups.com;
>>> Dumitru Ceara <dceara at redhat.com>; Han Zhou <hzhou at ovn.org>
>>> Subject: Re: ovn-k8s scale: how to make new ovn-controller process
>>> keep the previous OpenFlow flows in br-int
>>>
>>> On Wed, Aug 5, 2020 at 12:58 PM Winson Wang <windson.wang at gmail.com>
>>> wrote:
>>>
>>> Hello OVN Experts,
>>>
>>>
>>> With ovn-k8s, the flows on br-int that are needed by the pods running
>>> on the k8s node must stay installed at all times.
>>>
>>> Is there an ongoing project to address this problem?
>>>
>>> If not, I have a proposal, though I am not sure whether it is doable.
>>>
>>> Please share your thoughts.
>>> The issue:
>>>
>>> In a large-scale ovn-k8s cluster there are 200K+ OpenFlow flows on br-int
>>> on every k8s node.  When we restart ovn-controller for an upgrade using
>>> `ovs-appctl -t ovn-controller exit --restart`,  existing traffic still
>>> works fine, since the flows remain installed on br-int.
>>>
>>>
>>>
>>> However, when the new ovn-controller starts, it connects to the OVS IDL
>>> and does an engine init run, clearing all OpenFlow flows and installing
>>> flows computed from the SB DB.
>>>
>>> With a flow count above 200K, it took more than 15 seconds to get all the
>>> flows installed on the br-int bridge again.
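>>>
>>> (For illustration, the restart sequence and the observable gap can be
>>> reproduced with standard tooling; a rough sketch of the steps described
>>> above:
>>>
>>>     # Graceful exit: ovn-controller leaves the br-int flows in place.
>>>     ovs-appctl -t ovn-controller exit --restart
>>>     # Start the upgraded ovn-controller; today it clears br-int and
>>>     # reinstalls all flows computed from the SB DB.
>>>     ovn-controller --detach
>>>     # Watch the flow count drop to 0 and then climb back up.
>>>     ovs-ofctl dump-aggregate br-int
>>> )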
>>>
>>>
>>> Proposed solution for the issue:
>>>
>>> When ovn-controller gets “exit --restart”, it will write an
>>> “ovs-cond-seqno” to the OVS IDL and store the value in the external-ids
>>> column of the Open_vSwitch table. When the new ovn-controller starts, it
>>> will check whether “ovs-cond-seqno” exists in the Open_vSwitch table, and
>>> compare it against the seqno from the OVS IDL to decide whether it must
>>> force a recompute.
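>>>
>>> (Roughly, the proposed bookkeeping would look like this;
>>> "ovs-cond-seqno" is a proposed key, not an existing one, and 1234 is a
>>> placeholder value:
>>>
>>>     # On "exit --restart": stash the last-seen IDL sequence number.
>>>     ovs-vsctl set Open_vSwitch . external-ids:ovs-cond-seqno=1234
>>>     # On startup: read it back and compare with the current IDL seqno.
>>>     ovs-vsctl get Open_vSwitch . external-ids:ovs-cond-seqno
>>> )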
>>>
>>>
>>>
>>>
>>>
>>> Hi Winson,
>>>
>>>
>>>
>>> Thanks for the proposal. Yes, the connection break during upgrading is a
>>> real issue in a large-scale environment. However, the proposal doesn't
>>> work: the "ovs-cond-seqno" belongs to the OVSDB IDL connection to the
>>> local conf DB, which is a completely different connection from the
>>> OpenFlow connection to ovs-vswitchd.
>>>
>>> To avoid clearing the OpenFlow table during ovn-controller startup, we
>>> can find a way to postpone clearing the OVS flows until the recomputing
>>> in ovn-controller has completed, right before ovn-controller replaces
>>> them with the new flows.
>>>
>>> [vi> ] Seems like we force a recompute today if the OVS IDL is
>>> reconnected. Would it be possible to defer the decision to recompute the
>>> flows based on the SB's nb_cfg we have sync'd with? I.e., if our nb_cfg
>>> is in sync with the SB's global nb_cfg, we can skip the recompute? At
>>> least if nothing has changed since the restart, we won't need to do
>>> anything. We could stash nb_cfg in OVS (once ovn-controller receives
>>> confirmation from OVS that the physical flows for an nb_cfg update are
>>> in place), which should be cleared if OVS itself is restarted. (I mean,
>>> currently nb_cfg is used to check whether NB, SB and Chassis are in
>>> sync; we could extend this to OVS/physical flows?)
>>>
>>> Have not thought through this though, so maybe I am missing something...
>>>
>>> Thanks,
>>>
>>> -venu
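>>>
>>> (A sketch of the stash-and-compare idea; "ovn-nb-cfg-installed" is a
>>> hypothetical key that ovn-controller would write once the physical flows
>>> for an nb_cfg update are confirmed in place:
>>>
>>>     # Global nb_cfg as currently seen in the southbound DB.
>>>     ovn-sbctl get SB_Global . nb_cfg
>>>     # nb_cfg last confirmed installed on this chassis (hypothetical).
>>>     ovs-vsctl get Open_vSwitch . external-ids:ovn-nb-cfg-installed
>>>     # If the two match, the recompute after restart could be skipped.
>>> )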
>>>
>>> This should largely reduce the window of broken connectivity during
>>> upgrading. Some changes in the ofctrl module's state machine are
>>> required, but I am not 100% sure whether this approach is applicable. I
>>> need to check more details.
>>>
>>
>>
>> We can also consider whether it is possible to do it the following way:
>>    - When ovn-controller starts, it will not clear the flows; instead it
>> will get the dump of flows from br-int and populate these flows into its
>> installed-flows table.
>>     - Then, when it connects to the SB DB and computes the desired
>> flows, it will sync the installed flows with the desired flows anyway.
>>     - And if there is no difference between the desired flows and the
>> installed flows, there will be no impact on the datapath at all.
>>
>> This would require careful thought and proper handling, though.
>>
>
> Numan, as I responded to Girish, this avoids the time spent on the
> one-time flow installation after restart (the < 10% part of the
> connection-break time), but I think the major problem currently is that
> more than 90% of the time is spent waiting for the computation to finish
> while the OVS flows are already cleared. It is surely an optimization,
> but the most important thing now is to avoid that 90%. I will look at
> postponing the flow clearing first.
>
>

I thought about this again. It seems more complicated than it first
appeared, so let me summarize here:

The connection break time during the upgrade consists of two parts:
1) The time gap between clearing the flows and the start of the flow
installation for the fully computed flows, i.e. waiting for the flow
installation.
2) The time spent on the flow installation itself, which takes several
rounds of the ovn-controller main loop. (I take back my earlier statement
that this contributes only 10% of the total time. According to the log
shared by Girish, it seems at least half of the time is spent here.)

For 1), postponing the flow clearing is the solution, but it is not as easy
as I thought, because there is no easy way to determine whether
ovn-controller has completed the initial computation.
When ovn-controller starts, it initializes the IDL connections with the SB
and local OVS DBs, and sends the initial monitor conditions to the SB DB.
It may take several rounds of receiving SB notifications, updating monitor
conditions, and computing to generate all the required flows. If we replace
the flows in OVS before this is fully complete, we end up with the same
problem. I can't think of an ideal and clean approach to solve this.
However, a "not so good" solution could be to support an option for the
ovn-controller command that delays the clearing of OVS flows. It is then
the operator's job to figure out the best delay, according to the scale of
their environment, to reduce the time gap waiting for the new flow
installation (a sketch of such a knob follows below). This is not an ideal
approach, but I think it should be helpful for upgrading large-scale
environments in practice. Thoughts?
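
(Hypothetically, such a knob could be an external-ids setting that
ovn-controller reads at startup; the key name and unit below are made up
for illustration only:

    # Wait up to 8000 ms after startup before clearing the pre-existing
    # br-int flows, giving the initial computation time to finish.
    ovs-vsctl set Open_vSwitch . external-ids:ovn-ofctrl-wait-before-clear=8000
)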

For 2), Numan's suggestion of syncing back the OVS flows before flow
installation and installing only the delta (without clearing the flows)
seems to be a perfect solution. However, there are some tricky parts that
need to be considered (see the sketch after this list):
1. Apart from the OVS flows, the meter and group tables also need to be
restored.
2. The installed flows in ovn-controller carry some metadata that is not
available from OVS, such as sb_uuid.
3. The syncing itself may add significant extra cost and further delay the
initialization.
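
(For reference, the OpenFlow state that would have to be read back spans
at least three tables; a rough sketch with standard ovs-ofctl commands:

    # Flows, groups and meters currently installed on br-int.
    ovs-ofctl dump-flows br-int
    ovs-ofctl -O OpenFlow15 dump-groups br-int
    ovs-ofctl -O OpenFlow15 dump-meters br-int

Metadata such as sb_uuid cannot be recovered this way, which is tricky
part 2 above.)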

Alternatively, for 2), I think we can probably utilize the "bundle"
operation of OpenFlow to replace the flows in OVS atomically (on the
ovs-vswitchd side), which should avoid the long connection break. I am not
sure which of the two is more applicable yet.
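
(ovs-ofctl already exposes this primitive, which gives a feel for the
semantics, although wiring it into ofctrl is a separate question:

    # Atomically replace the entire flow table of br-int with the flows
    # listed in new-flows.txt, as a single OpenFlow bundle transaction.
    ovs-ofctl --bundle replace-flows br-int new-flows.txt
)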

I'd also like to emphasize that even though the solution for 2) doesn't
clear the flows, it doesn't automatically avoid problem 1), because we
would still need to figure out when the major flow computation is complete
and ready to be installed/synced to OVS. Otherwise, we could replace the
old, huge flow table with a small number of incomplete flows, which still
results in the same connection break.

Thanks,
Han


>> Thanks
>> Numan
>>
>>
>>>
>>> Thanks,
>>>
>>> Han
>>>
>>> Test log:
>>>
>>> Check the flow count on br-int every second:
>>>
>>>
>>>
>>> packet_count=0 byte_count=0 flow_count=0
>>> packet_count=0 byte_count=0 flow_count=0
>>> packet_count=0 byte_count=0 flow_count=0
>>> packet_count=0 byte_count=0 flow_count=0
>>> packet_count=0 byte_count=0 flow_count=0
>>> packet_count=0 byte_count=0 flow_count=0
>>> packet_count=0 byte_count=0 flow_count=10322
>>> packet_count=0 byte_count=0 flow_count=34220
>>> packet_count=0 byte_count=0 flow_count=60425
>>> packet_count=0 byte_count=0 flow_count=82506
>>> packet_count=0 byte_count=0 flow_count=106771
>>> packet_count=0 byte_count=0 flow_count=131648
>>> packet_count=2 byte_count=120 flow_count=158303
>>> packet_count=29 byte_count=1693 flow_count=185999
>>> packet_count=188 byte_count=12455 flow_count=212764
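>>>
>>> (These counters match the output of a per-second aggregate poll, e.g.:
>>>
>>>     while true; do ovs-ofctl dump-aggregate br-int; sleep 1; done
>>> )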
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Winson
>>>
>>

