[ovs-discuss] OVN at scale in production

Seena Fallah seenafallah at gmail.com
Fri Oct 15 10:53:36 UTC 2021


With many projects, where each project has at least 2 security groups and
each security group has 5 ACLs, I don't think this number of ACLs is very
high.
In the OVS scenario, I have 250K ACLs and everything works fine!
Do you think OVN is not ready for this number of ACLs?
I'm switching from OVS to OVN.
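To make the arithmetic above concrete, here is a small Python sketch. It assumes, as a simplification, that every security-group rule becomes exactly one OVN ACL, and it uses the minimum figures from the message (2 security groups per project, 5 ACLs per group). The function names are hypothetical.

```python
import math

# Minimum figures quoted in the message. Treating one security-group rule as
# one OVN ACL is a simplifying assumption, not a statement about every setup.
SGS_PER_PROJECT = 2
ACLS_PER_SG = 5


def acls_for_projects(projects: int) -> int:
    """Lower bound on the ACL count for a given number of projects."""
    return projects * SGS_PER_PROJECT * ACLS_PER_SG


def projects_for_acls(acls: int) -> int:
    """Smallest number of projects that would reach `acls` ACLs."""
    return math.ceil(acls / (SGS_PER_PROJECT * ACLS_PER_SG))


if __name__ == "__main__":
    for total in (17_000, 40_000, 100_000, 250_000):
        print(f"{total:>7} ACLs ~ {projects_for_acls(total):>6} projects")
```

Under these assumptions, 250K ACLs corresponds to about 25,000 projects, and 100k ACLs to about 10,000.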

On Fri, Oct 15, 2021 at 4:41 AM Han Zhou <hzhou at ovn.org> wrote:

>
>
> On Thu, Oct 14, 2021 at 7:25 AM Seena Fallah <seenafallah at gmail.com>
> wrote:
>
>> It's mostly on the NB.
>>
> I am surprised, since we usually don't see any scale problems on the NB DB
> servers: the SB data size is usually much bigger, and the SB also has many
> more clients than the NB DB. So if there are scale problems, they would
> normally show up on the SB before the NB hits any limit.
> You are probably seeing the NB scale problem but not an SB one because
> ovn-northd couldn't even translate the NB data to SB yet, due to the NB
> problem you hit. I'd suggest starting at a smaller scale, making sure it
> works end to end, and then enlarging it gradually; then you would see the
> real limit.
> 100k ACLs sounds scary to me. Usually the number of ACLs is not so big, but
> each ACL could reference big address-sets and port-groups. Could you give
> more details about your topology and what your typical ACLs look like?
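The point that ACL count and ACL size are different things can be illustrated with a small Python sketch that builds OVN match strings for one security-group rule whose remote group has N members. The names `pg_sg1` and `as_sg1_ip4` are hypothetical; `@name` and `$name` are OVN's match syntax for referencing a port group and an address set. The sketch only builds strings and does not talk to OVN.

```python
# Hypothetical names: pg_sg1 is a port group for the security group's ports,
# as_sg1_ip4 an address set holding the remote group's IPv4 addresses.
PORT_GROUP = "pg_sg1"
ADDRESS_SET = "as_sg1_ip4"


def enumerated_matches(remote_ips):
    """One ACL match per remote IP: the ACL count grows with the group size."""
    return [
        f"outport == @{PORT_GROUP} && ip4.src == {ip} && tcp.dst == 22"
        for ip in remote_ips
    ]


def address_set_match():
    """A single ACL match that references the address set instead."""
    return f"outport == @{PORT_GROUP} && ip4.src == ${ADDRESS_SET} && tcp.dst == 22"


if __name__ == "__main__":
    members = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
    print(len(enumerated_matches(members)), "ACLs without an address set")
    print("1 ACL with an address set of", len(members), "addresses:")
    print(address_set_match())
```

With the address set, the ACL count stays at one while the member data moves into the address set, so a deployment can have few ACLs and still carry a lot of data behind them.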
>
>
>> Yes, I had already set that value to 60000, but it didn't help!
>>
>> On Sun, Oct 10, 2021 at 10:34 PM Han Zhou <hzhou at ovn.org> wrote:
>>
>>>
>>>
>>> On Sat, Oct 9, 2021 at 12:02 PM Seena Fallah <seenafallah at gmail.com>
>>> wrote:
>>> >
>>> > Also, I get many logs like this in OVN:
>>> >
>>> > 2021-10-09T18:54:45.263Z|01151|jsonrpc|WARN|Dropped 6 log messages in last 8 seconds (most recently, 3 seconds ago) due to excessive rate
>>> > 2021-10-09T18:54:45.263Z|01152|jsonrpc|WARN|tcp:10.0.0.1:44454: receive error: Connection reset by peer
>>> > 2021-10-09T18:54:45.263Z|01153|reconnect|WARN|tcp:10.0.01:44454: connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:46.798Z|01154|reconnect|WARN|tcp:10.0.0.2:50224: connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:49.127Z|01155|reconnect|WARN|tcp:10.0.0.3:48514: connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:51.241Z|01156|reconnect|WARN|tcp:10.0.0.3:48544: connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:53.005Z|01157|reconnect|WARN|tcp:10.0.0.3:48846: connection dropped (Connection reset by peer)
>>> > 2021-10-09T18:54:53.246Z|01158|reconnect|WARN|tcp:10.0.0.3:48796: connection dropped (Connection reset by peer)
>>> >
>>> > What does "excessive rate" mean? How many req/s counts as an excessive
>>> > rate?
>>>
>>> Don't worry about "excessive rate"; it refers to the rate limit on the
>>> log messages themselves.
>>> The "connection reset by peer" indicates that the client-side inactivity
>>> probe is enabled and the client disconnects when the server hasn't
>>> responded for a while.
>>> Which server is this, NB or SB? Usually the SB DB has this problem when
>>> there are lots of nodes and the inactivity probe is not adjusted on the
>>> nodes (ovn-controllers). Try this on each node:
>>> ---
>>> ovs-vsctl set open . external_ids:ovn-remote-probe-interval=100000
>>> ---
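A simplified Python model of the client-side inactivity probe, based on my understanding of the Open vSwitch reconnect logic: after `probe_interval` milliseconds without receiving anything, the client sends an echo request, and if another `probe_interval` passes with no reply, it drops the connection. A value of 0 disables probing. The function name is hypothetical, and real timing also depends on when the client's main loop wakes up.

```python
def disconnects(probe_interval_ms: int, server_stall_ms: int) -> bool:
    """Return True if a server that stays silent for server_stall_ms would be
    dropped by a client probing every probe_interval_ms (simplified model)."""
    if probe_interval_ms == 0:
        # 0 disables the inactivity probe entirely.
        return False
    # Idle for one interval -> echo request sent; silent for another -> drop.
    return server_stall_ms >= 2 * probe_interval_ms


if __name__ == "__main__":
    stall = 150_000  # a server busy for 150 s (illustrative figure only)
    for interval in (5_000, 60_000, 100_000, 0):
        outcome = "dropped" if disconnects(interval, stall) else "kept"
        print(f"{interval} ms -> {outcome}")
```

Because the setting is in milliseconds, in this model `100000` lets a client tolerate roughly 200 s of server silence before dropping the connection.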
>>>
>>> >
>>> > On Thu, Oct 7, 2021 at 12:46 AM Seena Fallah <seenafallah at gmail.com>
>>> wrote:
>>> >>
>>> >> It seems most of the leader failures are on the NB, and the command you
>>> >> suggested is for the SB.
>>> >>
>>> >> Do you have any benchmarks of how many ACLs OVN can normally handle?
>>> >> I see many failures after 100k ACLs.
>>> >>
>>> >> On Thu, Oct 7, 2021 at 12:14 AM Numan Siddique <numans at ovn.org>
>>> wrote:
>>> >>>
>>> >>> On Wed, Oct 6, 2021 at 2:49 PM Seena Fallah <seenafallah at gmail.com>
>>> wrote:
>>> >>> >
>>> >>> > I'm using these versions in a CentOS container:
>>> >>> > ovsdb-server (Open vSwitch) 2.15.2
>>> >>> > ovn-nbctl 21.06.0
>>> >>> > Open vSwitch Library 2.15.90
>>> >>> > DB Schema 5.32.0
>>> >>> >
>>> >>> > Today I saw the election time out as well, so I had to increase the
>>> >>> > ovsdb election timeout too. I looked through the commits but didn't
>>> >>> > find any change related to my problem.
>>> >>> > If I use OVN 21.09 with ovsdb 2.16, is there still any need to
>>> >>> > increase the election timeout and disable the inactivity probe?
>>> >>>
>>> >>> Not sure on that.  It's worth a try if you have a test environment.
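For the election timeout, the clustered ovsdb-server exposes `cluster/change-election-timer` through `ovs-appctl`. As far as I understand its documentation, a single change may at most double the current value, so large increases have to be done in steps. The hypothetical Python helper below only computes such a step sequence; the doubling limit and the 600000 ms ceiling are assumptions to verify against the ovsdb-server manual for your version.

```python
def election_timer_steps(current_ms: int, target_ms: int, max_ms: int = 600_000):
    """Values to apply, in order, so that no single change exceeds 2x the
    value in effect before it (assumed limit; check your ovsdb-server docs)."""
    if not 0 < current_ms <= target_ms <= max_ms:
        raise ValueError("need 0 < current <= target <= max")
    steps = []
    value = current_ms
    while value < target_ms:
        # Double, but never overshoot the target.
        value = min(value * 2, target_ms)
        steps.append(value)
    return steps


if __name__ == "__main__":
    # A 1000 ms starting point is an assumption for illustration.
    for step in election_timer_steps(1_000, 16_000):
        print(step)
```

Each value could then be applied with something like `ovs-appctl -t <ctl socket> cluster/change-election-timer OVN_Northbound <value>`; the control socket path depends on the deployment and is left as a placeholder.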
>>> >>>
>>> >>> > Also, is there any limit on the number of ACLs that OVN can handle?
>>> >>>
>>> >>> I don't think there is any limit on the number of ACLs. In general,
>>> >>> we have seen issues as the size of the SB DB increases.
>>> >>>
>>> >>> Can you run the command below on each node where ovn-controller
>>> >>> runs and see if that helps?
>>> >>>
>>> >>> ---
>>> >>> ovs-vsctl set open . external_ids:ovn-monitor-all=true
>>> >>> ---
>>> >>>
>>> >>> Thanks
>>> >>> Numan
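The two chassis-side settings suggested in this thread, `ovn-monitor-all` and `ovn-remote-probe-interval`, are both plain `ovs-vsctl set open . external_ids:...` commands, so applying them on many nodes is easy to script. The Python sketch below builds the argument list and, by default, only prints it; the helper names and the dry-run default are my own choices, and it assumes `ovs-vsctl` is on PATH when actually run.

```python
import shlex
import subprocess

# Settings taken from the suggestions in this thread.
CHASSIS_SETTINGS = {
    "ovn-monitor-all": "true",
    "ovn-remote-probe-interval": "100000",  # milliseconds
}


def ovs_vsctl_args(settings):
    """Build one ovs-vsctl invocation that sets all keys in external_ids."""
    args = ["ovs-vsctl", "set", "open", "."]
    args += [f"external_ids:{key}={value}" for key, value in settings.items()]
    return args


def apply_settings(settings, dry_run=True):
    """Print the command (dry run) or execute it on the local node."""
    args = ovs_vsctl_args(settings)
    if dry_run:
        print(shlex.join(args))
    else:
        subprocess.run(args, check=True)
    return args


if __name__ == "__main__":
    apply_settings(CHASSIS_SETTINGS)
```

Running it on a node prints `ovs-vsctl set open . external_ids:ovn-monitor-all=true external_ids:ovn-remote-probe-interval=100000`; pass `dry_run=False` to execute it.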
>>> >>>
>>> >>>
>>> >>> >
>>> >>> > Thanks.
>>> >>> >
>>> >>> > On Wed, Oct 6, 2021 at 9:43 PM Numan Siddique <numans at ovn.org>
>>> wrote:
>>> >>> >>
>>> >>> >> On Wed, Oct 6, 2021 at 12:15 PM Seena Fallah <seenafallah at gmail.com> wrote:
>>> >>> >> >
>>> >>> >> > Hi,
>>> >>> >> >
>>> >>> >> > I use OVN as the OpenStack Neutron plugin in production. After a
>>> >>> >> > few days, I saw the ovsdb cluster losing its leader. It seemed to
>>> >>> >> > be caused by the failing inactivity probe and by having 17k ACLs.
>>> >>> >> > After I disabled the inactivity probe it worked fine, but when I
>>> >>> >> > ran a scale test (about 40k ACLs), the leader failed again.
>>> >>> >> > I saw many documents about OVN scale issues raised by both Red Hat
>>> >>> >> > and eBay, and it seems the solution is to rewrite OVN with DDlog. I
>>> >>> >> > tried northd-ddlog, but nothing changed.
>>> >>> >> >
>>> >>> >> > My question is: should I wait for OVN to become stable at high
>>> >>> >> > scale, or is there some tuning I'm missing in my deployment?
>>> >>> >> > Also, will ovn-nb/sb be rewritten with DDlog, and could that help
>>> >>> >> > with these issues at high scale? If so, is there an expected date?
>>> >>> >>
>>> >>> >> What ovsdb-server version are you using? There are many
>>> >>> >> improvements in ovsdb-server 2.16; maybe they would help in your
>>> >>> >> deployment. There were also many improvements in OVN 21.09,
>>> >>> >> if you want to test it out.
>>> >>> >>
>>> >>> >> Thanks
>>> >>> >> Numan
>>> >>> >>
>>> >>> >> >
>>> >>> >> > Thanks.
>>> >>> >> > _______________________________________________
>>> >>> >> > discuss mailing list
>>> >>> >> > discuss at openvswitch.org
>>> >>> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>>> >>> >
>>> >
>>>
>>

