[ovs-discuss] OVN Scale with RAFT: how to make ovn-northd more reliable when RAFT leader unstable

Dumitru Ceara dceara at redhat.com
Thu Jul 16 18:23:31 UTC 2020


On 7/15/20 8:02 PM, Winson Wang wrote:
> +add ovn-Kubernetes group.
> 
> Hi Dumitru,
> 
> With the recent patches from you and Han, the numbers for a basic k8s
> workload, such as node resources and pod resources, are now fixed and
> look good.
> Many thanks!

Hi Winson,

Glad to hear that!

> 
> It is very common for a k8s workload to be exposed as a service IP, for
> example the coreDNS deployment.
> At a large cluster size such as 1000 nodes, a service auto scales the
> coreDNS deployment; with the default of one coredns per 16 nodes, that
> can mean 63 coredns pods.
> On my 1006-node setup, the coreDNS deployment scaled from 2 to 63 pods.
> An SB raft election timer of 16s is not enough for this operation in my
> test environment: one raft node cannot finish the election within two
> election slots, all of its clients disconnect and reconnect to the two
> other raft nodes, and the raft clients end up in an unbalanced state
> after the operation.
> It would be good if this condition could be avoided without a larger
> election timer.
> 
> On the SB and worker node resource side:
> SB DB size increased by 27MB.
> br-int open flows increased by around 369K.
> RSS memory of (ovs + ovn-controller) increased by more than 600MB.

This increase on the hypervisor side is most likely because of the
OpenFlow rules for hairpin traffic for VIPs (service IPs). To confirm,
would it be possible to take a snapshot of the OVS flow table and see
how many flows there are per table?
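
For example, something along these lines should give a per-table flow
count (just a sketch, assuming the integration bridge is br-int as above):

    # count OpenFlow rules per table on the integration bridge
    ovs-ofctl dump-flows br-int | grep -o 'table=[0-9]*' | sort | uniq -c | sort -rn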

> 
> So if the OVN experts can figure out how to optimize this, I think it
> would be a great help for scaling ovn-k8s up to large cluster sizes.
> 

If the above is due to the LB flows that handle hairpin traffic, the
only idea I have is to use the OVS "learn" action to have the flows
generated on demand. However, I haven't had the chance to try it out yet.
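
For the record, the rough idea would be a flow along these lines (a
minimal sketch only; the table numbers, register bit and timeout below
are made up for illustration, this is not what ovn-controller installs
today):

    table=42, ip, actions=learn(table=43, priority=100, idle_timeout=60,
        dl_type=0x0800, NXM_OF_IP_SRC[]=NXM_OF_IP_DST[], NXM_OF_IP_DST[],
        load:1->NXM_NX_REG10[7]), resubmit(,43)

i.e. when a packet has been load balanced to a backend, learn a narrow
flow that marks later packets with ip.src == ip.dst == that backend as
hairpinned, instead of pre-installing one flow per backend up front.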

Thanks,
Dumitru

> 
> Regards,
> Winson
> 
> 
> On Fri, May 1, 2020 at 1:35 AM Dumitru Ceara <dceara at redhat.com> wrote:
> 
>     On 5/1/20 12:00 AM, Winson Wang wrote:
>     > Hi Han,  Dumitru,
>     >
> 
>     Hi Winson,
> 
>     > With the fix from Dumitru
>     >
>     https://github.com/ovn-org/ovn/commit/97e82ae5f135a088c9e95b49122d8217718d23f4
>     >
>     > It greatly reduces the OVN SB RAFT workload in my stress test mode
>     > with a k8s svc with many endpoints.
>     >
>     > The DB file size increases much less with the fix, so it no longer
>     > triggers the leader election under the same workload.
>     >
>     > Dumitru, based on my test, the number of logical flows is fixed for
>     > a given cluster size regardless of the number of VIP endpoints.
> 
>     The number of logical flows will be fixed based on the number of VIPs
>     (2 per VIP), but the size of the match expression depends on the number
>     of backends per VIP, so the SB DB size will still increase when adding
>     backends to existing VIPs.
> 
>     >
>     > But the open flow count on each node still depends on the number of
>     > endpoints.
> 
>     Yes, this is due to the match expression in the logical flow above,
>     which is of the form:
>
>     (ip.src == backend-ip1 && ip.dst == backend-ip1) || .. ||
>     (ip.src == backend-ipn && ip.dst == backend-ipn)
> 
>     This will get expanded to n openflow rules, one per backend, to
>     determine if traffic was hairpinned.
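
(For illustration only, with hypothetical backend addresses: a VIP with
three backends ends up with three OpenFlow matches of roughly this shape,
each marking the packet as hairpinned:

    ip, nw_src=10.244.0.5, nw_dst=10.244.0.5
    ip, nw_src=10.244.1.7, nw_dst=10.244.1.7
    ip, nw_src=10.244.2.9, nw_dst=10.244.2.9

so the count grows linearly with the number of backends.)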
> 
>     > Any idea how to reduce the open flow count on each node's br-int?
>     >
>     >
> 
>     Unfortunately I don't think there's a cheaper way to determine whether
>     traffic was hairpinned, because I don't think we can have OpenFlow
>     rules that match on "ip.src == ip.dst". So in the worst case we will
>     probably still need two OpenFlow rules per backend IP (one for
>     initiator traffic, one for the reply).
> 
>     I'll think more about it though.
> 
>     Regards,
>     Dumitru
> 
>     > Regards,
>     > Winson
>     >
>     >
>     >
>     >
>     >
>     >
>     >
>     > On Wed, Apr 29, 2020 at 1:42 PM Winson Wang
>     > <windson.wang at gmail.com> wrote:
>     >
>     >     Hi Han,
>     >
>     >     Thanks for quick reply.
>     >     Please see my reply below.
>     >
>     >     On Wed, Apr 29, 2020 at 12:31 PM Han Zhou <hzhou at ovn.org> wrote:
>     >
>     >
>     >
>     >         On Wed, Apr 29, 2020 at 10:29 AM Winson Wang
>     >         <windson.wang at gmail.com> wrote:
>     >         >
>     >         > Hello Experts,
>     >         >
>     >         > I am doing stress testing of a k8s cluster with OVN. One
>     >         > thing I am seeing is that when the raft nodes receive a
>     >         > large update from ovn-northd in a short time, the 3 raft
>     >         > nodes trigger voting and the leader role switches from one
>     >         > node to another.
>     >         >
>     >         > On the ovn-northd side, I can see ovn-northd going through
>     >         > BACKOFF, RECONNECT...
>     >         >
>     >         > Since ovn-northd connects to the NB/SB leader only, how can
>     >         > we make ovn-northd more available most of the time?
>     >         >
>     >         > Is it possible to make ovn-northd keep established
>     >         > connections to all raft nodes, to avoid the reconnect
>     >         > mechanism? The backoff time of 8s is not configurable for
>     >         > now.
>     >         >
>     >         >
>     >         > Test logs:
>     >         >
>     >         >
>     >         > 2020-04-29T17:03:08.296Z|41861|ovsdb_idl|INFO|tcp:10.0.2.152:6642: clustered database server is not cluster leader; trying another server
>     >         > 2020-04-29T17:03:08.296Z|41862|reconnect|DBG|tcp:10.0.2.152:6642: entering RECONNECT
>     >         > 2020-04-29T17:03:08.304Z|41863|reconnect|DBG|tcp:10.0.2.152:6642: entering BACKOFF
>     >         > 2020-04-29T17:03:09.708Z|41867|coverage|INFO|Dropped 2 log messages in last 78 seconds (most recently, 71 seconds ago) due to excessive rate
>     >         > 2020-04-29T17:03:09.708Z|41868|coverage|INFO|Skipping details of duplicate event coverage for hash=ceada91f
>     >         > 2020-04-29T17:03:16.304Z|41869|reconnect|DBG|tcp:10.0.2.153:6642: entering CONNECTING
>     >         > 2020-04-29T17:03:16.308Z|41870|reconnect|INFO|tcp:10.0.2.153:6642: connected
>     >         > 2020-04-29T17:03:16.308Z|41871|reconnect|DBG|tcp:10.0.2.153:6642: entering ACTIVE
>     >         > 2020-04-29T17:03:16.308Z|41872|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     >         > 2020-04-29T17:03:16.309Z|41873|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     >         > 2020-04-29T17:03:16.311Z|41874|ovsdb_idl|INFO|tcp:10.0.2.153:6642: clustered database server is disconnected from cluster; trying another server
>     >         > 2020-04-29T17:03:16.311Z|41875|reconnect|DBG|tcp:10.0.2.153:6642: entering RECONNECT
>     >         > 2020-04-29T17:03:16.312Z|41876|reconnect|DBG|tcp:10.0.2.153:6642: entering BACKOFF
>     >         > 2020-04-29T17:03:24.316Z|41877|reconnect|DBG|tcp:10.0.2.151:6642: entering CONNECTING
>     >         > 2020-04-29T17:03:24.321Z|41878|reconnect|INFO|tcp:10.0.2.151:6642: connected
>     >         > 2020-04-29T17:03:24.321Z|41879|reconnect|DBG|tcp:10.0.2.151:6642: entering ACTIVE
>     >         > 2020-04-29T17:03:24.321Z|41880|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     >         > 2020-04-29T17:03:24.354Z|41881|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     >         > 2020-04-29T17:03:24.358Z|41882|ovsdb_idl|INFO|tcp:10.0.2.151:6642: clustered database server is not cluster leader; trying another server
>     >         > 2020-04-29T17:03:24.358Z|41883|reconnect|DBG|tcp:10.0.2.151:6642: entering RECONNECT
>     >         > 2020-04-29T17:03:24.360Z|41884|reconnect|DBG|tcp:10.0.2.151:6642: entering BACKOFF
>     >         > 2020-04-29T17:03:32.367Z|41885|reconnect|DBG|tcp:10.0.2.152:6642: entering CONNECTING
>     >         > 2020-04-29T17:03:32.372Z|41886|reconnect|INFO|tcp:10.0.2.152:6642: connected
>     >         > 2020-04-29T17:03:32.372Z|41887|reconnect|DBG|tcp:10.0.2.152:6642: entering ACTIVE
>     >         > 2020-04-29T17:03:32.372Z|41888|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     >         > 2020-04-29T17:03:32.373Z|41889|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     >         > 2020-04-29T17:03:32.376Z|41890|ovsdb_idl|INFO|tcp:10.0.2.152:6642: clustered database server is not cluster leader; trying another server
>     >         > 2020-04-29T17:03:32.376Z|41891|reconnect|DBG|tcp:10.0.2.152:6642: entering RECONNECT
>     >         > 2020-04-29T17:03:32.378Z|41892|reconnect|DBG|tcp:10.0.2.152:6642: entering BACKOFF
>     >         > 2020-04-29T17:03:40.381Z|41893|reconnect|DBG|tcp:10.0.2.153:6642: entering CONNECTING
>     >         > 2020-04-29T17:03:40.385Z|41894|reconnect|INFO|tcp:10.0.2.153:6642: connected
>     >         > 2020-04-29T17:03:40.385Z|41895|reconnect|DBG|tcp:10.0.2.153:6642: entering ACTIVE
>     >         > 2020-04-29T17:03:40.385Z|41896|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
>     >         > 2020-04-29T17:03:40.385Z|41897|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
>     >         >
>     >         >
>     >         > --
>     >         > Winson
>     >
>     >         Hi Winson,
>     >
>     >         Since northd writes heavily to the SB DB, it is implemented
>     >         to connect to the leader only, for better performance (to
>     >         avoid the extra cost of a follower forwarding writes to the
>     >         leader). When a leader re-election happens, it has to
>     >         reconnect to the new leader. However, if the cluster is
>     >         unstable, this step can also take longer than expected. I'd
>     >         suggest tuning the election timer to avoid re-elections
>     >         during heavy operations.
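
(For reference, a sketch of how the SB election timer can be checked and
raised. The ctl socket path below is an assumption and varies by
installation; the value is in milliseconds; and, as far as I know, the
command has to be run on the leader and can only increase the timer
gradually, roughly doubling it per call.)

    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 20000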
>     >
>     >     I can see that setting the election timer to a higher value
>     >     avoids this, but if more stress is generated I see it happen
>     >     again.
>     >     A real workload may not hit the spike of stress I trigger in the
>     >     stress test, so this is just for scale profiling.
>     >
>     >
>     >
>     >         If the server is overloaded for too long and a longer
>     >         election timer is unacceptable, the only way to solve the
>     >         availability problem is to improve ovsdb performance. How big
>     >         is your transaction and what's your election timer setting?
>     >
>     >     I can see ovn-northd send 33MB of data in a short time, and the
>     >     ovsdb-servers need to sync it to their clients. Running iftop on
>     >     the ovn-controller side, each node receives around a 25MB update.
>     >     With each ovn-controller getting 25MB of data, the 3 raft nodes
>     >     send about 25MB * 646 ~= 16GB in total.
>     >
>     >
>     >         The number of clients also impacts the performance since the
>     >         heavy update needs to be synced to all clients. How many
>     >         clients do you have?
>     >
>     >     Is there a mechanism for all the ovn-controller clients to
>     >     connect to the raft followers only and skip the leader?
>     >     That would leave the leader node more CPU resources for voting
>     >     and cluster-level sync.
>     >     Based on my stress test, after the ovn-controllers connected to
>     >     the 2 follower nodes, the leader node only had ovn-northd
>     >     connected to it.
>     >     In this model the raft voting finishes in a shorter time when
>     >     ovn-northd triggers the same workload.
>     >
>     >     The total number of clients is 646 nodes.
>     >     Before the leader role change, all clients were connected to the
>     >     3 nodes in a balanced way; each raft node had 200+ connections.
>     >     After the leader role change, the ovn-controller side gets the
>     >     following messages:
>     >
>     >     2020-04-29T04:21:14.566Z|00674|ovsdb_idl|INFO|tcp:10.0.2.153:6642: clustered database server is disconnected from cluster; trying another server
>     >
>     >     Node 10.0.2.153:
>     >
>     >     SB role changed from follower to candidate on 21:21:06
>     >
>     >     SB role changed from candidate to leader on 21:22:16
>     >
>     >     netstat for 6642 port connections:
>     >
>     >     21:21:31  ESTABLISHED 202   Pending 0
>     >     21:21:41  ESTABLISHED 0     Pending 0
>     >
>     >
>     >     The above node was in the candidate role for more than 60s,
>     >     which is more than my election timer setting of 30s.
>     >
>     >     All 202 connections of this node (10.0.2.153) shifted to the
>     >     other two nodes in a short time. After that, only ovn-northd was
>     >     connected to this node.
>     >
>     >
>     >     Node 10.0.2.151:
>     >
>     >     SB role changed from leader to follower on 21:21:23
>     >
>     >
>     >     21:21:35  ESTABLISHED 233   Pending 0
>     >     21:21:45  ESTABLISHED 282   Pending 9
>     >     21:21:55  ESTABLISHED 330   Pending 1
>     >     21:22:05  ESTABLISHED 330   Pending 1
>     >
>     >
>     >
>     >     Node 10.0.2.152:
>     >
>     >     SB role changed from follower to candidate on 21:21:57
>     >
>     >     SB role changed from candidate to follower on 21:22:17
>     >
>     >
>     >     21:21:35  ESTABLISHED 211   Pending 0
>     >     21:21:45  ESTABLISHED 263   Pending 5
>     >     21:21:55  ESTABLISHED 316   Pending 0
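
(For reference, counts like the ones above can be collected with a small
loop along these lines; this is only a sketch using net-tools netstat,
and it reproduces just the ESTABLISHED column.)

    while true; do
        printf '%s ESTABLISHED %s\n' "$(date +%H:%M:%S)" \
            "$(netstat -tn | awk '$4 ~ /:6642$/ && $6 == "ESTABLISHED"' | wc -l)"
        sleep 10
    done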
>     >
>     >
>     >
>     >
>     >         Thanks,
>     >         Han
>     >
>     >
>     >
>     >     --
>     >     Winson
>     >
>     >
>     >
>     > --
>     > Winson
> 
> 
> 
> -- 
> Winson


