[ovs-dev] [PATCH] ovsdb-cs: Avoid unnecessary re-connections when updating remotes.

Tue Jun 29 18:05:26 UTC 2021

On Tue, Jun 29, 2021 at 10:29:59AM -0700, Han Zhou wrote:
> On Tue, Jun 29, 2021 at 8:43 AM Ben Pfaff <blp at ovn.org> wrote:
> >
> > On Tue, Jun 29, 2021 at 12:56:18PM +0200, Ilya Maximets wrote:
> > > If a new database server is added to the cluster, or if one of the
> > > database servers changed its IP address or port, then you need to
> > > update the list of remotes for the client.  For example, if a new
> > > OVN_Southbound database server is added, you need to update the
> > > ovn-remote for the ovn-controller.
> > >
> > > However, in the current implementation, the ovsdb-cs module always
> > > closes the current connection and creates a new one.  This can lead
> > > to a storm of re-connections if all ovn-controllers are updated
> > > simultaneously.  They can also start re-downloading the database
> > > content, creating even more load on the database servers.
> > >
> > > Correct this by keeping the existing connection if its remote is
> > > still in the list of remotes after the update.
> > >
> > > The 'reconnect' module will report connection state updates, but that
> > > is OK since no real re-connection happened and we only updated the
> > > state of a new 'reconnect' instance.
> > >
> > > If required, re-connection can be forced after the update of remotes
> > > with ovsdb_cs_force_reconnect().
> >
> > I think one of the goals here was to keep the load balanced as servers
> > are added.  Maybe that's not a big deal, or maybe it would make sense to
> > flip a coin for each of the new servers and switch over to it with
> > probability 1/n where n is the number of servers.
> 
> A similar load-balancing problem exists also when a server is down and then
> recovered. Connections will obviously move away when it is down but they
> won't automatically connect back when it is recovered. Apart from the
> flipping-a-coin approach suggested by Ben, I saw a proposal [0] [1] in the
> past that provides a CLI to reconnect to a specific server which leaves
> this burden to CMS/operators. It is not ideal but still could be an
> alternative to solve the problem.
> 
> I think both approaches have their pros and cons. The smart way doesn't
> require human intervention in theory, but when operating at scale people
> usually want to be cautious and have more control over the changes. For
> example, they may want to add the server to the cluster first, and then
> gradually move 1/n connections to the new server after a grace period,
> or they could be more conservative and only let the new server take new
> connections without moving any existing connections. I'd support both
> options and let the operators decide according to their requirements.
> 
> Regarding the current patch, I think it's better to add a test case to
> cover the scenario and confirm that existing connections didn't reset. With
> that:
> Acked-by: Han Zhou <hzhou at ovn.org>

This seems reasonable; to be sure, I'm not arguing against Ilya's
approach, just trying to explain my recollection of why it was done
this way.
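
For illustration only, the two policies discussed in this thread (keep
the connection while its remote is still listed, and optionally move to
a newly added server with probability 1/n) could be sketched roughly as
below.  This is not code from the patch or from ovsdb-cs: the function
names, the string representation of remotes, and the caller-supplied
'rolls' array are all hypothetical, chosen so the logic is easy to test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an ovsdb-cs function): returns true if
 * 'current' appears in the updated list of 'n' remotes. */
static bool
remote_is_listed(const char *current, const char *const *remotes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(current, remotes[i])) {
            return true;
        }
    }
    return false;
}

/* Hypothetical policy: decides whether to drop the current connection
 * after the list of remotes changes.
 *
 * 'rolls' holds one caller-supplied random number per newly added
 * server ('n_new' entries).  Each roll selects "switch" when it is a
 * multiple of 'n_remotes', i.e. with probability of roughly 1/n. */
static bool
should_reconnect(const char *current, const char *const *remotes,
                 size_t n_remotes, const unsigned int *rolls, size_t n_new)
{
    assert(n_remotes > 0);

    if (!remote_is_listed(current, remotes, n_remotes)) {
        /* The server we are connected to was removed from the list. */
        return true;
    }
    for (size_t i = 0; i < n_new; i++) {
        if (rolls[i] % n_remotes == 0) {
            /* The coin flip for new server 'i' came up "switch". */
            return true;
        }
    }
    return false;
}
```

With n_new set to 0 this reduces to the "never move an existing
connection" behavior; a caller wanting the coin-flip variant would fill
'rolls' from a random number generator such as rand().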