[ovs-dev] [PATCH] ovsdb-cs: Avoid unnecessary re-connections when updating remotes.
hzhou at ovn.org
Tue Jun 29 17:29:59 UTC 2021
On Tue, Jun 29, 2021 at 8:43 AM Ben Pfaff <blp at ovn.org> wrote:
> On Tue, Jun 29, 2021 at 12:56:18PM +0200, Ilya Maximets wrote:
> > If a new database server is added to the cluster, or if one of the
> > database servers changed its IP address or port, then you need to
> > update the list of remotes for the client. For example, if a new
> > OVN_Southbound database server is added, you need to update the
> > ovn-remote for the ovn-controller.
> > However, in the current implementation, the ovsdb-cs module always
> > closes the current connection and creates a new one. This can lead
> > to a storm of re-connections if all ovn-controllers are updated
> > simultaneously. They can also start re-downloading the database
> > content, creating even more load on the database servers.
> > Correct this by saving an existing connection if it is still in the
> > list of remotes after the update.
> > The 'reconnect' module will report connection state updates, but that
> > is OK since no real re-connection happened and we only updated the
> > state of a new 'reconnect' instance.
> > If required, re-connection can be forced after the update of remotes
> > with ovsdb_cs_force_reconnect().
> I think one of the goals here was to keep the load balanced as servers
> are added. Maybe that's not a big deal, or maybe it would make sense to
> flip a coin for each of the new servers and switch over to it with
> probability 1/n where n is the number of servers.
A similar load-balancing problem also exists when a server goes down and
then recovers. Connections obviously move away while it is down, but they
won't automatically move back once it has recovered. Apart from the
coin-flipping approach suggested by Ben, I saw a proposal in the past to
provide a CLI command that reconnects a client to a specific server, which
leaves this burden to the CMS/operators. It is not ideal, but it could
still be an alternative way to solve the problem.
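
To make the coin-flip idea concrete, here is a rough sketch of the
per-client decision. This is an illustration only: the helper name
should_switch_to_new_server() is hypothetical and not part of ovsdb-cs,
and it uses plain rand() from the C library rather than any OVS random
helper. Each client would call it once per newly added server, with n
being the total number of servers after the addition.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not part of ovsdb-cs.
 *
 * Returns true with probability of roughly 1/n_servers, so that if every
 * connected client calls it once when a server is added, about 1/n of the
 * clients move to the new server.  rand() % n has a small modulo bias,
 * which is assumed to be acceptable for load spreading. */
static bool
should_switch_to_new_server(unsigned int n_servers)
{
    if (n_servers == 0) {
        /* No servers at all: nothing to switch to. */
        return false;
    }
    return (unsigned int) rand() % n_servers == 0;
}
```

A client that gets 'false' would simply keep its existing connection, which
is the behavior the current patch provides unconditionally.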
I think both approaches have their pros and cons. The smart way doesn't
require human intervention in theory, but when operating at scale people
usually want to be cautious and have more control over the changes. For
example, they may want to add the server to the cluster first, and then
gradually move 1/n of the connections to the new server after a grace period,
or they could be more conservative and only let the new server take new
connections without moving any existing connections. I'd support both
options and let the operators decide according to their requirements.
Regarding the current patch, I think it's better to add a test case to
cover the scenario and confirm that existing connections are not reset.
With that:
Acked-by: Han Zhou <hzhou at ovn.org>