[ovs-dev] [PATCH] ovn pacemaker: Provide the option to configure inactivity probe value

Fri Oct 13 21:26:48 UTC 2017

On Fri, Oct 13, 2017 at 12:06:56PM -0400, Russell Bryant wrote:
> On Fri, Oct 13, 2017 at 8:30 AM, Numan Siddique <nusiddiq at redhat.com> wrote:
> > On Fri, Oct 13, 2017 at 6:05 AM, Andy Zhou <azhou at ovn.org> wrote:
> >
> >> Hi, Numan,
> >>
> >> I am curious why the default 5-second inactivity time does not work. Do
> >> you have more details?
> >>
> >> Does the glitch usually happen around the HA switchover?  If this
> >> happens during normal operation,
> >> then this is not an HA-specific issue, but an indication of some
> >> connectivity issues.
> >>
> >
> > Hi Andy. This happens in an OpenStack deployment when the
> > neutron-server is busy handling lots of API requests.
> > Normally the deployment has 3 controller nodes, and
> > neutron-server runs on each node.  On each controller node,
> > neutron-server starts around 10 - 12 neutron workers (which are separate
> > processes).  The number of API workers is a configuration option; if it is
> > not configured, it normally defaults to the number of cores.
> >
> > I have tested both a physical-node deployment and a virtual deployment (3
> > controllers running as VMs on one node). Around 40 connections are opened to
> > the OVN northbound ovsdb-server by all the neutron workers in the physical
> > deployment, and around 15 connections are opened in the virtual deployment.
> > When neutron-server is loaded with many API requests, I have noticed that
> > ovsdb-server drops the connections when it doesn't get an echo reply within
> > 5 seconds. This leads to a lot of reconnections to the ovsdb-server, and the
> > response from the neutron-server becomes very slow.  With this patch it
> > seems to work fine.
> >
> > The issue is not caused by any network problem but by the large number of
> > connections from the neutron-server workers to the ovsdb-server and the
> > failure of the IDL clients to reply to the echo request within 5 seconds
> > when the neutron-server is loaded.
> 
> We have had to disable the inactivity probe everywhere each time we have
> done performance testing so far.

Really, this seems like a bug (or inadequacy) in ovsdb-server.  It's
pretty sad that ovsdb-server can't reply within 5 seconds (maybe there's
a 2x or 3x multiplier on the response time, I don't recall).  I hope
that the clustered database does better here.

That said, if in the real world we need 60 seconds for now, let's use it
but remember that we should get our act together later.  (Maybe a
comment would be helpful.)
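
To make the trade-off between a 5-second and a 60-second probe concrete,
here is a minimal toy model in Python. It is only a sketch, not OVS code:
the function name, the stall durations, and the "multiplier" grace factor
are all hypothetical, the last standing in for the 2x or 3x factor
mentioned above, whose real value is not confirmed here. It counts how
many client stalls (periods in which a busy worker cannot answer an echo
request) would outlast the server's patience.

```python
def count_drops(stalls, probe_interval, multiplier=1):
    """Count the client stalls that would cause the server to drop the connection.

    stalls:         durations (seconds) during which a client cannot reply
                    to an echo request (made-up values, for illustration)
    probe_interval: inactivity probe interval in seconds
    multiplier:     hypothetical grace factor applied to the interval;
                    an assumption, not a documented ovsdb-server constant
    """
    deadline = probe_interval * multiplier
    # A stall longer than the deadline means no echo reply arrived in time.
    return sum(1 for stall in stalls if stall > deadline)


# Made-up stall durations for a loaded neutron worker.
stalls = [1.0, 4.0, 7.5, 12.0, 30.0, 3.0]

for interval in (5, 60):
    drops = count_drops(stalls, interval)
    print(f"probe interval {interval:>2}s: {drops} dropped connection(s)")
```

In this model a longer interval only hides the stalls; it does not make
the clients any more responsive, which is why the 60-second value is best
treated as a workaround.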