[ovs-dev] [PATCH] Granular link health statistics for cfm.

Wed Apr 4 21:43:23 UTC 2012

Hi Ben

On Wed, Apr 4, 2012 at 2:15 PM, Ben Pfaff <blp at nicira.com> wrote:

> On Wed, Apr 04, 2012 at 11:38:42AM -0700, Mehak Mahajan wrote:
> > The changes display the cfm_health of an interface.  The cfm_health
> > is an exponential weighted moving average of the health of all
> > remote_mpids.  The value can vary from 0 to 100, 100 being very healthy
> > and 0 being unhealthy.
> >
> > Feature #10363
> > Requested-by: Ethan Jackson <ethan at nicira.com>
> > Signed-off-by: Mehak Mahajan <mmahajan at nicira.com>
>
> Is this "health" percentage something that we invented, or is it an
> implementation of a standard, etc.?
>
> I think this is really something that we are using.

> It's kind of funny that receiving one healthy heartbeat is 100%
> health, but receiving one healthy heartbeat and one marginally healthy
> one is "less healthy".
>
> Actually if we received all the "healthy" heartbeats, only then are we
healthy. If we receive even one unhealthy heartbeat we will not be 100%
healthy.

> I don't think that we can realistically expect exactly 7 CCM packets
> in 7 fault intervals.  It's a matter of luck.  It will be OK, if the
> CCM packets arrive in the middle of our fault intervals, like this:
>
>   -------------------------------------------->  time
>   X                    X                    X    fault intervals
> ^     ^     ^     ^     ^     ^     ^     ^     ^ CCM reception
>
> But if reception of CCM packets happens to be aligned closely to the
> fault intervals, like this:
>

>   -------------------------------------------->  time
>   X                    X                    X    fault intervals
>   ^     ^     ^     ^     ^     ^     ^     ^    CCM reception
>
> then we could easily end up receiving 6 CCMs in some intervals and 8
> CCMs in other intervals and get a lower "health" score even though the
> latter situation is exactly as "healthy" as the former.
>
> We are expecting to receive 7 healthy heartbeats every 2 fault intervals.
According to the code, a fault interval is defined as 3.5 * cfm_interval
(which is configurable. Seems we generally set it to be 300 ms).
Essentially what we are suggesting is that we should be receivng a
heartbeat every cfm_interval. In order to get to an integral number of
heartbeats, instead of considering only 1 fault interval, we are
considering two (to get a whole number).
I can think of a case where in a perfectly healthy link with 0 packet loss,
we get 8 CCM frames. But I am not sure if we can receive less than 7 CCM
frames in 2 fault intervals ? I am not sure I I understood the diagram
clearly.

> I think that the initialization for the algorithm is suboptimal.
> Suppose that after the first "health interval" we've received one CCM
> from a remote MP.  I'd expect this to be poor health, 1 out of 7
> (perhaps 14%), but my reading of the code is that we'd give it 57%
> because we average it with an initial value of 100%.  Would it be
> better, the first time that the algorithm runs for a given remote MP,
> to use the new calculated value without averaging it with any initial
> value?
>

Yes. I guess we should start the averaging with 0. I will change that to
the health of a link being 0 at intialization.

>
> The "weighted moving average" applies only to each individual remote
> MP.  If no CCMs are received within a given interval, then the remote
> MP will be deleted.  If a remote MP appears in the next interval and
> we receive all 7 CCMs from it, then the health will jump up to 100%
> instantly, even though that's pretty unrealistic (it couldn't have
> been more than 50% in the previous interval assuming there was only
> one remote MP).
>
> When a new remote MP appears in the middle of a "health interval", it
> will initially get an artificially low health score.  For example, if
> a new remote MP appears just before the end of a health interval, it
> cannot receive an initial health better than 57%, which may be
> deceptive.
>
> I think this comment isn't right:
> > +    int health_interval;      /* Num of fault_intervals used to compute
> the
> > +                               * health. */
>
> Some lines use tabs for indentation but should use spaces, e.g.:
> > +        if (ccm_rdi) {
> > +         fault = true;
> > +        }
>
>
Will look into the indentation issues.

> Please add a space after HMAP_FOR_EACH:
> > +            HMAP_FOR_EACH(rmp, node, &cfm->remote_mps) {
>
> Done.

> I believe the comment below should also say to return -1 if CFM is not
> configured:
> > +    /* Checks the health of CFM configured on 'ofport'.  Returns an
> integer
> > +     * to indicate the health percentage of the 'ofport' which is an
> average of
> > +     * the health of all the remote_mps.  Returns an integer between 0
> and 100
> > +     * where 0 means that the 'ofport' is very unhealthy and 100 means
> the
> > +     * 'ofport' is perfectly healthy.
> > +     *
> > +     * This function may be a null pointer if the ofproto
> implementation does
> > +     * not support CFM. */
> > +    int (*get_cfm_health)(const struct ofport *ofport);
> > +
> >      /* Configures spanning tree protocol (STP) on 'ofproto' using the
> >       * settings defined in 's'.
> >       *
>
> Will modify it.

> In iface_refresh_cfm_stats(), you can declare cfm_health in the inner
> block that uses it, and the cast to int64_t should not be necessary.
>
> vswitch.xml uses some jargon that users should not need to be
> comfortable with, or that we should at least define:
>
>        "3.5/'fault_interval" isn't necessarily meaningful; we didn't
>        define "fault interval".  Perhaps "if not received at the
>        expected rate"?
>
>        "remote_mpids" isn't defined.  Maybe "all remote MPID listed
>        in <ref column="cfm_remote_mpids"/>s"?
>
> I don't think that the health is actually refreshed every 2 seconds in
> every case.  That's approximately true for a 300 ms transmission
> interval (7 * 300 ms == 2.1 s).
>
> Modified to read "refreshed approximately every 2 seconds".

> I don't think we need to say "The health of a interface can vary from
> 0 to 100" because the ovsdb-doc processor already should say that (and
> elsewhere we say it's a percentage).
>
> I will cleanup the description a bit as you have suggested.

> Thanks,
>
> Ben.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openvswitch.org/pipermail/ovs-dev/attachments/20120404/934492b9/attachment-0003.html>