[ovs-dev] [PATCH] Granular link health statistics for cfm.

Ethan Jackson ethan at nicira.com
Wed Apr 4 21:42:19 UTC 2012


> Is this "health" percentage something that we invented, or is it an
> implementation of a standard, etc.?

It's something we invented.  It's not meant to be super accurate, just
to give us a general sense that something is wrong so that an operator
can take a look.

> It's kind of funny that receiving one healthy heartbeat is 100%
> health, but receiving one healthy heartbeat and one marginally healthy
> one is "less healthy".

I'm not sure what this means precisely. Could you please expand?

> I don't think that we can realistically expect exactly 7 CCM packets
> in 7 fault intervals.  It's a matter of luck.  It will be OK, if the
> CCM packets arrive in the middle of our fault intervals, like this:
>
>   -------------------------------------------->  time
>   X                    X                    X    fault intervals
> ^     ^     ^     ^     ^     ^     ^     ^     ^ CCM reception
>
> But if reception of CCM packets happens to be aligned closely to the
> fault intervals, like this:
>
>   -------------------------------------------->  time
>   X                    X                    X    fault intervals
>   ^     ^     ^     ^     ^     ^     ^     ^    CCM reception
>
> then we could easily end up receiving 6 CCMs in some intervals and 8
> CCMs in other intervals and get a lower "health" score even though the
> latter situation is exactly as "healthy" as the former.

Yes, we thought about this while designing the feature.  It seems fine
to me if the health percentage dips a bit in this case.  I think some
fluctuation is unavoidable here, but we can probably do things to
mitigate it.  One option is to allow the percentage to float above
100% when more than the expected number of CCMs is received, which
would balance out the intervals where fewer are received.  Basically,
we would let the internal value float and only truncate it, from
upstream's perspective, in the call to cfm_get_health().  The other
option is to increase the weight of the old percentage relative to the
newer data in an attempt to smooth out the graph.
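
A rough sketch of the first idea, in Python for brevity rather than
the C of the patch.  The class, its fields, and the default weight are
made up for illustration; only cfm_get_health() is a real name, and
get_health() below merely stands in for it:

```python
class RemoteMpHealth:
    """Hypothetical per-remote-MP health tracker (not the cfm.c code)."""

    def __init__(self, expected_ccms=7, old_weight=0.5):
        self.expected_ccms = expected_ccms  # CCMs expected per health interval
        self.old_weight = old_weight        # weight given to the previous value
        self.raw = 100.0                    # internal value, may exceed 100

    def end_interval(self, ccms_received):
        # Do not clamp the sample: 8 of 7 CCMs counts as roughly 114%.
        sample = 100.0 * ccms_received / self.expected_ccms
        self.raw = (self.old_weight * self.raw
                    + (1.0 - self.old_weight) * sample)

    def get_health(self):
        # Stand-in for cfm_get_health(): truncate only when reporting.
        return min(100, int(self.raw))
```

An interval with 8 CCMs pushes the internal value above 100, so a
following interval with 6 pulls the reported number down less than it
would if every sample were clamped first.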

My thinking is that users of this feature will basically set an alarm
if the health percentage dips below a certain number (95 or something).
We just need to be sure that the math works out so that this case
doesn't cause alarms to go off.
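
One way to check the math is a quick simulation of the 6/8 pattern
described above.  This is only a sketch: the function, the starting
value of 100, and the candidate weights are assumptions, not what the
patch does.

```python
def lowest_health(counts, expected=7, old_weight=0.5, allow_float=True):
    """Return the lowest reported health over per-interval CCM counts."""
    health = 100.0
    lowest = 100.0
    for received in counts:
        sample = 100.0 * received / expected
        if not allow_float:
            sample = min(sample, 100.0)  # clamp every sample to 100%
        health = old_weight * health + (1.0 - old_weight) * sample
        # Upstream only ever sees the value truncated to 100.
        lowest = min(lowest, min(health, 100.0))
    return lowest


# CCM reception aligned with the interval boundary: 6, 8, 6, 8, ...
jittery = [6, 8] * 20

for allow_float in (False, True):
    for old_weight in (0.5, 0.75):
        low = lowest_health(jittery, old_weight=old_weight,
                            allow_float=allow_float)
        print(f"float={allow_float} old_weight={old_weight} "
              f"lowest={low:.1f}")
```

With these made-up parameters, only the combination of a floating
value and the heavier old weight keeps the low point above 95, so we
may well want both mitigations rather than one.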


> I think that the initialization for the algorithm is suboptimal.
> Suppose that after the first "health interval" we've received one CCM
> from a remote MP.  I'd expect this to be poor health, 1 out of 7
> (perhaps 14%), but my reading of the code is that we'd give it 57%
> because we average it with an initial value of 100%.  Would it be
> better, the first time that the algorithm runs for a given remote MP,
> to use the new calculated value without averaging it with any initial
> value?
>
> The "weighted moving average" applies only to each individual remote
> MP.  If no CCMs are received within a given interval, then the remote
> MP will be deleted.  If a remote MP appears in the next interval and
> we receive all 7 CCMs from it, then the health will jump up to 100%
> instantly, even though that's pretty unrealistic (it couldn't have
> been more than 50% in the previous interval assuming there was only
> one remote MP).

Perhaps to solve both of these issues, we could say a new remote MP
has a health of 0.  When a new one appears, it would cause an initial
dip in the health percentage, but that seems fine to me.  Thoughts?
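
To put numbers on it, here is a small sketch assuming the plain 50/50
averaging implied by the 57% figure quoted above.  The helper is
hypothetical and written in Python, not taken from the patch:

```python
def averaged_health(previous, received, expected=7):
    """Average the previous health with this interval's percentage."""
    sample = received * 100 // expected  # integer percentage for the interval
    return (previous + sample) // 2


print(averaged_health(100, 1))  # initial value 100, 1 of 7 CCMs -> 57
print(averaged_health(0, 1))    # initial value 0,   1 of 7 CCMs -> 7
print(averaged_health(0, 7))    # initial value 0,   7 of 7 CCMs -> 50
```

Starting from 0, a remote MP that shows up and delivers all 7 CCMs
reports 50 after its first interval instead of jumping straight to
100.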

> When a new remote MP appears in the middle of a "health interval", it
> will initially get an artificially low health score.  For example, if
> a new remote MP appears just before the end of a health interval, it
> cannot receive an initial health better than 57%, which may be
> deceptive.

I think this is a reasonable trade-off.  We just need to be consistent
in the documentation in saying that health starts low initially.  I
don't think there's a general way of solving the above problems short
of requiring users to explicitly state in advance which remote MPs
they expect to see.

Just now I did think of one possible alternative approach which may
solve the problem.  We could allow the health value to float above
100, and say that 100 health points corresponds roughly to one healthy
remote MP.  Using this algorithm, instead of maintaining a health
percentage for each remote_mp individually, we would simply maintain a
rolling average for the CFM module as a whole.  With one remote MP,
the average would hover around 100; with two, around 200; with three,
around 300; and so on.  This has the advantage of robustly handling
the case where remote MPs are popping in and out of existence.
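
Roughly what I have in mind, sketched in Python.  All names here, the
starting value of 0, and the 50/50 weight are assumptions for
illustration only:

```python
class CfmHealth:
    """Hypothetical module-wide health: ~100 points per healthy remote MP."""

    def __init__(self, expected_ccms=7, old_weight=0.5):
        self.expected_ccms = expected_ccms  # CCMs expected per MP per interval
        self.old_weight = old_weight        # weight given to the previous value
        self.points = 0.0                   # rolling average, unbounded above

    def end_interval(self, total_ccms_received):
        # Count CCMs from all remote MPs together; no per-MP state is kept,
        # so MPs appearing or disappearing need no special handling.
        sample = 100.0 * total_ccms_received / self.expected_ccms
        self.points = (self.old_weight * self.points
                       + (1.0 - self.old_weight) * sample)


cfm = CfmHealth()
for _ in range(20):
    cfm.end_interval(7)    # one healthy remote MP
print(round(cfm.points))   # settles near 100
for _ in range(20):
    cfm.end_interval(14)   # a second healthy remote MP joins
print(round(cfm.points))   # settles near 200
```

The catch is that an operator would have to know how many remote MPs
to expect in order to pick an alarm threshold.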


