[ovs-dev] [PATCH] Granular link health statistics for cfm.
Mehak Mahajan
mmahajan at nicira.com
Wed Apr 4 21:43:23 UTC 2012
Hi Ben
On Wed, Apr 4, 2012 at 2:15 PM, Ben Pfaff <blp at nicira.com> wrote:
> On Wed, Apr 04, 2012 at 11:38:42AM -0700, Mehak Mahajan wrote:
> > The changes display the cfm_health of an interface. The cfm_health
> > is an exponential weighted moving average of the health of all
> > remote_mpids. The value can vary from 0 to 100, 100 being very healthy
> > and 0 being unhealthy.
> >
> > Feature #10363
> > Requested-by: Ethan Jackson <ethan at nicira.com>
> > Signed-off-by: Mehak Mahajan <mmahajan at nicira.com>
>
> Is this "health" percentage something that we invented, or is it an
> implementation of a standard, etc.?
>
> I think this is really something that we are using.
> It's kind of funny that receiving one healthy heartbeat is 100%
> health, but receiving one healthy heartbeat and one marginally healthy
> one is "less healthy".
>
> Actually if we received all the "healthy" heartbeats, only then are we
healthy. If we receive even one unhealthy heartbeat we will not be 100%
healthy.
> I don't think that we can realistically expect exactly 7 CCM packets
> in 7 fault intervals. It's a matter of luck. It will be OK, if the
> CCM packets arrive in the middle of our fault intervals, like this:
>
> --------------------------------------------> time
> X X X fault intervals
> ^ ^ ^ ^ ^ ^ ^ ^ ^ CCM reception
>
> But if reception of CCM packets happens to be aligned closely to the
> fault intervals, like this:
>
> --------------------------------------------> time
> X X X fault intervals
> ^ ^ ^ ^ ^ ^ ^ ^ CCM reception
>
> then we could easily end up receiving 6 CCMs in some intervals and 8
> CCMs in other intervals and get a lower "health" score even though the
> latter situation is exactly as "healthy" as the former.
>
> We are expecting to receive 7 healthy heartbeats every 2 fault intervals.
According to the code, a fault interval is defined as 3.5 * cfm_interval
(which is configurable. Seems we generally set it to be 300 ms).
Essentially what we are suggesting is that we should be receivng a
heartbeat every cfm_interval. In order to get to an integral number of
heartbeats, instead of considering only 1 fault interval, we are
considering two (to get a whole number).
I can think of a case where in a perfectly healthy link with 0 packet loss,
we get 8 CCM frames. But I am not sure if we can receive less than 7 CCM
frames in 2 fault intervals ? I am not sure I I understood the diagram
clearly.
> I think that the initialization for the algorithm is suboptimal.
> Suppose that after the first "health interval" we've received one CCM
> from a remote MP. I'd expect this to be poor health, 1 out of 7
> (perhaps 14%), but my reading of the code is that we'd give it 57%
> because we average it with an initial value of 100%. Would it be
> better, the first time that the algorithm runs for a given remote MP,
> to use the new calculated value without averaging it with any initial
> value?
>
Yes. I guess we should start the averaging with 0. I will change that to
the health of a link being 0 at intialization.
>
> The "weighted moving average" applies only to each individual remote
> MP. If no CCMs are received within a given interval, then the remote
> MP will be deleted. If a remote MP appears in the next interval and
> we receive all 7 CCMs from it, then the health will jump up to 100%
> instantly, even though that's pretty unrealistic (it couldn't have
> been more than 50% in the previous interval assuming there was only
> one remote MP).
>
> When a new remote MP appears in the middle of a "health interval", it
> will initially get an artificially low health score. For example, if
> a new remote MP appears just before the end of a health interval, it
> cannot receive an initial health better than 57%, which may be
> deceptive.
>
> I think this comment isn't right:
> > + int health_interval; /* Num of fault_intervals used to compute
> the
> > + * health. */
>
> Some lines use tabs for indentation but should use spaces, e.g.:
> > + if (ccm_rdi) {
> > + fault = true;
> > + }
>
>
Will look into the indentation issues.
> Please add a space after HMAP_FOR_EACH:
> > + HMAP_FOR_EACH(rmp, node, &cfm->remote_mps) {
>
> Done.
> I believe the comment below should also say to return -1 if CFM is not
> configured:
> > + /* Checks the health of CFM configured on 'ofport'. Returns an
> integer
> > + * to indicate the health percentage of the 'ofport' which is an
> average of
> > + * the health of all the remote_mps. Returns an integer between 0
> and 100
> > + * where 0 means that the 'ofport' is very unhealthy and 100 means
> the
> > + * 'ofport' is perfectly healthy.
> > + *
> > + * This function may be a null pointer if the ofproto
> implementation does
> > + * not support CFM. */
> > + int (*get_cfm_health)(const struct ofport *ofport);
> > +
> > /* Configures spanning tree protocol (STP) on 'ofproto' using the
> > * settings defined in 's'.
> > *
>
> Will modify it.
> In iface_refresh_cfm_stats(), you can declare cfm_health in the inner
> block that uses it, and the cast to int64_t should not be necessary.
>
> vswitch.xml uses some jargon that users should not need to be
> comfortable with, or that we should at least define:
>
> "3.5/'fault_interval" isn't necessarily meaningful; we didn't
> define "fault interval". Perhaps "if not received at the
> expected rate"?
>
> "remote_mpids" isn't defined. Maybe "all remote MPID listed
> in <ref column="cfm_remote_mpids"/>s"?
>
> I don't think that the health is actually refreshed every 2 seconds in
> every case. That's approximately true for a 300 ms transmission
> interval (7 * 300 ms == 2.1 s).
>
> Modified to read "refreshed approximately every 2 seconds".
> I don't think we need to say "The health of a interface can vary from
> 0 to 100" because the ovsdb-doc processor already should say that (and
> elsewhere we say it's a percentage).
>
> I will cleanup the description a bit as you have suggested.
> Thanks,
>
> Ben.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openvswitch.org/pipermail/ovs-dev/attachments/20120404/934492b9/attachment-0003.html>
More information about the dev
mailing list