[ovs-dev] [RFE] Event mechanism for a pro-active packet drop detection and recovery
blp at ovn.org
Fri Jul 5 19:53:40 UTC 2019
Wow. There's a lot here.
Some of my reactions:
- It's good to increase visibility.
- I don't know much about the importance of different kinds of
visibility or what kinds of tools will consume them downstream. I
don't know the ultimate goals.
- OVS doesn't currently implement d-bus. I don't know anything about
d-bus, such as how much work it is to implement, whether OVS would
need new library dependencies or how demanding those are, or whether
it could be cross-platform (i.e. also support Windows).
- One can introduce new individual features for tracking different kinds
  of drops. One can also introduce different frameworks for reporting
  them and alerting/alarming on them. I guess that these can probably
  be developed independently.
- There are multiple levels at which drops can happen or be detected. I
wonder whether all of these can be addressed by a single framework.
Do you have an idea for next steps? Sometimes it helps to have a
specific proposal to discuss, even if it is a straw man. Maybe writing
something up would help.
On Fri, Jun 28, 2019 at 11:10:14AM +0530, Gowrishankar Muthukrishnan wrote:
> Today (*), when a packet's journey through the data path is disrupted,
> leading toward its drop, we have OVS counters to detect it and report
> it in response to user space commands. Some categories of drops relate
> to interfaces and can be queried from the OVSDB Interface table, while
> others are available in real time in the data path through respective
> OVS commands (e.g., ovs-appctl coverage/show and ovs-appctl
> dpctl/show). It is unavoidable that the drop stats are split across
> multiple sources, but at the end of the day the user has to query in
> different ways to figure out:
> (1) that there is a packet drop,
> (2) the reason for the drop,
> (3) and, in the meantime, miss the precious opportunity to correct
> available resources in the data path to prevent further drops.
> To ease the difficulty of monitoring these data, we already have tools
> such as collectd to record the events, but IMHO there is a slight gap
> between what we have today and what we develop upstream: collectd can
> learn about packet drops only in the context of the Interface table.
> The other categories of drops (related to QoS, metering, tunnels,
> upcalls, recirculation, MTU mismatch, and even invalid packets) cannot
> be monitored by collectd, because neither an association with the
> Interface table nor a separate table for them exists today.
> There is only an indirect association: for example, Flow_Table
> represents all the packet flow rules, and when a packet is dropped, one
> can only check Flow_Table for a drop action, which is not a unified way
> to quickly detect drops and correct resources. Thanks to our developers,
> these drops are recorded in some way now, but in the field the time to
> recover from the drops easily elapses, given that the stats first have
> to be collected, then analysed by experts, and only then can a recovery
> action be applied. Also, there can be a pressing requirement to keep
> packet drops very low, on the order of a few parts per million (ppm).
> Hence, I would like to request suggestions from experts on how we can
> handle this situation through OVS; my humble ideas are below.
> (1) Unify the data collection into a common place:
> We can think of having a separate Datapath table to record the
> necessary context of a packet drop (the drop reason and its count, to
> start with). This would require only minimal changes in the ecosystem
> (e.g. collectd) to stay in sync. A workaround until then is to continue
> using the existing tables wherever possible, with an additional
> statistics row where one does not exist.
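As a rough illustration of idea (1), a small script could fold the scattered counters into one mapping. This is only a sketch: it assumes `ovs-appctl coverage/show`-style output with lines of the form `name ... total: N`, and the counter names in the sample text are made up for the example.

```python
import re

# Hedged sketch: collapse `ovs-appctl coverage/show`-style output into a
# single {counter: total} mapping, keeping only drop-related counters.
# SAMPLE is illustrative text, not real OVS output.
SAMPLE = """\
Event coverage, avg rate over last: 5 seconds, last minute, last hour:
upcall_queue_overflow      0.2/sec    0.100/sec    0.0100/sec   total: 42
datapath_drop_invalid_port 0.0/sec    0.000/sec    0.0000/sec   total: 7
flow_extract               1.0/sec    0.900/sec    0.8000/sec   total: 1234
"""

LINE_RE = re.compile(r"^(\S+)\s+.*total:\s+(\d+)\s*$")

def drop_totals(text, keywords=("drop", "overflow")):
    """Collect counters whose names suggest packet loss."""
    totals = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m and any(k in m.group(1) for k in keywords):
            totals[m.group(1)] = int(m.group(2))
    return totals

print(drop_totals(SAMPLE))
```

A unified Datapath table would make this kind of scraping unnecessary, since a monitor such as collectd could read one table instead of parsing several command outputs.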
> (2) Notify drops soon or never!
> Instead of detecting DB record updates (even after (1) above), with the
> latency of DB transactions lagging behind the real-time data, why not
> have OVS generate events to the consuming ecosystem pro-actively? I can
> think of D-Bus, for instance, to broadcast packet drop notifications.
> As a disclaimer, I am not a D-Bus expert :) it is just an idea to
> discuss. An analogy in terms of the CLI (though using the library would
> be better): broadcasting an event for every packet may exhaust
> resources in the notification chain, so instead we could follow
> guidelines set by the user, e.g. an allowable drop ppm above which to
> notify, or even wait for a signal from a registered monitoring agent on
> D-Bus to enable the broadcast.
> OVS: dbus-send --system --type=signal /net/ovsmon/Datapath \
>     net.ovsmon.Datapath.Qfull string:<port_name_that_packet_arrived>
> Monitor: dbus-monitor --system "type='signal',interface='net.ovsmon.Datapath'"
> signal sender=net.ovsmon.Datapath -> dest=:1.102
> path=/net/ovsmon/Datapath; interface=net.ovsmon.Datapath; member=Qfull
> string "vhost-port-1"
> Monitor: dbus-send --system --dest=net.ovsmon /net/ovsmon/Interface \
>     net.ovsmon.Interface.SetProperty string:<port_name> \
>     variant:string:"queue_size=<new value>"
> OVS: <monitors the call and applies the corrective action>
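The "notify only above an allowable drop ppm" guideline mentioned above could be sketched roughly as follows. This is a pure-Python illustration under stated assumptions: the `DropNotifier` class and its `emit` callback are hypothetical names standing in for whatever D-Bus signal a real implementation inside ovs-vswitchd would emit.

```python
# Hedged sketch of the "allowable drop ppm" guideline: notify only when
# the observed drop rate exceeds a user-set budget, instead of
# broadcasting on every dropped packet. All names are hypothetical.

class DropNotifier:
    def __init__(self, allowed_ppm, emit):
        self.allowed_ppm = allowed_ppm  # user guideline: drops per million packets
        self.emit = emit                # callback standing in for a D-Bus signal
        self.packets = 0
        self.drops = 0

    def record(self, dropped):
        """Account one packet; notify only if the ppm budget is exceeded."""
        self.packets += 1
        if dropped:
            self.drops += 1
            ppm = self.drops * 1_000_000 // self.packets
            if ppm > self.allowed_ppm:
                self.emit(ppm)

events = []
n = DropNotifier(allowed_ppm=100, emit=events.append)
for i in range(1_000_000):
    # one drop every 5000 packets -> 200 ppm, above the 100 ppm budget
    n.record(dropped=(i % 5000 == 0))
print(n.drops, len(events) > 0)
```

The point of the guideline is that the notification chain sees a bounded stream of over-budget events rather than one event per lost packet.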
> If you think this sounds good, I can think further about prototyping it
> for a better demonstration; otherwise, please suggest any better
> approach as well.
> * The patches below are upstream, accepted or under review at present:
>  https://patchwork.ozlabs.org/patch/1123287/
>  https://patchwork.ozlabs.org/patch/1111568/
>  https://patchwork.ozlabs.org/patch/1115978/
>  http://www.openvswitch.org//ovs-vswitchd.conf.db.5.pdf
> The respective developers from the above mail chains are CC'd; however,
> others are also welcome. Also, I think ovs-dev is the appropriate ML
> for this discussion.
> Kind regards,
> Gowrishankar M