[ovs-dev] [PATCH v9 2/2] netdev-dpdk:Detailed packet drop statistics

Sriram Vatala sriram.v at altencalsoftlabs.com
Wed Oct 16 14:20:58 UTC 2019


Hi Kevin & Ilya,
Thanks Kevin for reviewing the patch. Response inline.
Thanks Ilya for your response.

-----Original Message-----
From: Ilya Maximets <i.maximets at ovn.org>
Sent: 16 October 2019 17:46
To: Kevin Traynor <ktraynor at redhat.com>; Sriram Vatala 
<sriram.v at altencalsoftlabs.com>; ovs-dev at openvswitch.org; Ilya Maximets 
<i.maximets at ovn.org>
Cc: Stokes, Ian <ian.stokes at intel.com>
Subject: Re: [PATCH v9 2/2] netdev-dpdk:Detailed packet drop statistics

Hi Kevin,

Thanks for review. Some comments inline.

On 16.10.2019 12:15, Kevin Traynor wrote:
> Hi Sriram,
>
> Thanks for working on making more fine grained stats for debugging. As
> mentioned yesterday, this patch needs rebase so I just reviewed docs
> and netdev-dpdk.c which applied. Comments below.
>
> On 21/09/2019 03:40, Sriram Vatala wrote:
>> OVS may be unable to transmit packets for multiple reasons and today
>> there is a single counter to track packets dropped due to any of
>> those reasons. The most common reason is that a VM is unable to read
>> packets fast enough causing the vhostuser port transmit queue on the
>> OVS side to become full. This manifests as a problem with VNFs not
>> receiving all packets. Having a separate drop counter to track
>> packets dropped because the transmit queue is full will clearly
>> indicate that the problem is on the VM side and not in OVS. Similarly
>> maintaining separate
>

> It reads like a bit of a contradiction. On one hand the "The *most
> common* reason is that a VM is unable to read packets fast enough".
> While having a stat "*will clearly indicate* that the problem is on
> the VM side".
>

    I mentioned one use-case to emphasis the need for separate drop stats 
besides combined drop counter. I will rephrase the commit message.

>> counters for all possible drops helps in indicating sensible cause
>> for packet drops.
>>
>> This patch adds custom software stats counters to track packets
>> dropped at port level and these counters are displayed along with
>> other stats in "ovs-vsctl get interface <iface> statistics"
>> command. The detailed stats will be available for both dpdk and
>> vhostuser ports.
>>
>
> I think the commit msg could be a bit clearer, suggest something like
> below - feel free to modify (or reject):
>
> OVS may be unable to transmit packets for multiple reasons on the DPDK
> datapath and today there is a single counter to track packets dropped
> due to any of those reasons.
>
> This patch adds custom software stats for the different reasons
> packets may be dropped during tx on the DPDK datapath in OVS.
>
> - MTU drops : drops that occur due to a too large packet size
> - Qos drops: drops that occur due to egress QOS
> - Tx failures: drops as returned by the DPDK PMD send function
>
> Note that the reason for tx failures is not specificied in OVS. In
> practice for vhost ports it is most common that tx failures are
> because there are not enough available descriptors, which is usually
> caused by misconfiguration of the guest queues and/or because the
> guest is not consuming packets fast enough from the queues.
>
> These counters are displayed along with other stats in "ovs-vsctl get
> interface <iface> statistics" command and are available for dpdk and
> vhostuser/vhostuserclient ports.
>
    Looks good to me. Thanks for the suggestion.
> Signed-off-by: Sriram Vatala <sriram.v at altencalsoftlabs.com>
>
> ---
>
> v9:...
> v8:...
>
>
> Note that signed-off, --- and version info should be like this ^^^.
> otherwise the review version comments will be in the git commit log
> when it is applied.
>
   I will take care of this in my next patch.
>> --
>> Changes since v8:
>> Addressed comments given by Ilya.
>>
>> Signed-off-by: Sriram Vatala <sriram.v at altencalsoftlabs.com>
>> ---
>>   Documentation/topics/dpdk/vhost-user.rst      | 13 ++-
>>   lib/netdev-dpdk.c                             | 81 +++++++++++++++----
>>   utilities/bugtool/automake.mk                 |  3 +-
>>   utilities/bugtool/ovs-bugtool-get-port-stats  | 25 ++++++
>>   .../plugins/network-status/openvswitch.xml    |  1 +
>>   5 files changed, 105 insertions(+), 18 deletions(-)
>>   create mode 100755 utilities/bugtool/ovs-bugtool-get-port-stats
>>
>> diff --git a/Documentation/topics/dpdk/vhost-user.rst
>> b/Documentation/topics/dpdk/vhost-user.rst
>> index 724aa62f6..89388a2bf 100644
>> --- a/Documentation/topics/dpdk/vhost-user.rst
>> +++ b/Documentation/topics/dpdk/vhost-user.rst
>> @@ -551,7 +551,18 @@ processing packets at the required rate.
>>   The amount of Tx retries on a vhost-user or vhost-user-client interface 
>> can be
>>   shown with::
>>
>> -  $ ovs-vsctl get Interface dpdkvhostclient0 statistics:tx_retries
>> +  $ ovs-vsctl get Interface dpdkvhostclient0
>> + statistics:netdev_dpdk_tx_retries
>
> I think the "netdev_dpdk_" prefixes should be removed for a few
> reasons,
>
> - These are custom stats that will only be displayed for the dpdk
> ports so they don't need any extra prefix to say they are for dpdk
> ports
>
> - The 'tx_retries' stat is part of OVS 2.12, I don't like to change
> its name to a different one in OVS 2.13 unless there is a very good
> reason
>
> - The existing stats don't have this type of prefixes (granted most of
> them are general stats):

The main reason for the prefix for me is to distinguish those stats from the 
stats that comest from the driver/HW.  Our new stats has very generic names 
that could be named a lot like or even equal to HW stats.  This might be not 
an issue for vhostuser, but for HW NICs this may be really confusing for 
users.

I'm not insisting on 'netdev_dpdk_' prefix, but, IMHO, we need a way to 
clearly distinguish those stats.  We may use different prefixes like 'sw_' or 
just 'ovs_'.

As Ilya has already mentioned, the reason for using the prefix is to 
distinguish the custom stats with driver stats. I am not insisting on this. 
Please share your thoughts on this. I can use the approach that is good to go 
with, for all.

> # ovs-vsctl get Interface dpdkvhost0 statistics
> {"rx_1024_to_1522_packets"=0, "rx_128_to_255_packets"=0,
> "rx_1523_to_max_packets"=0, "rx_1_to_64_packets"=25622176,
> "rx_256_to_511_packets"=0, "rx_512_to_1023_packets"=0,
> "rx_65_to_127_packets"=0, rx_bytes=1537330560, rx_dropped=0,
> rx_errors=0, rx_packets=25622176, tx_bytes=3829825920, tx_dropped=0,
> tx_packets=63830432, tx_retries=0}
>
> Also, just to note that if there are changes to existing
> interfaces/behaviour it should really mention that in the commit
> message so it is highlighted.
>
OK. There are no changes to existing interfaces stats or behaviour except that 
name for "tx_retires" is changed. This might change based on the final 
approach of stats name(prefix/non-prefix) that is good to go with, for all. I 
will definitely mention in the commit message if there are any changes.
>> +
>> +When the guest is not able to consume the packets fast enough, the
>> +transmit queue of the port gets filled up i.e queue runs out of free 
>> descriptors.
>> +This is the most likely reason why dpdk transmit API will fail to
>> +send packets besides other reasons.
>> +
>> +The amount of tx failure drops on a dpdk vhost/physical interface
>> +can be shown with::
>> +
>> +  $ ovs-vsctl get Interface dpdkvhostclient0 \
>> +                            statistics:netdev_dpdk_tx_failure_drops
>>
>
> The commit msg/docs are only focusing on one stat for vhost ports, but
> there are other stats and dpdk ports, so they should all get some mention.

I don't feel comfortable for documenting custom stats.  This is just because 
we can't describe them all.  These are things to be changed over time.
And we don't really know what some types of stats means and if they means the 
same for different types of HW NICs (they are definitely not) or OVS netdevs 
(even within netdev-dpdk).
And that is one more reason to have a prefix for OVS internal statistics on 
which we have at least partial control.

I think, that user documentation could mention some special statistics while 
describing the troubleshooting for some hard special cases, but this should 
not be a glossary of all the possible custom stats.  From my point of view, 
names of the stats should be as possible self-descriptive and should not 
require additional documentation.

BTW, you will not find any description for statistics provided by the linux or 
DPDK drivers.  You could only look at the driver code or device spec.

I felt that "tx failure drops" need some explanation apart from other stats 
because names for other stats are sensible to user, besides I have added brief 
description for every counter in comment lines above the stats structure. 
Hence documented only one stat. I am not insisting on this. Please share your 
thoughts.

Thanks & Regards,
Sriram.


More information about the dev mailing list