[ovs-dev] [PATCH v3] netdev_dpdk.c: Add QoS functionality.

Flavio Leitner fbl at sysclose.org
Thu Feb 18 16:04:07 UTC 2016


On Wed, 17 Feb 2016 14:40:26 +0000
"Stokes, Ian" <ian.stokes at intel.com> wrote:

> > -----Original Message-----
> > From: dev [mailto:dev-bounces at openvswitch.org] On Behalf Of Stokes, Ian
> > Sent: Thursday, February 11, 2016 2:42 PM
> > To: Flavio Leitner
> > Cc: dev at openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH v3] netdev_dpdk.c: Add QoS functionality.
> >   
> > > On Thu, 11 Feb 2016 12:32:56 +0000
> > > "Stokes, Ian" <ian.stokes at intel.com> wrote:
> > >  
> > > > Thanks For the review Flavio, much appreciated, comments inline.
> > > > I'll  
> > > re-spin a new version also.
> > >
> > > Thank you for the patch!
> > > comments inline.
> > >
> > >  
> > > >  
> > > > > -----Original Message-----
> > > > > From: Flavio Leitner [mailto:fbl at sysclose.org]
> > > > > Sent: Wednesday, February 10, 2016 7:55 PM
> > > > > To: Stokes, Ian
> > > > > Cc: dev at openvswitch.org
> > > > > Subject: Re: [ovs-dev] [PATCH v3] netdev_dpdk.c: Add QoS  
> > > functionality.  
> > > > >
> > > > > On Mon,  1 Feb 2016 20:47:25 +0000 Ian Stokes
> > > > > <ian.stokes at intel.com> wrote:
> > > > >  
> > > > > > This patch provides the modifications required in netdev-dpdk.c
> > > > > > and vswitch.xml to allow for a DPDK user space QoS algorithm.
> > > > > >
> > > > > > This patch adds a QoS configuration structure for netdev-dpdk
> > > > > > and expected QoS operations 'dpdk_qos_ops'. Various helper
> > > > > > functions are also supplied.
> > > > > >
> > > > > > Also included are the modifications required for vswitch.xml to
> > > > > > allow a new QoS implementation for netdev-dpdk devices. This
> > > > > > includes a new QoS type `egress-policer` as well as its expected  
> > > QoS table entries.  
> > > > > >
> > > > > > The QoS functionality implemented for DPDK devices is `egress-  
> > > > > policer`.  
> > > > > > This can be used to drop egress packets at a configurable rate.
> > > > > >
> > > > > > The INSTALL.DPDK.md guide has also been modified to provide an
> > > > > > example configuration of `egress-policer` QoS.
> > > > > >
> > > > > > Signed-off-by: Ian Stokes <ian.stokes at intel.com>
> > > > > > ---
> > > > > >  INSTALL.DPDK.md      |   20 +++
> > > > > >  lib/netdev-dpdk.c    |  439  
> > > > > ++++++++++++++++++++++++++++++++++++++++++++++++--  
> > > > > >  vswitchd/vswitch.xml |   52 ++++++
> > > > > >  3 files changed, 498 insertions(+), 13 deletions(-)
> > > > > >
> > > > > > diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md index
> > > > > > e8ef4b5..dac70cd
> > > > > > 100644
> > > > > > --- a/INSTALL.DPDK.md
> > > > > > +++ b/INSTALL.DPDK.md
> > > > > > @@ -207,6 +207,26 @@ Using the DPDK with ovs-vswitchd:
> > > > > >     ./ovs-ofctl add-flow br0 in_port=2,action=output:1
> > > > > >     ```
> > > > > >
> > > > > > +8. QoS usage example
> > > > > > +
> > > > > > +   Assuming you have a vhost-user port transmitting traffic  
> > > > > consisting of  
> > > > > > +   packets of size 64 bytes, the following command would limit
> > > > > > + the  
> > > > > egress  
> > > > > > +   transmission rate of the port to ~1,000,000 packets per  
> > > second:  
> > > > > > +
> > > > > > +   `ovs-vsctl set port vhost-user0 qos=@newqos -- --id=@newqos
> > > > > > + create  
> > > > > qos  
> > > > > > +   type=egress-policer other-config:cir=46000000
> > > > > > + other-config:cbs=2048`
> > > > > > +
> > > > > > +   To examine the QoS configuration of the port:
> > > > > > +
> > > > > > +   `ovs-appctl -t ovs-vswitchd qos/show vhost-user0`
> > > > > > +
> > > > > > +   To clear the QoS configuration from the port and ovsdb use
> > > > > > + the  
> > > > > following:  
> > > > > > +
> > > > > > +   `ovs-vsctl -- destroy QoS vhost-user0 -- clear Port
> > > > > > + vhost-user0 qos`
> > > > > > +
> > > > > > +   For more details regarding egress-policer parameters please
> > > > > > + refer  
> > > > > to the  
> > > > > > +   vswitch.xml.
> > > > > > +
> > > > > >  Performance Tuning:
> > > > > >  -------------------
> > > > > >
> > > > > > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> > > > > > 09ccc2c..716da79 100644
> > > > > > --- a/lib/netdev-dpdk.c
> > > > > > +++ b/lib/netdev-dpdk.c
> > > > > > @@ -44,6 +44,7 @@
> > > > > >  #include "ovs-rcu.h"
> > > > > >  #include "packets.h"
> > > > > >  #include "shash.h"
> > > > > > +#include "smap.h"
> > > > > >  #include "sset.h"
> > > > > >  #include "unaligned.h"
> > > > > >  #include "timeval.h"
> > > > > > @@ -52,12 +53,14 @@
> > > > > >
> > > > > >  #include "rte_config.h"
> > > > > >  #include "rte_mbuf.h"
> > > > > > +#include "rte_meter.h"
> > > > > >  #include "rte_virtio_net.h"
> > > > > >
> > > > > >  VLOG_DEFINE_THIS_MODULE(dpdk);
> > > > > >  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> > > > > >
> > > > > >  #define DPDK_PORT_WATCHDOG_INTERVAL 5
> > > > > > +#define DPDK_MAX_QOS_NAME_SIZE 10  
> > > > Noted, will expand to 20 for the moment unless you think it should
> > > > be  
> > > higher?
> > >
> > > I'd rather take that out completely.
> > >  
> > 
> > Done, Daniele had a similar comment and it's not needed.
> >   
> > > > > > +
> > > > > > +/* Action that can be set for a packet for rte_meter */ enum
> > > > > > +egress_policer_action {
> > > > > > +        GREEN = e_RTE_METER_GREEN,
> > > > > > +        DROP = 1,
> > > > > > +};  
> > > > >
> > > > > I don't think it's a good idea to mix external and internal
> > > > > declarations because if the external one changes, it will break
> > > > > here. Since GREEN and DROP have meanings only inside of OVS, we  
> > > should define our own values.  
> > > > > See my comment where this is used.
> > > > >  
> > > > Just to clarify, is this in relation to the use of  
> > e_RTE_METER_GREEN?  
> > > > Is it prefereable for GREEN to be equal to our own declaration?  
> > >
> > > We can use e_RTE_METER_GREEN and we should do it to be consistent with
> > > the rte function.  My concern is having two declarations: one coming
> > > from outside and another that you defined to a fixed number.  So, what
> > > happens if e_RTE_METER_GREEN gets shifted and now its value is 1?
> > >
> > > But with the simplification I proposed for
> > > egress_policer_pkt_handle__() that enum isn't needed and we can take  
> > it out completely.  
> > >  
> > 
> > Agreed, I've tested your suggestion and it works fine + is cleaner.
> >   
> > >  
> > > > > >  static inline void
> > > > > >  netdev_dpdk_vhost_update_tx_counters(struct netdev_stats  
> > *stats,  
> > > > > >                                       struct dp_packet
> > > > > > **packets, @@
> > > > > > -1092,6 +1222,7 @@ __netdev_dpdk_vhost_send(struct netdev
> > > > > > *netdev, int  
> > > > > qid,  
> > > > > >      struct virtio_net *virtio_dev =  
> > > > > netdev_dpdk_get_virtio(vhost_dev);  
> > > > > >      struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
> > > > > >      unsigned int total_pkts = cnt;
> > > > > > +    unsigned int qos_pkts = cnt;
> > > > > >      uint64_t start = 0;
> > > > > >
> > > > > >      if (OVS_UNLIKELY(!is_vhost_running(virtio_dev))) { @@
> > > > > > -1106,6
> > > > > > +1237,10 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int
> > > > > > +qid,
> > > > > >          rte_spinlock_lock(&vhost_dev->tx_q[qid].tx_lock);
> > > > > >      }
> > > > > >
> > > > > > +    /* Check has QoS has been configured for the netdev */
> > > > > > +    cnt = netdev_dpdk_qos_process__(vhost_dev, cur_pkts, cnt);
> > > > > > +    qos_pkts -= cnt;  
> > > > >
> > > > > I don't see how this is working because
> > > > > egress_policer_alg_process() will loop in the array one by one  
> > freeing or not the packet.  
> > > > > If it does nothing, then the array is good but otherwise it
> > > > > creates a problem because the array still references freed
> > > > > entries.  So in loop below we are sending a mix of good and freed
> > > > > buffers up to  
> > > 'cnt'.  
> > > > >  
> > > > So we should never have freed packets included in cnt as it will
> > > > cause a segfault. Consider when an array of size cnt is passed to
> > > > the netdev_dpdk_qos_process(). The timestamp used to update the
> > > > token bucket is taken before the for loop is entered. This way the
> > > > token bucket is only updated once for each batch of packets of size  
> > cnt.  
> > > > The loop cycles through the array, for any packet that is green the
> > > > token bucket decremented. If a packet is dropped then the bucket has
> > > > been exhausted. The token bucket will not be updated again within
> > > > the run of the loop. Any packets remaining in the array will be
> > > > dropped (i.e. freed) as the bucket is exhausted. The value of cnt
> > > > will only be the number of pkts that were not dropped (i.e. not
> > > > freed). This value will then be passed to the send function ensuring
> > > > no freed packets are present in the array that is to be processed.
> > > > If the token bucket updated between each individual packet being
> > > > processed then you could have the situation you describe, by
> > > > processing them as a batch with a single update beforehand you avoid  
> > it.  
> > >
> > > I haven't looked into the rte implementation yet so I can't tell if
> > > the decision to drop or not the packet is based solely on the
> > > timestamp which doesn't change during the batch processing.
> > >
> > > Assuming that is the case I agree with you that the cnt would have
> > > only the first valid entries to be sent out.
> > >
> > > My concern is when we are processing a batch with two packets.
> > > One big packet and another small one.  While processing the big
> > > packet, there is no enough tokens and the packet is marked as
> > > RED/DROPPED.  Then the loop gets the second small packet, but now some
> > > enough tokens were added/still available in the bucket which passes  
> > the small packet.  
> > >  
> > 
> > Apologies, I was looking at this the wrong way. I understand your point
> > now.
> > Let me think about this. And I'll post a rough solution here before
> > submitting.  
> 
> So there are 4 cases we can have with policing in terms of how packets have been processed.
> 
> (i) The token bucket is not exhausted and no packets are dropped.
> (ii) The token bucket is exhausted while processing and packets are dropped sequentially for the remainder of the burst.
> (iii) The token bucket is exhausted from previous call so all packets are dropped.
> (iv) Packets are not dropped sequentially, could occur with large packet being dropped before smaller packet being processed.
> 
> Currently we handle cases (i)(ii) & (iii). However case (iv) is not handled and will cause a segfault.
> 
> Egress_policer_run() returns a cnt of the number of packets remaining after the policer has run.
> The assumption here currently is that this cnt can be directly mapped to the entries in the rte_mbuff array i.e. 0 to cnt-1 will contain pkts and will not contain NULL entries or packets that have been freed.
> 
> To deal with case (iv) we will have to ensure that the entries in the rte_mbuff array from 0 to cnt-1 contains packets only. This will require the entries in the rte_mbuff array to be re-arranged if 0 to cnt-1 contains freed packets.
> 
> How to do this as optimally as possible becomes the next question.
> 
> A potential/rough solution would be as follows
> 
> We introduce 2 new index values in egress_policer_run().
> Index value 1 'f_dropped' represents the array entry position where the first packet dropped is located in the rte_mbuff array.
> Index position 2 'l_sent' is the entry location of the last packet to be sent in the array.
> 
> Something like the following could be done to check and handle non sequential drops.
> 
> /* Check if drops are not sequential */
> if (f_dropped < l_sent) {
> 
>     /* Must move any packets that are present after drop
>      * so that all packets are sequential in rte_mbuff array 
>      * pkts */
>     for (x = f_dropped; x <= l_sent; x++) {
>         for (y = x+1; y <= l_sent; y++ ) {
>             if (pkts[y] != NULL) {
>                 pkts[x] = pkts[y];
>                 pkts[y] = NULL;
>                 break;
>             }
>             continue;
>         }
>     }    
> }
> /* return cnt i.e. number of packets present in the array */
> return cnt;
> 
> 
> Would something like what was outlined above be ok? Do people have a more optimal solution?

That seems to not work when you have multiple drops in the middle or
maybe I missed something.
What about something like this:

    int cnt = 0;
    int i = 0;
    for (i = 0; i < pkt_cnt; i++) {
        pkt = pkts[i];
        /* Handle current packet */
        if (egress_policer_pkt_handle__(&policer->egress_meter, pkt,
                                        current_time) == GREEN) {
            if (cnt != i) { 
                pkts[cnt] = pkt;
            }
            cnt++;
        }
        else {
            rte_pktmbuf_free(pkt);
        }
    }   
    
    return cnt;

Thanks,
-- 
fbl




More information about the dev mailing list