[ovs-dev] [PATCH RFC 1/1] netdev-dpdk.c: Add ingress-policing functionality.

Stokes, Ian ian.stokes at intel.com
Fri Mar 4 09:33:11 UTC 2016


> -----Original Message-----
> From: Daniele Di Proietto [mailto:diproiettod at vmware.com]
> Sent: Friday, March 04, 2016 12:19 AM
> To: Stokes, Ian
> Cc: dev at openvswitch.org
> Subject: Re: [ovs-dev] [PATCH RFC 1/1] netdev-dpdk.c: Add ingress-
> policing functionality.
> 
> Hi Ian,
> 
> I haven't looked at every detail nor tested the patch yet, I just wanted
> to give some early feedback.
> 
> Thanks,
> 
> Daniele
Thanks for taking the time to give feedback on this Daniele, response inline.

Thanks
Ian
> 
> On 24/02/2016 06:17, "dev on behalf of Ian Stokes"
> <dev-bounces at openvswitch.org on behalf of ian.stokes at intel.com> wrote:
> 
> >This patch provides the modifications required in netdev-dpdk.c and
> >vswitch.xml to enable ingress policing for DPDK interfaces.
> >
> >This patch implements the necessary netdev functions to netdev-dpdk.c
> >as well as various helper functions required for ingress policing.
> >
> >The vswitch.xml has been modified to explain the expected parameters
> >and behaviour when using ingress policing.
> >
> >The INSTALL.DPDK.md guide has been modified to provide an example
> >configuration of ingress policing.
> >
> >Signed-off-by: Ian Stokes <ian.stokes at intel.com>
> >---
> > INSTALL.DPDK.md      |   19 +++++
> > NEWS                 |    1 +
> > lib/netdev-dpdk.c    |  187
> >++++++++++++++++++++++++++++++++++++++++++++++++--
> > vswitchd/vswitch.xml |   24 ++++---
> > 4 files changed, 217 insertions(+), 14 deletions(-)
> >
> >diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md index ca49106..c42cc8a
> >100644
> >--- a/INSTALL.DPDK.md
> >+++ b/INSTALL.DPDK.md
> >@@ -207,6 +207,25 @@ Using the DPDK with ovs-vswitchd:
> >    ./ovs-ofctl add-flow br0 in_port=2,action=output:1
> >    ```
> >
> >+8. Ingress Policing Example
> >+
> >+   Assuming you have a vhost-user port receiving traffic consisting of
> >+   packets of size 64 bytes, the following command would limit the
> >reception
> >+   rate of the port to ~1,000,000 packets per second:
> >+
> >+   `ovs-vsctl set interface vhost-user0 ingress_policing_rate=46000000
> >+    ingress_policing_burst=4096`
> >+
> >+   To examine the ingress policer configuration of the port:
> >+
> >+   `ovs-vsctl list interface vhost-user0`
> >+
> >+   To clear the ingress policer configuration from the port use the
> >following:
> >+
> >+   `ovs-vsctl set interface vhost-user0 ingress_policing_rate=0`
> >+
> >+   For more details regarding ingress-policer see the vswitch.xml.
> >+
> > Performance Tuning:
> > -------------------
> >
> >diff --git a/NEWS b/NEWS
> >index 57a250e..fb67f86 100644
> >--- a/NEWS
> >+++ b/NEWS
> >@@ -18,6 +18,7 @@ Post-v2.5.0
> >      * New appctl command 'dpif-netdev/pmd-rxq-show' to check the
> >port/rxq
> >        assignment.
> >      * Type of log messages from PMD threads changed from INFO to DBG.
> >+     * Add ingress policing functionality.
> >    - ovs-benchmark: This utility has been removed due to lack of use
> and
> >      bitrot.
> >    - ovs-appctl:
> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
> >71034a0..faf3583 100644
> >--- a/lib/netdev-dpdk.c
> >+++ b/lib/netdev-dpdk.c
> >@@ -53,6 +53,7 @@
> >
> > #include "rte_config.h"
> > #include "rte_mbuf.h"
> >+#include "rte_meter.h"
> > #include "rte_virtio_net.h"
> >
> > VLOG_DEFINE_THIS_MODULE(dpdk);
> >@@ -193,6 +194,11 @@ struct dpdk_ring {
> >     struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);  };
> >
> >+struct ingress_policer {
> >+    struct rte_meter_srtcm_params app_srtcm_params;
> >+    struct rte_meter_srtcm in_policer; };
> >+
> > struct netdev_dpdk {
> >     struct netdev up;
> >     int port_id;
> >@@ -231,6 +237,13 @@ struct netdev_dpdk {
> >     /* Identifier used to distinguish vhost devices from each other */
> >     char vhost_id[PATH_MAX];
> >
> >+    /* Ingress Policer */
> >+    rte_spinlock_t policer_lock;
> >+    struct ingress_policer *ingress_policer;
> >+
> 
> I would prefer not to have a lock at this level.
> 
> I think it would make more sense to make the ingress_policer pointer RCU
> protected and embed the spinlock into struct ingress_policer, to protect
> the token bucket.
> 
Sure I agree, I was modelling this on what we have currently in QoS just for a rough implementation.
I can take a look at using RCU instead. This could possibly be extended to the QoS use case also in the futue.
> 
> >+    uint32_t policer_rate;
> >+    uint32_t policer_burst;
> 
> Why do we need to keep these parameter here?
> Aren't these values kept in rte_meter_srtcm_params?

This is modelled from the netdev-linux.
They are used when creating an ingress policer specifically to see if the same configuration is being set twice.
My thinking here was that if the values were specified in kbits but the rte_meter uses bytes it was better to store the current configuration settings here rather than checking if the meter was configured, retrieving the values and then converting current meter values specified in bytes back to kbits. From your comments below I'll move to using kbits instead of bytes.

I can remove them if people do not think they are a good idea.

> 
> >+
> >     /* In dpdk_list. */
> >     struct ovs_list list_node OVS_GUARDED_BY(dpdk_mutex);  }; @@
> >-617,6 +630,11 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int
> >port_no,
> >     netdev_->requested_n_rxq = NR_QUEUE;
> >     netdev->real_n_txq = NR_QUEUE;
> >
> >+    rte_spinlock_init(&netdev->policer_lock);
> >+    netdev->ingress_policer = NULL;
> >+    netdev->policer_rate = 0;
> >+    netdev->policer_burst = 0;
> >+
> >     if (type == DPDK_DEV_ETH) {
> >         netdev_dpdk_alloc_txq(netdev, NR_QUEUE);
> >         err = dpdk_eth_dev_init(netdev); @@ -1012,12 +1030,14 @@
> >is_vhost_running(struct virtio_net *dev)
> >
> > static inline void
> > netdev_dpdk_vhost_update_rx_counters(struct netdev_stats *stats,
> >-                                     struct dp_packet **packets, int
> >count)
> >+                                     struct dp_packet **packets, int
> >count,
> >+									 int dropped)
> > {
> >     int i;
> >     struct dp_packet *packet;
> >
> >     stats->rx_packets += count;
> >+    stats->rx_dropped += dropped;
> >     for (i = 0; i < count; i++) {
> >         packet = packets[i];
> >
> >@@ -1039,6 +1059,48 @@ netdev_dpdk_vhost_update_rx_counters(struct
> >netdev_stats *stats,
> >     }
> > }
> >
> >+static inline int
> >+policer_pkt_handle__(struct rte_meter_srtcm *meter,
> >+                            struct rte_mbuf *pkt, uint64_t time) {
> >+    uint32_t pkt_len = rte_pktmbuf_pkt_len(pkt) - sizeof(struct
> >ether_hdr);
> >+
> >+    /* color input is not used for blind modes */
> >+    if (rte_meter_srtcm_color_blind_check(meter, time, pkt_len) !=
> >e_RTE_METER_GREEN) {
> >+        return 1;
> >+    }
> >+
> >+    return 0;
> >+}
> 
> Maybe you thought about that, but I think this function is identical to
> egress_policer_pkt_handle__().  I think it's worth renaming that and
> sharing the implementation.

Agreed, as per the cover note I'll make every effort to reduce code duplication for the two in a separate patch before submitting the ingress policer.
> 
> >+
> >+static int
> >+policer_run(struct netdev *netdev_, struct rte_mbuf **pkts,
> >+                        int pkt_cnt)
> >+{
> >+    int i = 0;
> >+    int cnt = 0;
> >+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> >+    struct rte_mbuf *pkt = NULL;
> >+    uint64_t current_time = rte_rdtsc();
> >+
> >+    for (i = 0; i < pkt_cnt; i++) {
> >+        pkt = pkts[i];
> >+        /* Handle current packet */
> >+        if (policer_pkt_handle__(&netdev->ingress_policer->in_policer,
> >pkt,
> >+                                        current_time) != 1) {
> >+            if (cnt != i) {
> >+                pkts[cnt] = pkt;
> >+            }
> >+            cnt++;
> >+        }
> >+        else {
> >+            rte_pktmbuf_free(pkt);
> >+        }
> >+    }
> >+
> >+    return cnt;
> >+}
> 
> Even this is very similar to egress_policer_run(), so it probably makes
> sense to share the implementation.
> 
> I guess this patch is based on a tree where there was no egress
> policing, so maybe you though about this already

Correct, functions will be shared in the final submission.
> 
> >+
> > /*
> >  * The receive path for the vhost port is the TX path out from guest.
> >  */
> >@@ -1052,6 +1114,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq
> *rxq_,
> >     struct virtio_net *virtio_dev = netdev_dpdk_get_virtio(vhost_dev);
> >     int qid = rxq_->queue_id;
> >     uint16_t nb_rx = 0;
> >+    uint16_t dropped = 0;
> >
> >     if (OVS_UNLIKELY(!is_vhost_running(virtio_dev))) {
> >         return EAGAIN;
> >@@ -1069,8 +1132,16 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq
> *rxq_,
> >         return EAGAIN;
> >     }
> >
> >+    rte_spinlock_lock(&vhost_dev->policer_lock);
> >+    if (vhost_dev->ingress_policer != NULL) {
> >+        dropped = nb_rx;
> >+        nb_rx = policer_run(netdev, (struct rte_mbuf **)packets,
> nb_rx);
> >+        dropped -= nb_rx;
> >+    }
> >+    rte_spinlock_unlock(&vhost_dev->policer_lock);
> >+
> >     rte_spinlock_lock(&vhost_dev->stats_lock);
> >-    netdev_dpdk_vhost_update_rx_counters(&vhost_dev->stats, packets,
> >nb_rx);
> >+    netdev_dpdk_vhost_update_rx_counters(&vhost_dev->stats, packets,
> >nb_rx, dropped);
> >     rte_spinlock_unlock(&vhost_dev->stats_lock);
> >
> >     *c = (int) nb_rx;
> >@@ -1085,6 +1156,7 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_,
> >struct dp_packet **packets,
> >     struct netdev *netdev = rx->up.netdev;
> >     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> >     int nb_rx;
> >+    int dropped = 0;
> >
> >     /* There is only one tx queue for this core.  Do not flush other
> >      * queues.
> >@@ -1102,6 +1174,21 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_,
> >struct dp_packet **packets,
> >         return EAGAIN;
> >     }
> 
> I would really prefer to avoid taking unnecessary locks in the fast
> path.  If we use RCU for the ingress_policer pointer, like we can avoid
> that.
> 
> I hope that this way we can minimize the overhead for devices with
> policing not enabled.
> 
Agreed, I'll investigate this.
> >
> >+    rte_spinlock_lock(&dev->policer_lock);
> >+    if (dev->ingress_policer != NULL) {
> >+        dropped = nb_rx;
> >+        nb_rx = policer_run(netdev, (struct rte_mbuf **) packets,
> nb_rx);
> >+        dropped -= nb_rx;
> >+    }
> >+    rte_spinlock_unlock(&dev->policer_lock);
> >+
> >+    /* Update stats to reflect dropped packets */
> >+    if (OVS_UNLIKELY(dropped)) {
> >+        rte_spinlock_lock(&dev->stats_lock);
> >+        dev->stats.rx_dropped += dropped;
> >+        rte_spinlock_unlock(&dev->stats_lock);
> >+    }
> >+
> >     *c = nb_rx;
> >
> >     return 0;
> >@@ -1502,12 +1589,12 @@ netdev_dpdk_vhost_get_stats(const struct netdev
> >*netdev,
> >     stats->tx_fifo_errors = UINT64_MAX;
> >     stats->tx_heartbeat_errors = UINT64_MAX;
> >     stats->tx_window_errors = UINT64_MAX;
> >-    stats->rx_dropped += UINT64_MAX;
> >
> >     rte_spinlock_lock(&dev->stats_lock);
> >     /* Supported Stats */
> >     stats->rx_packets += dev->stats.rx_packets;
> >     stats->tx_packets += dev->stats.tx_packets;
> >+    stats->rx_dropped += dev->stats.rx_dropped;
> >     stats->tx_dropped += dev->stats.tx_dropped;
> >     stats->multicast = dev->stats.multicast;
> >     stats->rx_bytes = dev->stats.rx_bytes; @@ -1545,11 +1632,12 @@
> >netdev_dpdk_get_stats(const struct netdev *netdev, struct netdev_stats
> >*stats)
> >
> >     rte_spinlock_lock(&dev->stats_lock);
> >     stats->tx_dropped = dev->stats.tx_dropped;
> >+    stats->rx_dropped = dev->stats.rx_dropped;
> >     rte_spinlock_unlock(&dev->stats_lock);
> >
> >     /* These are the available DPDK counters for packets not received
> >due to
> >      * local resource constraints in DPDK and NIC respectively. */
> >-    stats->rx_dropped = rte_stats.rx_nombuf + rte_stats.imissed;
> >+    stats->rx_dropped += rte_stats.rx_nombuf + rte_stats.imissed;
> >     stats->collisions = UINT64_MAX;
> >
> >     stats->rx_length_errors = UINT64_MAX; @@ -1617,6 +1705,95 @@
> >netdev_dpdk_get_features(const struct netdev *netdev_,  }
> >
> > static int
> >+netdev_dpdk_policer_construct__(struct netdev *netdev_, uint32_t rate,
> >+                                uint32_t burst) {
> >+    struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> >+    struct ingress_policer *policer;
> >+    int err = 0;
> >+
> >+    rte_spinlock_lock(&netdev->policer_lock);
> >+    policer = xmalloc(sizeof *policer);
> >+    netdev->ingress_policer = policer;
> >+
> >+    policer->app_srtcm_params.cir = rate;
> >+    policer->app_srtcm_params.cbs = burst;
> >+    policer->app_srtcm_params.ebs = 0;
> >+    err = rte_meter_srtcm_config(&policer->in_policer,
> >+                                    &policer->app_srtcm_params);
> >+    rte_spinlock_unlock(&netdev->policer_lock);
> >+
> >+    return err;
> >+}
> >+
> >+static int
> >+netdev_dpdk_policer_destruct__(struct netdev *netdev_) {
> >+	struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> >+	if (netdev->ingress_policer == NULL)	{
> >+		return 0;
> >+	}
> >+	else {
> >+        rte_spinlock_lock(&netdev->policer_lock);
> >+        free(netdev->ingress_policer);
> >+        netdev->ingress_policer = NULL;
> >+	    netdev->policer_rate = 0;
> >+        netdev->policer_burst = 0;
> >+        rte_spinlock_unlock(&netdev->policer_lock);
> >+    }
> >+	return 0;
> >+}
> >+
> >+static int
> >+netdev_dpdk_set_policing(struct netdev* netdev_, uint32_t
> policer_rate,
> >+                            uint32_t policer_burst) {
> >+	struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
> >+    int err = 0;
> >+
> >+    policer_burst = (!policer_rate ? 0      /* Force to 0 if no rate
> >specified. */
> >+                    : !policer_burst ? 4096 /* Default to 4096 bytes
> >+ if
> >0. */
> 
> netdev-linux has a much higher default value.
> Any reason you choose this? Just curious
> 
This was just the default value I've used in my tests personally, I'm happy to set it higher though to be the same as netdev-linux.
> >+                    : policer_burst);       /* Stick with user-
> specified
> >value. */
> >+
> >+    ovs_mutex_lock(&netdev->mutex);
> >+
> >+	if (netdev->policer_rate == policer_rate &&
> >+        netdev->policer_burst == policer_burst) {
> >+		/* Assume that settings haven't changed since we last set
> them. */
> >+	    ovs_mutex_unlock(&netdev->mutex);
> >+		return err;
> >+	}
> >+
> >+    /* Destroy any existing ingress policer for the device if one
> exists
> >+     * and the policer_rate and policer_burst are being set to 0
> >+     */
> >+    if ((netdev->ingress_policer != NULL) &&
> >+        (policer_rate == 0) && (policer_burst == 0)) {
> >+
> >+        /* User has set rate and burst to 0, destroy the policer */
> >+		err = netdev_dpdk_policer_destruct__(netdev_);
> >+	    ovs_mutex_unlock(&netdev->mutex);
> >+        return err;
> >+    }
> >+
> >+    /* Destroy existing policer */
> >+    if (netdev->ingress_policer != NULL) {
> >+        err = netdev_dpdk_policer_destruct__(netdev_);
> >+    }
> >+
> >+    /* Add an ingress policer */
> >+    err = netdev_dpdk_policer_construct__(netdev_, policer_rate,
> >policer_burst);
> >+
> >+    /* Update policer_rate and policer_burst for the netdev */
> >+    netdev->policer_rate = policer_rate;
> >+    netdev->policer_burst = policer_burst;
> >+
> >+    ovs_mutex_unlock(&netdev->mutex);
> >+
> >+	return err;
> >+}
> >+
> >+static int
> > netdev_dpdk_get_ifindex(const struct netdev *netdev)  {
> >     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); @@ -2193,7
> >+2370,7 @@ unlock_dpdk:
> >     GET_FEATURES,                                             \
> >     NULL,                       /* set_advertisements */      \
> >                                                               \
> >-    NULL,                       /* set_policing */            \
> >+    netdev_dpdk_set_policing,   /* set_policing */            \
> >     NULL,                       /* get_qos_types */           \
> >     NULL,                       /* get_qos_capabilities */    \
> >     NULL,                       /* get_qos */                 \
> >diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index
> >c2ec914..4ef9de9 100644
> >--- a/vswitchd/vswitch.xml
> >+++ b/vswitchd/vswitch.xml
> >@@ -2384,8 +2384,8 @@
> >         table="Queue"/> tables).
> >       </p>
> >       <p>
> >-        Policing is currently implemented only on Linux.  The Linux
> >-        implementation uses a simple ``token bucket'' approach:
> >+        Policing is implemented on Linux and DPDK interfaces. Both
> >+        implementations use a simple ``token bucket'' approach:
> >       </p>
> >       <ul>
> >         <li>
> >@@ -2422,17 +2422,23 @@
> >       </p>
> >       <column name="ingress_policing_rate">
> >         <p>
> 
> I don't see why we need to use a different unit with DPDK devices.
> 
> Also, the comments in netdev-provider.h and netdev.c for set_policing()
> clearly specify kbits, so those are not honored by this implementation.
> 
Sure, originally I was setting the values in kbits from the command line then handling the conversion to bytes under the hood for the rte_meter. I changed it back for the RFC to bytes because I realized we now had an egress policer specified in bytes and an ingress policer specified in kbits. My thinking was that this could create confusion for end users so for the RFC I left it as bytes.

I think you're correct though, it should be kbits to keep with netdev.c and netdev-provider.h. This will also remove the changes for the vswitch.xml.
> 
> >-          Maximum rate for data received on this interface, in kbps.
> >Data
> >-          received faster than this rate is dropped.  Set to
> ><code>0</code>
> >-          (the default) to disable policing.
> >+          Maximum rate for data received on this interface. The metric
> >used
> >+          for Linux and DPDK interfaces differ. Linux interfaces
> >+ expect
> >the
> >+          value to be specified in kbps. DPDK interfaces expect the
> >value to
> >+          be specified in bytes per second. Data received faster than
> >this
> >+          rate is dropped. Set to <code>0</code>(the default) to
> disable
> >+          policing for both interface types.
> >         </p>
> >       </column>
> >
> >       <column name="ingress_policing_burst">
> >-        <p>Maximum burst size for data received on this interface, in
> >kb.  The
> >-        default burst size if set to <code>0</code> is 1000 kb.  This
> >value
> >-        has no effect if <ref column="ingress_policing_rate"/>
> >-        is <code>0</code>.</p>
> >+        <p>Maximum burst size for data received on this interface. The
> >metric
> >+        used for Linux and DPDK interfaces differ. Linux interfaces
> >expect a
> >+        value specified in kb. DPDK interfaces expect a value
> >+ specified
> >in
> >+        bytes. The default burst size if set to <code>0</code> is 1000
> >kb for
> >+        Linux interfaces and 4096 bytes for DPDK. This value has no
> >effect if
> >+        <ref column="ingress_policing_rate"/> is <code>0</code> for
> >either
> >+        interface type.</p>
> >         <p>
> >           Specifying a larger burst size lets the algorithm be more
> >forgiving,
> >           which is important for protocols like TCP that react
> >severely to
> >--
> >1.7.4.1
> >
> >_______________________________________________
> >dev mailing list
> >dev at openvswitch.org
> >http://openvswitch.org/mailman/listinfo/dev




More information about the dev mailing list