[ovs-dev] [PATCH v4] Upcall/Slowpath rate-limiter for OVS

Aaron Conole aconole at redhat.com
Thu Jun 7 13:36:08 UTC 2018


Manohar Krishnappa Chidambaraswamy
<manohar.krishnappa.chidambaraswamy at ericsson.com> writes:

> Jan,
>
> Have addressed your comments/suggestions. Please take a look and let me
> know if you have any more comments.
>
> Signed-off-by: Manohar K C
> <manohar.krishnappa.chidambaraswamy at ericsson.com>
> CC: Jan Scheurich <jan.scheurich at ericsson.com>
> ---

Hi Jan, and Manohar,

Have you considered making this token bucket per-port instead of
per-pmd?  As I read it, a greedy port can exhaust all the tokens from a
particular PMD, possibly leading to unfair performance for that PMD
thread.  Am I just being overly paranoid?
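
Just to make that concrete, here is a rough sketch of the kind of thing
I had in mind -- completely untested, and the structure/function names
are made up.  The only real pieces are OVS's struct token_bucket,
token_bucket_init() and token_bucket_withdraw(), which this patch
already uses.  One bucket per ingress port, kept in a PMD-local hmap so
it still needs no locking:

    #include "hash.h"
    #include "openvswitch/hmap.h"
    #include "token-bucket.h"
    #include "util.h"

    /* Hypothetical per-port upcall policer; one hmap per PMD thread. */
    struct upcall_port_policer {
        struct hmap_node node;   /* In the PMD-local 'policers' hmap. */
        uint32_t port_no;        /* Ingress port number. */
        struct token_bucket tb;  /* Per-port rate/burst state. */
    };

    /* Finds (lazily creating) the policer for 'port_no' and tries to
     * withdraw one token for an upcall.  Returns false if the upcall
     * should be dropped. */
    static bool
    upcall_policer_admit(struct hmap *policers, uint32_t port_no,
                         uint32_t rate, uint32_t burst)
    {
        struct upcall_port_policer *p;
        uint32_t hash = hash_int(port_no, 0);

        HMAP_FOR_EACH_WITH_HASH (p, node, hash, policers) {
            if (p->port_no == port_no) {
                return token_bucket_withdraw(&p->tb, 1);
            }
        }

        p = xmalloc(sizeof *p);
        p->port_no = port_no;
        token_bucket_init(&p->tb, rate, burst);
        hmap_insert(policers, &p->node, hash);
        return token_bucket_withdraw(&p->tb, 1);
    }

That costs a hash lookup per upcall, but upcalls are already the slow
path, so I wouldn't expect it to matter.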

How are you stress testing this?  Is there some framework you're using?

I'm interested in possibly implementing a similar rate-limiter from the
kernel side (for parity), but want to see if it's really a problem
with the netlink datapath first.
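
To answer that I was planning to simply watch the kernel datapath's
lookup counters while a traffic generator replays a burst of new flows,
along these lines (illustrative only; the counter values are
placeholders):

    # Sample the datapath lookup counters once a second during a flow
    # burst: "missed" is the upcall rate, and "lost" counts upcalls
    # dropped before reaching userspace.
    $ while sleep 1; do ovs-dpctl show | grep lookups; done
      lookups: hit:... missed:... lost:...

If "lost" stays at zero even under a heavy burst of new flows, the
netlink datapath may not need the same treatment.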

-Aaron

> v3 : v2 rebased to master and adapted to dpif-netdev-perf counters.
>      https://patchwork.ozlabs.org/patch/909676/
>
> v2 : v1 rebased to master.
>      https://patchwork.ozlabs.org/patch/860687/
>
> v1 : Initial patch for upcall rate-limiter based on token-bucket.
>      https://patchwork.ozlabs.org/patch/836737/
>
>  Documentation/howto/dpdk.rst |  53 ++++++++++++++++++
>  lib/dpif-netdev-perf.h       |   1 +
>  lib/dpif-netdev.c            | 130 ++++++++++++++++++++++++++++++++++++++++---
>  lib/dpif.h                   |   1 +
>  vswitchd/vswitch.xml         |  42 ++++++++++++++
>  5 files changed, 220 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
> index 380181d..b717804 100644
> --- a/Documentation/howto/dpdk.rst
> +++ b/Documentation/howto/dpdk.rst
> @@ -358,6 +358,59 @@ devices to bridge ``br0``. Once complete, follow the below steps:
>  
>         $ cat /proc/interrupts | grep virtio
>  
> +Upcall rate limiting
> +--------------------
> +In OVS, when an incoming packet does not match any flow in the flow
> +table maintained in the fast-path (either in userspace or in the
> +kernel), the packet is processed in the slow-path via a mechanism
> +known as an upcall.
> +In OVS-DPDK, both fast-path and slow-path execute in the context of a
> +common poll-mode-driver (PMD) thread, without any partitioning of CPU
> +cycles between the two.
> +
> +When there is a burst of new flows coming into the data-path, packets
> +are punted to the slow-path in the order they are received, and the
> +PMD is
> +busy for the duration of the upcall. Slow-path processing of a packet
> +consumes 100-200 times the cycles of fast-path handling. As a result,
> +the forwarding performance of a PMD degrades significantly during an
> +upcall burst. If the PMD is already highly loaded, it becomes temporarily
> +overloaded and its rx queues start filling up. If the upcall burst is long
> +enough, packets will be dropped when rx queues are full. This happens even
> +if the new flows are unexpected and the slow-path decides to drop the
> +packets.
> +
> +It is likely that most of the packets dropped due to rx queue overflow
> +belong to established flows that should have been processed by the
> +fast-path. Hence, the current OVS-DPDK architecture favors the handling
> +of new flows over the forwarding of established flows. This is generally
> +a sub-optimal approach.
> +
> +Without a limit to the rate of upcalls, OVS-DPDK is vulnerable to DoS
> +attacks. Even sporadic bursts of, e.g., unexpected multicast packets
> +have been shown to cause such packet drops.
> +
> +ovs-vsctl can be used to enable upcall rate limiting and to configure
> +its parameters. There are two configurable parameters, ``upcall-rate``
> +and ``upcall-burst``, which take effect when the configuration
> +parameter ``upcall-rl`` is set to ``true``.
> +
> +Upcall rate-limiting is applied independently per PMD. The configured
> +values for ``upcall-rate`` and ``upcall-burst`` are used independently
> +in each PMD (and non-PMD) thread that executes upcalls.
> +
> +The upcall rate should be set using ``upcall-rate`` in upcalls per
> +second. If not configured, a default value of 500 upcalls per second
> +is used. For example::
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:upcall-rate=2000
> +
> +The upcall burst size should be set using ``upcall-burst`` in upcalls.
> +If not configured, a default value of 500 upcalls is used. For example::
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:upcall-burst=2000
> +
> +The upcall rate-limit feature must be enabled globally using
> +``upcall-rl``. For example::
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:upcall-rl=true
> +
>  Further Reading
>  ---------------
>  
> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
> index 5993c25..e4b945a 100644
> --- a/lib/dpif-netdev-perf.h
> +++ b/lib/dpif-netdev-perf.h
> @@ -55,6 +55,7 @@ enum pmd_stat_type {
>                               * number of packet passes through the datapath
>                               * pipeline and should not be overlapping with each
>                               * other. */
> +    PMD_STAT_RL_DROP,       /* Packets dropped due to upcall rate-limit. */
>      PMD_STAT_MASKED_LOOKUP, /* Number of subtable lookups for flow table
>                                 hits. Each MASKED_HIT hit will have >= 1
>                                 MASKED_LOOKUP(s). */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index be31fd0..dba8629 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -101,6 +101,11 @@ static struct shash dp_netdevs OVS_GUARDED_BY(dp_netdev_mutex)
>  
>  static struct vlog_rate_limit upcall_rl = VLOG_RATE_LIMIT_INIT(600, 600);
>  
> +/* These default values may be tuned based on upcall performance. */
> +#define UPCALL_RATELIMIT_DEFAULT false /* Disabled by default */
> +#define UPCALL_RATE_DEFAULT      500  /* upcalls-per-sec */
> +#define UPCALL_BURST_DEFAULT     500  /* maximum burst of upcalls */
> +
>  #define DP_NETDEV_CS_SUPPORTED_MASK (CS_NEW | CS_ESTABLISHED | CS_RELATED \
>                                       | CS_INVALID | CS_REPLY_DIR | CS_TRACKED \
>                                       | CS_SRC_NAT | CS_DST_NAT)
> @@ -316,6 +321,12 @@ struct dp_netdev {
>      uint64_t last_tnl_conf_seq;
>  
>      struct conntrack conntrack;
> +
> +    /* Upcall rate-limiter parameters. */
> +    atomic_bool upcall_ratelimit;
> +    atomic_uint32_t upcall_rate;
> +    atomic_uint32_t upcall_burst;
> +    bool upcall_params_changed;
>  };
>  
>  static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -615,6 +626,11 @@ struct dp_netdev_pmd_thread {
>      /* Keep track of detailed PMD performance statistics. */
>      struct pmd_perf_stats perf_stats;
>  
> +    /* Policer to rate limit upcalls. */
> +    struct token_bucket upcall_tb;  /* Cache of rate and burst parameters. */
> +    bool upcall_ratelimit;          /* Cache of global enable/disable
> +                                       parameter. */
> +
>      /* Set to true if the pmd thread needs to be reloaded. */
>      bool need_reload;
>  };
> @@ -856,12 +872,13 @@ pmd_info_show_stats(struct ds *reply,
>                    "\tavg. subtable lookups per megaflow hit: %.02f\n"
>                    "\tmiss with success upcall: %"PRIu64"\n"
>                    "\tmiss with failed upcall: %"PRIu64"\n"
> +                  "\tdrops due to upcall ratelimiter: %"PRIu64"\n"
>                    "\tavg. packets per output batch: %.02f\n",
>                    total_packets, stats[PMD_STAT_RECIRC],
>                    passes_per_pkt, stats[PMD_STAT_EXACT_HIT],
>                    stats[PMD_STAT_MASKED_HIT], lookups_per_hit,
>                    stats[PMD_STAT_MISS], stats[PMD_STAT_LOST],
> -                  packets_per_batch);
> +                  stats[PMD_STAT_RL_DROP], packets_per_batch);
>  
>      if (total_cycles == 0) {
>          return;
> @@ -1479,6 +1496,7 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
>          stats->n_hit += pmd_stats[PMD_STAT_MASKED_HIT];
>          stats->n_missed += pmd_stats[PMD_STAT_MISS];
>          stats->n_lost += pmd_stats[PMD_STAT_LOST];
> +        stats->n_rl_dropped += pmd_stats[PMD_STAT_RL_DROP];
>      }
>      stats->n_masks = UINT32_MAX;
>      stats->n_mask_hit = UINT64_MAX;
> @@ -1489,11 +1507,24 @@ dpif_netdev_get_stats(const struct dpif *dpif, struct dpif_dp_stats *stats)
>  static void
>  dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd)
>  {
> +    uint32_t rate, burst;
> +
>      if (pmd->core_id == NON_PMD_CORE_ID) {
>          ovs_mutex_lock(&pmd->dp->non_pmd_mutex);
>          ovs_mutex_lock(&pmd->port_mutex);
>          pmd_load_cached_ports(pmd);
>          ovs_mutex_unlock(&pmd->port_mutex);
> +
> +        /* Reconfig the upcall policer if params have changed. */
> +        atomic_read_relaxed(&pmd->dp->upcall_rate, &rate);
> +        atomic_read_relaxed(&pmd->dp->upcall_burst, &burst);
> +        if ((rate != pmd->upcall_tb.rate) ||
> +            (burst != pmd->upcall_tb.burst)) {
> +            token_bucket_init(&pmd->upcall_tb, rate, burst);
> +        }
> +        atomic_read_relaxed(&pmd->dp->upcall_ratelimit,
> +                            &pmd->upcall_ratelimit);
> +
>          ovs_mutex_unlock(&pmd->dp->non_pmd_mutex);
>          return;
>      }
> @@ -2987,6 +3018,9 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>                          DEFAULT_EM_FLOW_INSERT_INV_PROB);
>      uint32_t insert_min, cur_min;
>      uint32_t tx_flush_interval, cur_tx_flush_interval;
> +    uint32_t rate, cur_rate;
> +    uint32_t burst, cur_burst;
> +    bool ratelimit, cur_ratelimit;
>  
>      tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
>                                       DEFAULT_TX_FLUSH_INTERVAL);
> @@ -3021,6 +3055,44 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>          }
>      }
>  
> +    /* Handle upcall policer params. */
> +    ratelimit = smap_get_bool(other_config, "upcall-rl",
> +                              UPCALL_RATELIMIT_DEFAULT);
> +    rate = smap_get_int(other_config, "upcall-rate",
> +                        UPCALL_RATE_DEFAULT);
> +    burst = smap_get_int(other_config, "upcall-burst",
> +                         UPCALL_BURST_DEFAULT);
> +
> +    atomic_read_relaxed(&dp->upcall_ratelimit, &cur_ratelimit);
> +    atomic_read_relaxed(&dp->upcall_rate, &cur_rate);
> +    atomic_read_relaxed(&dp->upcall_burst, &cur_burst);
> +
> +    if ((rate != cur_rate) || (burst != cur_burst)) {
> +        VLOG_INFO("Upcall ratelimiter params changed : Old - rate=%d burst=%d "
> +                  ": New - rate=%d burst=%d\n", cur_rate, cur_burst,
> +                  rate, burst);
> +
> +        atomic_store_relaxed(&dp->upcall_rate, rate);
> +        atomic_store_relaxed(&dp->upcall_burst, burst);
> +
> +        dp->upcall_params_changed = true;
> +    }
> +
> +    if (ratelimit != cur_ratelimit) {
> +        VLOG_INFO("Upcall ratelimiter changed to %s\n",
> +                  (ratelimit ? "Enabled" : "Disabled"));
> +        VLOG_INFO("Upcall ratelimiter params : rate=%d burst=%d\n",
> +                  rate, burst);
> +
> +        atomic_store_relaxed(&dp->upcall_ratelimit, ratelimit);
> +
> +        dp->upcall_params_changed = true;
> +    }
> +
> +    if (dp->upcall_params_changed) {
> +        dp_netdev_request_reconfigure(dp);
> +    }
> +
>      return 0;
>  }
>  
> @@ -3653,6 +3725,7 @@ reconfigure_pmd_threads(struct dp_netdev *dp)
>      struct hmapx_node *node;
>      bool changed = false;
>      bool need_to_adjust_static_tx_qids = false;
> +    bool need_to_reinit_upcall_ratelimiter = false;
>  
>      /* The pmd threads should be started only if there's a pmd port in the
>       * datapath.  If the user didn't provide any "pmd-cpu-mask", we start
> @@ -3674,6 +3747,13 @@ reconfigure_pmd_threads(struct dp_netdev *dp)
>          need_to_adjust_static_tx_qids = true;
>      }
>  
> +    /* Check if upcall ratelimiter parameters have changed. */
> +    if (dp->upcall_params_changed) {
> +        need_to_reinit_upcall_ratelimiter = true;
> +
> +        dp->upcall_params_changed = false;
> +    }
> +
>      /* Check for unwanted pmd threads */
>      CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
>          if (pmd->core_id == NON_PMD_CORE_ID) {
> @@ -3682,7 +3762,8 @@ reconfigure_pmd_threads(struct dp_netdev *dp)
>          if (!ovs_numa_dump_contains_core(pmd_cores, pmd->numa_id,
>                                                      pmd->core_id)) {
>              hmapx_add(&to_delete, pmd);
> -        } else if (need_to_adjust_static_tx_qids) {
> +        } else if (need_to_adjust_static_tx_qids ||
> +                   need_to_reinit_upcall_ratelimiter) {
>              pmd->need_reload = true;
>          }
>      }
> @@ -4106,6 +4187,7 @@ pmd_thread_main(void *f_)
>      int poll_cnt;
>      int i;
>      int process_packets = 0;
> +    uint32_t rate, burst;
>  
>      poll_list = NULL;
>  
> @@ -4135,6 +4217,15 @@ reload:
>          lc = UINT_MAX;
>      }
>  
> +    /* Reconfig upcall policer token bucket with configured params. */
> +    atomic_read_relaxed(&pmd->dp->upcall_rate, &rate);
> +    atomic_read_relaxed(&pmd->dp->upcall_burst, &burst);
> +    if ((rate != pmd->upcall_tb.rate) ||
> +        (burst != pmd->upcall_tb.burst)) {
> +        token_bucket_init(&pmd->upcall_tb, rate, burst);
> +    }
> +    atomic_read_relaxed(&pmd->dp->upcall_ratelimit, &pmd->upcall_ratelimit);
> +
>      pmd->intrvl_tsc_prev = 0;
>      atomic_store_relaxed(&pmd->intrvl_cycles, 0);
>      cycles_counter_update(s);
> @@ -4596,6 +4687,8 @@ static void
>  dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>                          unsigned core_id, int numa_id)
>  {
> +    uint32_t rate, burst;
> +
>      pmd->dp = dp;
>      pmd->core_id = core_id;
>      pmd->numa_id = numa_id;
> @@ -4627,6 +4720,13 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>          emc_cache_init(&pmd->flow_cache);
>          pmd_alloc_static_tx_qid(pmd);
>      }
> +
> +    /* Initialize upcall policer token bucket with configured params. */
> +    atomic_read_relaxed(&dp->upcall_rate, &rate);
> +    atomic_read_relaxed(&dp->upcall_burst, &burst);
> +    token_bucket_init(&pmd->upcall_tb, rate, burst);
> +    atomic_read_relaxed(&dp->upcall_ratelimit, &pmd->upcall_ratelimit);
> +
>      pmd_perf_stats_init(&pmd->perf_stats);
>      cmap_insert(&dp->poll_threads, CONST_CAST(struct cmap_node *, &pmd->node),
>                  hash_int(core_id, 0));
> @@ -5145,6 +5245,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>      struct dpcls_rule *rules[PKT_ARRAY_SIZE];
>      struct dp_netdev *dp = pmd->dp;
>      int upcall_ok_cnt = 0, upcall_fail_cnt = 0;
> +    int upcall_rl_drop_cnt = 0;
>      int lookup_cnt = 0, add_lookup_cnt;
>      bool any_miss;
>  
> @@ -5185,13 +5286,26 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                  continue;
>              }
>  
> -            int error = handle_packet_upcall(pmd, packet, &keys[i],
> -                                             &actions, &put_actions);
> +            /*
> +             * Grab a token from the upcall policer to enter slowpath. If token
> +             * is not available, drop and account the packet. This is to
> +             * rate-limit packets getting into slowpath.
> +             */
> +            if (pmd->upcall_ratelimit &&
> +                !token_bucket_withdraw(&pmd->upcall_tb, 1)) {
> +                dp_packet_delete(packet);
>  
> -            if (OVS_UNLIKELY(error)) {
> -                upcall_fail_cnt++;
> +                upcall_rl_drop_cnt++;
>              } else {
> -                upcall_ok_cnt++;
> +                int error = handle_packet_upcall(pmd, packet, &keys[i],
> +                                                 &actions, &put_actions);
> +
> +                if (OVS_UNLIKELY(error)) {
> +                    upcall_fail_cnt++;
> +                } else {
> +                    upcall_ok_cnt++;
> +                }
>              }
>          }
>  
> @@ -5228,6 +5342,8 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                              upcall_ok_cnt);
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_LOST,
>                              upcall_fail_cnt);
> +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_RL_DROP,
> +                            upcall_rl_drop_cnt);
>  }
>  
>  /* Packets enter the datapath from a port (or from recirculation) here.
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 94f89ec..708e798 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -438,6 +438,7 @@ struct dpif_dp_stats {
>      uint64_t n_hit;             /* Number of flow table matches. */
>      uint64_t n_missed;          /* Number of flow table misses. */
>      uint64_t n_lost;            /* Number of misses not sent to userspace. */
> +    uint64_t n_rl_dropped;      /* Number of drops due to upcall rate-limit. */
>      uint64_t n_flows;           /* Number of flows present. */
>      uint64_t n_mask_hit;        /* Number of mega flow masks visited for
>                                     flow table matches. */
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 94a64dd..858ba4e 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -428,6 +428,48 @@
>          </p>
>        </column>
>  
> +      <column name="other_config" key="upcall-rl"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          Set this value to <code>true</code> to enable upcall rate-limiting.
> +          The upcall parameters, such as rate and burst, are ignored if
> +          this is not set.
> +        </p>
> +        <p>
> +          The default value is <code>false</code>, which leaves upcall
> +          rate-limiting disabled.
> +        </p>
> +      </column>
> +
> +      <column name="other_config" key="upcall-rate"
> +        type='{"type": "integer", "minInteger": 0, "maxInteger": 4294967295}'>
> +        <p>
> +          When upcall rate limiting is enabled, this specifies the
> +          sustained rate of upcalls, in upcalls per second, that is
> +          admitted per packet-polling thread (PMD or non-PMD) in the
> +          netdev datapath. The default value is 500 upcalls per second.
> +
> +          Setting both upcall-rate and upcall-burst to <code>0</code> disables
> +          upcalls in the netdev datapath. This can be used for debugging
> +          purposes.
> +        </p>
> +      </column>
> +
> +      <column name="other_config" key="upcall-burst"
> +        type='{"type": "integer", "minInteger": 0, "maxInteger": 4294967295}'>
> +        <p>
> +          When upcall rate limiting is enabled, this specifies the maximum
> +          burst of upcalls that is admitted per packet-polling thread
> +          (PMD or non-PMD) in the netdev datapath, independent of the
> +          packet arrival rate. The default value is a burst of 500 upcalls.
> +
> +          Setting both upcall-rate and upcall-burst to <code>0</code> disables
> +          upcalls in the netdev datapath. This can be used for debugging
> +          purposes.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="vlan-limit"
>                type='{"type": "integer", "minInteger": 0}'>
>          <p>

