[ovs-dev] [RFC] dpif-netdev: only poll enabled vhost queues

Kevin Traynor ktraynor at redhat.com
Tue Apr 9 14:49:12 UTC 2019


On 09/04/2019 12:41, Ilya Maximets wrote:
> On 09.04.2019 11:34, Ilya Maximets wrote:
>> On 08.04.2019 16:44, David Marchand wrote:
>>> Hello Ilya,
>>>
>>> On Mon, Apr 8, 2019 at 10:27 AM Ilya Maximets <i.maximets at samsung.com> wrote:
>>>
>>>     On 04.04.2019 22:49, David Marchand wrote:
>>>     > We tried to lower the number of rebalances but we don't have a
>>>     > satisfying solution at the moment, so this patch rebalances on each
>>>     > update.
>>>
>>>     Hi.
>>>
>>>     Triggering the reconfiguration on each vring state change is a bad thing.
>>>     This could be abused by the guest to break the host networking by endlessly
>>>     disabling/enabling queues. Each reconfiguration removes ports from the
>>>     PMD port caches and reloads the PMD threads. On rescheduling, all the ports
>>>
>>>
>>> I'd say the reconfiguration itself is not wanted here.
>>> Only rebalancing the queues would be enough.
>>
>> As you correctly mentioned in the commit message, there will be no real port
>> configuration changes.
>>
>> By 'reconfiguration' I mean datapath reconfiguration, and 'rebalancing'
>> is one of its stages.
>>
>>>
>>>
>>>     could be moved to different PMD threads, resulting in EMC/SMC/dpcls
>>>     invalidation and subsequent upcalls/packet reordering.
>>>
>>>
>>> I agree that rebalancing does trigger EMC/SMC/dpcls invalidation when moving queues.
>>> However, EMC/SMC/dpcls are per-PMD, so where would we get packet reordering?
>>
>> An Rx queue could be scheduled to a different PMD thread --> new packets will go to a
>> different Tx queue. It's unlikely, but which packets get sent first then depends on the
>> device/driver. The main issue here is that this happens to other ports, not only to the
>> port we're trying to reconfigure.
>>
>>>
>>>
>>>
>>>     The same issues were discussed previously while looking at the possibility of
>>>     vhost-pmd integration (with some test results):
>>>     https://mail.openvswitch.org/pipermail/ovs-dev/2016-August/320430.html
>>>
>>>
>>> Thanks for the link, I will test this.
>>>
>>>
>>>
>>>     One more reference:
>>>     7f5f2bd0ce43 ("netdev-dpdk: Avoid reconfiguration on reconnection of same vhost device.")
>>>
>>>
>>> Yes, I saw this patch.
>>> Are we safe against guest drivers/applications that play with VIRTIO_NET_F_MQ, toggling it continuously?
>>
>> Good point. I didn't test that, but it looks like we're not safe here.
>> The kernel and DPDK drivers have F_MQ enabled by default, so it would require
>> some changes/explicit disabling. But I agree that this could cause
>> issues if someone did that.
>>
>> We could probably handle this using a 'max seen qp_num' approach, but I'm
>> not a fan of it. We need to figure out how to do this correctly.
>>
>> In general, I think we should not allow cases where the guest is able
>> to manipulate the host configuration.
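
A minimal standalone sketch of the 'max seen qp_num' idea mentioned above
(hypothetical helper and names, not code from any posted patch; the
new_device() hunk later in this thread takes a related, grow-only form):
---
#include <stdint.h>

/* Never let the requested queue-pair count shrink below the largest value
 * the guest has ever announced, so that toggling VIRTIO_NET_F_MQ back and
 * forth cannot force repeated datapath reconfigurations. */
static inline uint32_t
clamp_requested_qp_num(uint32_t announced_qp_num, uint32_t max_seen_qp_num)
{
    return announced_qp_num > max_seen_qp_num ? announced_qp_num
                                              : max_seen_qp_num;
}
---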
>>
>>>
>>>
>>>
>>>
>>>     Anyway, do you have some numbers on how much time a PMD thread spends polling
>>>     disabled queues? What performance improvement are you able to achieve by
>>>     avoiding that?
>>>
>>>
>>> In a simple PVP setup of mine:
>>> 1c/2t poll two physical ports.
>>> 1c/2t poll four vhost ports with 16 queues each.
>>>   Only one queue is enabled on each virtio device attached by the guest.
>>>   The first two virtio devices are bound to the virtio kmod.
>>>   The last two virtio devices are bound to vfio-pci and used to forward incoming traffic with testpmd.
>>>
>>> The zero-loss forwarding rate goes from 5.2Mpps (polling all 64 vhost queues) to 6.2Mpps (polling only the 4 enabled vhost queues).
>>
>> That's interesting. However, this doesn't look like a realistic scenario.
>> In practice you'll need many more PMD threads to handle so many queues.
>> If you add more threads, a zero-loss test could show even worse results
>> if one of the idle VMs periodically changes its number of queues. Periodic
>> latency spikes will cause queue overruns and subsequent packet drops on
>> hot Rx queues. This could be partially solved by allowing n_rxq to grow only.
>> However, I'd be happy with a different solution that does not hide the number
>> of queues from the datapath.
> 
> What do you think about something like this (not even compile tested):
> ---
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index bd9718824..0cf492a3b 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -591,6 +591,8 @@ struct polled_queue {
>      struct dp_netdev_rxq *rxq;
>      odp_port_t port_no;
>      bool emc_enabled;
> +    bool enabled;
> +    uint64_t change_seq;
>  };
>  
>  /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
> @@ -5375,6 +5377,9 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
>          poll_list[i].rxq = poll->rxq;
>          poll_list[i].port_no = poll->rxq->port->port_no;
>          poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
> +        poll_list[i].enabled = netdev_rxq_enabled(poll->rxq);
> +        poll_list[i].change_seq =
> +                     netdev_get_change_seq(poll->rxq->port->netdev);
>          i++;
>      }
>  
> @@ -5440,6 +5445,10 @@ reload:
>  
>          for (i = 0; i < poll_cnt; i++) {
>  
> +            if (!poll_list[i].enabled) {
> +                continue;
> +            }
> +

I had a similar thought that something like this might be a good
compromise, but I wonder about the possible delays in polling newly
enabled queues.

The state could just be checked directly here,

+            if (!netdev_rxq_enabled(poll_list[i].rxq)) {
+                continue;
+            }

or, if that is too costly, only check the disabled queues to catch the
disable->enable case early. DPDK will handle the enable->disable case, so we
could avoid the check here and do a slower update when lc > 1024.

+            if (!poll_list[i].enabled) {
+                if (OVS_LIKELY(!netdev_rxq_enabled(poll_list[i].rxq))) {
+                    continue;
+                } else {
+                    poll_list[i].enabled = true;
+                }
+            }

Of course, if these checks cost almost as much as letting DPDK do them in
the first place, then they are not worth it.
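
A self-contained sketch of the pattern discussed above, for illustration only
(made-up names, not OVS code): poll only queues whose cached 'enabled' flag is
set, and resynchronize those flags on the slow path roughly every 1024
iterations:
---
#include <stdbool.h>

struct queue {
    int id;
    bool enabled;              /* Cached state, cheap to test every loop. */
};

/* Stand-ins for the real, more expensive operations. */
bool backend_queue_enabled(int queue_id);
int poll_queue(int queue_id);

void
poll_loop(struct queue *queues, int n_queues)
{
    unsigned int lc = 0;

    for (;;) {
        for (int i = 0; i < n_queues; i++) {
            if (!queues[i].enabled) {
                continue;               /* Skip disabled queues cheaply. */
            }
            poll_queue(queues[i].id);
        }

        if (lc++ > 1024) {
            lc = 0;
            /* Slow path: resynchronize the cached flags. */
            for (int i = 0; i < n_queues; i++) {
                queues[i].enabled = backend_queue_enabled(queues[i].id);
            }
        }
    }
}
---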

>              if (poll_list[i].emc_enabled) {
>                  atomic_read_relaxed(&pmd->dp->emc_insert_min,
>                                      &pmd->ctx.emc_insert_min);
> @@ -5476,6 +5485,15 @@ reload:
>              if (reload) {
>                  break;
>              }
> +
> +            for (i = 0; i < poll_cnt; i++) {
> +                uint64_t current_seq =
> +                         netdev_get_change_seq(poll_list[i].rxq->port->netdev);
> +                if (poll_list[i].change_seq != current_seq) {
> +                    poll_list[i].change_seq = current_seq;
> +                    poll_list[i].enabled = netdev_rxq_enabled(poll_list[i].rxq);
> +                }
> +            }
>          }
>          pmd_perf_end_iteration(s, rx_packets, tx_packets,
>                                 pmd_perf_metrics_enabled(pmd));
> ---
> + replacing your 'netdev_request_reconfigure(&dev->up)' inside 'vring_state_changed'
> with 'netdev_change_seq_changed(&dev->up)'?
> 
> This should help us avoid reconfiguring/rescheduling everything while not wasting much
> time on polling. It will also let us react faster to queue state changes.
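
A standalone illustration of the change-sequence pattern described above
(made-up structure and function names, not the actual OVS netdev API): the
state-change callback only bumps a sequence counter, and the polling thread
refreshes its cached per-queue 'enabled' flags when it notices the counter
has moved:
---
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define N_QUEUES 16

struct device {
    atomic_uint_least64_t change_seq;
    bool queue_enabled[N_QUEUES];
};

/* Writer side, e.g. called from a vring state-change callback:
 * cheap notification instead of requesting a full reconfiguration. */
void
device_changed(struct device *dev)
{
    atomic_fetch_add_explicit(&dev->change_seq, 1, memory_order_release);
}

/* Reader side, run occasionally from the polling loop: refresh the cached
 * flags only when the sequence number has changed. */
void
refresh_cached_state(struct device *dev, uint64_t *cached_seq,
                     bool cached_enabled[N_QUEUES])
{
    uint64_t seq = atomic_load_explicit(&dev->change_seq,
                                        memory_order_acquire);
    if (seq != *cached_seq) {
        *cached_seq = seq;
        for (int i = 0; i < N_QUEUES; i++) {
            cached_enabled[i] = dev->queue_enabled[i];
        }
    }
}
---
In real multi-threaded code the per-queue flags themselves would also need to
be read and written with proper synchronization; the sketch glosses over that.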
> 
> After that we'll be able to fix the reconfiguration on F_MQ manipulation with something
> as simple as this:
> ---
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 91af377e8..1937561cc 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -3492,8 +3528,8 @@ new_device(int vid)
>                  newnode = dev->socket_id;
>              }
>  
> -            if (dev->requested_n_txq != qp_num
> -                || dev->requested_n_rxq != qp_num
> +            if (dev->requested_n_txq < qp_num
> +                || dev->requested_n_rxq < qp_num
>                  || dev->requested_socket_id != newnode) {
>                  dev->requested_socket_id = newnode;
>                  dev->requested_n_rxq = qp_num;
> ---
> Decreasing 'qp_num' will not cause big issues because we won't poll
> disabled queues.
> 
> And there could be a separate change to drop the requested values if the socket connection
> is closed (I hope the guest is not able to control that):
> ---
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 91af377e8..1937561cc 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -185,13 +185,16 @@ static const struct rte_eth_conf port_conf = {
>   */
>  static int new_device(int vid);
>  static void destroy_device(int vid);
> +static void destroy_connection(int vid);
>  static int vring_state_changed(int vid, uint16_t queue_id, int enable);
>  static const struct vhost_device_ops virtio_net_device_ops =
>  {
>      .new_device =  new_device,
>      .destroy_device = destroy_device,
>      .vring_state_changed = vring_state_changed,
> -    .features_changed = NULL
> +    .features_changed = NULL,
> +    .new_connection = NULL,
> +    .destroy_connection = destroy_connection
>  };
>  
>  enum { DPDK_RING_SIZE = 256 };
> @@ -3462,6 +3465,39 @@ netdev_dpdk_remap_txqs(struct netdev_dpdk *dev)
>      free(enabled_queues);
>  }
>  
> +/*
> + * vhost-user socket connection is closed.
> + */
> +static void
> +destroy_connection(int vid)
> +{
> +    struct netdev_dpdk *dev;
> +    char ifname[IF_NAME_SZ];
> +
> +    rte_vhost_get_ifname(vid, ifname, sizeof ifname);
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    LIST_FOR_EACH(dev, list_node, &dpdk_list) {
> +        ovs_mutex_lock(&dev->mutex);
> +        if (nullable_string_is_equal(ifname, dev->vhost_id)) {
> +            uint32_t qp_num = NR_QUEUE;
> +
> +            /* Restore the number of queue pairs to default. */
> +            if (dev->requested_n_txq != qp_num
> +                || dev->requested_n_rxq != qp_num) {
> +                dev->requested_n_rxq = qp_num;
> +                dev->requested_n_txq = qp_num;
> +                netdev_request_reconfigure(&dev->up);
> +            }
> +            ovs_mutex_unlock(&dev->mutex);
> +            break;
> +        }
> +        ovs_mutex_unlock(&dev->mutex);
> +    }
> +    ovs_mutex_unlock(&dpdk_mutex);
> +}
> +
> +
>  /*
>   * A new virtio-net device is added to a vhost port.
>   */
> ---
> 
> Best regards, Ilya Maximets.
> 


