[ovs-dev] [PATCH v3 4/6] dpif-netdev: Change rxq_scheduling to use rxq processing cycles.

Wed Aug 9 15:47:07 UTC 2017

On 08/08/2017 07:15 PM, Greg Rose wrote:
> On 08/01/2017 08:58 AM, Kevin Traynor wrote:
>> Previously rxqs were assigned to pmds by round robin in
>> port/queue order.
>>
>> Now that we have the processing cycles used for existing rxqs,
>> use that information to try and produce a better balanced
>> distribution of rxqs across pmds. i.e. given multiple pmds, the
>> rxqs which have consumed the largest amount of processing cycles
>> will be placed on different pmds.
>>
>> The rxqs are sorted by their processing cycles and assigned (in
>> sorted order) round robin across pmds.
>>
>> Signed-off-by: Kevin Traynor <ktraynor at redhat.com>
>> ---
>>   Documentation/howto/dpdk.rst |  7 +++++
>>   lib/dpif-netdev.c            | 73 +++++++++++++++++++++++++++++++++++---------
>>   2 files changed, 66 insertions(+), 14 deletions(-)
>>
>> diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst
>> index af01d3e..a969285 100644
>> --- a/Documentation/howto/dpdk.rst
>> +++ b/Documentation/howto/dpdk.rst
>> @@ -119,4 +119,11 @@ After that PMD threads on cores where RX queues was pinned will become
>>     thread.
>>
>> +If pmd-rxq-affinity is not set for rxqs, they will be assigned to pmds (cores)
>> +automatically. The processing cycles that have been required for each rxq
>> +will be used where known to assign rxqs with the highest consumption of
>> +processing cycles to different pmds.
>> +
>> +Rxq to pmds assignment takes place whenever there are configuration changes.
>> +
>>   QoS
>>   ---
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 25a521a..a05e586 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -3295,8 +3295,29 @@ rr_numa_list_destroy(struct rr_numa_list *rr)
>>   }
>>
>> +/* Sort Rx Queues by the processing cycles they are consuming. */
>> +static int
>> +rxq_cycle_sort(const void *a, const void *b)
>> +{
>> +    struct dp_netdev_rxq * qa;
>> +    struct dp_netdev_rxq * qb;
>> +
>> +    qa = *(struct dp_netdev_rxq **) a;
>> +    qb = *(struct dp_netdev_rxq **) b;
>> +
>> +    if (dp_netdev_rxq_get_cycles(qa, RXQ_CYCLES_PROC_LAST) >=
>> +            dp_netdev_rxq_get_cycles(qb, RXQ_CYCLES_PROC_LAST)) {
>> +        return -1;
>> +    }
>> +
>> +    return 1;
>> +}
>> +
>>   /* Assign pmds to queues.  If 'pinned' is true, assign pmds to pinned
>>    * queues and marks the pmds as isolated.  Otherwise, assign non isolated
>>    * pmds to unpinned queues.
>>    *
>> + * If 'pinned' is false queues will be sorted by processing cycles they are
>> + * consuming and then assigned to pmds in round robin order.
>> + *
>>    * The function doesn't touch the pmd threads, it just stores the assignment
>>    * in the 'pmd' member of each rxq. */
>> @@ -3306,18 +3327,14 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>       struct dp_netdev_port *port;
>>       struct rr_numa_list rr;
>> -
>> -    rr_numa_list_populate(dp, &rr);
>> +    struct dp_netdev_rxq ** rxqs = NULL;
>> +    int i, n_rxqs = 0;
>> +    struct rr_numa *numa = NULL;
>> +    int numa_id;
>>
>>       HMAP_FOR_EACH (port, node, &dp->ports) {
>> -        struct rr_numa *numa;
>> -        int numa_id;
>> -
>>           if (!netdev_is_pmd(port->netdev)) {
>>               continue;
>>           }
>>
>> -        numa_id = netdev_get_numa_id(port->netdev);
>> -        numa = rr_numa_list_lookup(&rr, numa_id);
>> -
>>           for (int qid = 0; qid < port->n_rxq; qid++) {
>>               struct dp_netdev_rxq *q = &port->rxqs[qid];
>> @@ -3337,17 +3354,45 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>                   }
>>               } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
>> -                if (!numa) {
>> -                    VLOG_WARN("There's no available (non isolated) pmd thread "
>> -                              "on numa node %d. Queue %d on port \'%s\' will "
>> -                              "not be polled.",
>> -                              numa_id, qid, netdev_get_name(port->netdev));
>> +                if (n_rxqs == 0) {
>> +                    rxqs = xmalloc(sizeof *rxqs);
>>                   } else {
>> -                    q->pmd = rr_numa_get_pmd(numa);
>> +                    rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
>>                   }
>> +                /* Store the queue. */
>> +                rxqs[n_rxqs++] = q;
>>               }
>>           }
>>       }
>>
>> +    if (n_rxqs > 1) {
>> +        /* Sort the queues in order of the processing cycles
>> +         * they consumed during their last pmd interval. */
>> +        qsort(rxqs, n_rxqs, sizeof *rxqs, rxq_cycle_sort);
>> +    }
>> +
>> +    rr_numa_list_populate(dp, &rr);
>> +    /* Assign the sorted queues to pmds in round robin. */
>> +    for (i = 0; i < n_rxqs; i++) {
>> +        numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
>> +        numa = rr_numa_list_lookup(&rr, numa_id);
>> +        if (!numa) {
>> +            VLOG_WARN("There's no available (non isolated) pmd thread "
>> +                      "on numa node %d. Queue %d on port \'%s\' will "
>> +                      "not be polled.",
>> +                      numa_id, netdev_rxq_get_queue_id(rxqs[i]->rx),
>> +                      netdev_get_name(rxqs[i]->port->netdev));
>> +            continue;
>> +        }
>> +        rxqs[i]->pmd = rr_numa_get_pmd(numa);
>> +        VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>> +                  "rx queue %d (measured processing cycles %"PRIu64").",
>> +                  rxqs[i]->pmd->core_id, numa_id,
>> +                  netdev_rxq_get_name(rxqs[i]->rx),
>> +                  netdev_rxq_get_queue_id(rxqs[i]->rx),
>> +                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_LAST));
> 
> Kevin,
> 
> I've been reviewing and testing this code and found something odd.  The
> measured processing cycles are always zero in my setup.
> 
> sample log output:
> 
> 2017-08-08T12:48:25.871Z|00417|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'port-em2' rx queue 5 (measured processing cycles 10011304791).
> 2017-08-08T12:48:25.871Z|00418|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'port-em2' rx queue 4 (measured processing cycles 0).
> 
> Initially I configure my setup with 16 rxqs and a PMD CPU mask of
> 0x1FFFE.  Then I've been testing by running iperf traffic with
> multiple ports, 8 or 16 (-P option), to allow 'processing cycles' to
> count up.  Or at least I think that's what should be happening.  But
> when I reconfigure the rxqs and cpu mask the processing cycles are
> always zero.
> 

Hi Greg, thanks for trying it out. I see that rxq 5 has measured cycles
so it appears to be just on some queues.

The stat shown is the number of processing cycles that were counted for
the rxq during the last 1 sec run while it was on a pmd. "Processing
cycles" counts the time to fetch packets and process them, but it does
not count time spent polling when there are no rx packets.
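
A rough sketch of that accounting rule, for illustration only. This is
not the code from the patchset: the struct and function names below are
made up, and the cycle values are passed in by the caller rather than
measured.

```c
#include <stdint.h>

/* Hypothetical per-queue counter; not the OVS 'struct dp_netdev_rxq'. */
struct toy_rxq_stats {
    uint64_t proc_cycles;   /* Cycles spent on polls that returned packets. */
};

/* Account for one poll of a queue.  'cycles' is how long the poll (and
 * any packet processing) took, 'n_pkts' is how many packets the poll
 * returned.  Polls that return no packets are not counted. */
static void
toy_rxq_account_poll(struct toy_rxq_stats *stats, uint64_t cycles, int n_pkts)
{
    if (n_pkts > 0) {
        stats->proc_cycles += cycles;
    }
}
```

Under this rule a queue that is polled continuously but never receives
a packet keeps a count of 0, however long the pmd spends polling it.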

There are a few reasons it could be 0:
- The queue is newly added
- There is no rx traffic on that interface
- The interface has not distributed the traffic to that particular rxq
so there is no "processing cycles" done for that queue.

Given the rxq number in the log, I would hazard a guess that it's the
last issue. You could confirm this by setting pmds > total rxqs, so that
each pmd has a max of 1 rxq. The pmd stats can then indicate whether
packets are being received on that pmd, and hence on that rxq. You can
check that setup with
ovs-appctl dpif-netdev/pmd-rxq-show
ovs-appctl dpif-netdev/pmd-stats-clear
ovs-appctl dpif-netdev/pmd-stats-show

If you increase the number of flows so that RSS in the NIC (which IIRC
relies on the 5-tuple) can split them across the full range of rxqs, it
should solve that issue. Of course there could always be a bug somewhere
too!
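
To illustrate why a small number of flows can leave some rxqs idle,
here is a toy model. It is an assumption-heavy sketch: real NICs use
their own RSS hash and indirection table, whereas this uses a simple
FNV-1a hash over a made-up 5-tuple struct and picks the queue with a
modulo. All names are hypothetical.

```c
#include <stdint.h>

#define TOY_MAX_RXQ 64

/* Hypothetical 5-tuple flow key, for illustration only. */
struct toy_flow {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t proto;
};

/* Mix one 32-bit value into an FNV-1a hash, byte by byte. */
static uint32_t
toy_hash_add(uint32_t hash, uint32_t value)
{
    for (int i = 0; i < 4; i++) {
        hash ^= (value >> (8 * i)) & 0xff;
        hash *= 16777619u;
    }
    return hash;
}

/* Toy stand-in for the NIC's RSS hash; NOT the hash a real NIC uses. */
static uint32_t
toy_flow_hash(const struct toy_flow *f)
{
    uint32_t hash = 2166136261u;

    hash = toy_hash_add(hash, f->src_ip);
    hash = toy_hash_add(hash, f->dst_ip);
    hash = toy_hash_add(hash, ((uint32_t) f->src_port << 16) | f->dst_port);
    hash = toy_hash_add(hash, f->proto);
    return hash;
}

/* Returns how many of 'n_rxq' queues receive at least one of 'n_flows'
 * flows that differ only in source port, or -1 if 'n_rxq' is out of
 * range for this sketch. */
static int
toy_queues_hit(int n_flows, int n_rxq)
{
    unsigned char hit[TOY_MAX_RXQ] = { 0 };
    int n_hit = 0;

    if (n_rxq <= 0 || n_rxq > TOY_MAX_RXQ) {
        return -1;
    }
    for (int i = 0; i < n_flows; i++) {
        struct toy_flow f = {
            .src_ip = 0x0a000001, .dst_ip = 0x0a000002,
            .src_port = (uint16_t) (5001 + i), .dst_port = 5001,
            .proto = 6,
        };
        uint32_t qid = toy_flow_hash(&f) % (uint32_t) n_rxq;

        if (!hit[qid]) {
            hit[qid] = 1;
            n_hit++;
        }
    }
    return n_hit;
}
```

Comparing toy_queues_hit(16, 16) with toy_queues_hit(1000, 16) gives a
feel for the effect: the second can never be smaller than the first,
and with only as many flows as queues any hash collision leaves a queue
with no traffic.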

> How are you testing this?  Perhaps it's just my test harness or
> something else.
> 

I'm using 2 dpdk ports with flows added to send between them. Externally
I have pktgen-dpdk connected and sending 1K flows so I hit all queues.
Then I vary traffic rates, pmds and queue numbers, and also use
ovs-appctl dpif-netdev/pmd-rxq-rebalance from 6/6.

> Initial setup:
> 
> ovs-vsctl set Interface port-em2 options:n_rxq=16
> ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1FFFE
> 
> (Note that I do not set affinity - I have read your patch to infer that
> this is for cases without affinitization.)
> 

That's correct, and manual affinitization takes precedence (I need to
add that to the docs if I haven't). The patchset only changes how the
non-affinitized rxqs are distributed.
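
As a standalone sketch of that distribution (sort by measured cycles,
then round robin across pmds): the types and names below are
hypothetical stand-ins, not the dpif-netdev ones. It also ignores NUMA,
whereas the patch does the round robin per numa node, and its
comparator returns 0 for ties, which is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an rxq; not 'struct dp_netdev_rxq'. */
struct toy_rxq {
    int qid;            /* Queue id. */
    uint64_t cycles;    /* Processing cycles measured in the last interval. */
    int pmd;            /* Index of the assigned pmd, filled in below. */
};

/* qsort() comparator over an array of pointers: most cycles first. */
static int
toy_rxq_cycle_cmp(const void *a, const void *b)
{
    const struct toy_rxq *qa = *(const struct toy_rxq *const *) a;
    const struct toy_rxq *qb = *(const struct toy_rxq *const *) b;

    if (qa->cycles != qb->cycles) {
        return qa->cycles > qb->cycles ? -1 : 1;
    }
    return 0;
}

/* Sort 'rxqs' by cycles in descending order, then assign them to
 * 'n_pmds' pmds in round robin order.  Does nothing if there are no
 * pmds to assign to. */
static void
toy_rxq_assign(struct toy_rxq **rxqs, size_t n_rxqs, int n_pmds)
{
    if (n_pmds <= 0) {
        return;
    }
    qsort(rxqs, n_rxqs, sizeof *rxqs, toy_rxq_cycle_cmp);
    for (size_t i = 0; i < n_rxqs; i++) {
        rxqs[i]->pmd = (int) (i % (size_t) n_pmds);
    }
}
```

For example, four queues with cycle counts 10, 900, 0 and 500 and two
pmds end up with the 900 and 500 queues on different pmds, which is the
intent of the change.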

> After getting traffic I then run this setup:
> 
> ovs-vsctl set Interface port-em2 options:n_rxq=4
> ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1E
> 
> Any advice or comment?
> 
> Thanks,
> 
> - Greg
> 

I've also just sent a v4, rebased on the head of master.

thanks,
Kevin.

>> +    }
>> +
>>       rr_numa_list_destroy(&rr);
>> +    free(rxqs);
>>   }
>>
>>
>