[ovs-dev] [PATCH v8 5/6] dpif-netdev: Time based output batching.
Ilya Maximets
i.maximets at samsung.com
Wed Dec 20 13:43:55 UTC 2017
On 20.12.2017 14:03, Ilya Maximets wrote:
>>>> +
>>>> +        if (need_to_flush) {
>>>> +            /* We didn't receive anything in the process loop.
>>>> +             * Check if we need to send something.
>>>> +             * There was no time updates on current iteration. */
>>>> +            pmd_thread_ctx_time_update(pmd);
>>>> +            process_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>>>> +            cycles_count_intermediate(pmd, NULL,
>>>> +                                      process_packets ? PMD_CYCLES_PROCESSING
>>>> +                                                      : PMD_CYCLES_IDLE);
>>>
>>> I noticed the processing cycles for an rxq are not counted here. It means
>>> the processing cycles spent to tx pkts will no longer be reflected in the
>>> rxq-to-pmd assignment (or any rxq stats). I realize the tx processing
>>> cycles are now shared, so we will have some inaccuracy anyway, but for an
>>> individual rxq that has to send to vhost, it could be a significant % of
>>> its measured cycles.
>>>
>>> OTOH, this code seems like it would only be hit at low rates (no
>>> packets on any queue during the last poll)? So I'm not sure how
>>> significant it would be in the overall rxq-pmd assignment. E.g. if the rxq
>>> is measured as using 2% or 7% of a pmd it's probably fine for rxq-pmd
>>> assignment, whereas 20% or 70% could create a very imbalanced
>>> distribution.
>>>
>>> If it were significant for rxq-pmd assignment, I'm thinking the best way
>>> would be to add the cycles required to tx flush each port to each rxq
>>> that has packets in the flush. It's over-counting rather than
>>> under-counting, but we can't assume any shared batching after a new assignment.
>>>
>>> Let me know what you think? Do you think it would only impact the rxq
>>> measurements during low traffic rates?
>>>
>>> Kevin.
>
> Hmm. I ran some tests on my PVP setup with bonded PHY, with
> tx-flush-interval=50, packet_size=512B and n_flows=1M.
> (Full environment description available here:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-December/341700.html )
> Results:
>
>   Packet rate            % of packets flushed
>   (% of 10G line rate)   with rxq = NULL
>   --------------------   --------------------
>           10%                   27.44%
>           25%                   24.03%
>           50%                   23.34%
>           60%                   18.63%
>           65%                   18.40%
>           70%                    3.57%
>           75%                    1.21%
>           95%                    0.04%
>
> Almost all of these flushes happened on the core that polls HW NICs only.
> Somewhere between 65% and 70% of 10G line rate the HW NICs (10G ixgbe)
> start to receive at least one packet on each polling cycle.
>
> Do you think the above numbers are significant enough?
>
> About a possible fix:
> It's hard to find out at send time from which rx queue a packet was
> received. We don't even pass this information to the processing functions.
> One way is to store last_rxq in the pmd context on receive, and to add this
> rxq to a list kept alongside output_pkts inside OVS_ACTION_ATTR_OUTPUT.
> After that we'll have to move 'cycles_count_intermediate' inside
> 'dp_netdev_process_rxq_port' and 'dp_netdev_pmd_flush_output_on_port', where
> we'll count cycles for receiving+processing and sending respectively.
> In this case we'll be able to collect almost accurate cycles for each
> rxq separately, and even to account for the impact of each rxq on the
> batched send and charge it the appropriate share of the shared send cycles.
>
> To me, the above approach looks like the only possible way to keep nearly
> accurate per-rxq statistics, in case we need them.
>
> Thoughts?
>
>>
>> Good catch,
>>
>> I was going to push this today, but I'll hold off until this issue is resolved. I don't want an existing feature such as rxq balancing to be negatively impacted if we can avoid it.
>>
>> Ian
>
> Ian, in general, I'm fine with pushing the series without this patch, because
> simple output batching provides a significant performance improvement by itself
> for bonding/multiflow cases. Maybe we can delay only the time-based approach
> until we have a solution for the rxq-cycles issue.
>
> Best regards, Ilya Maximets.
>
The following incremental should implement the above approach (compile tested only):
-------------------------------------------------------------------
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index a612f99..e490793 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -540,6 +540,7 @@ struct tx_port {
struct hmap_node node;
long long flush_time;
struct dp_packet_batch output_pkts;
+ struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
};
/* A set of properties for the current processing loop that is not directly
@@ -551,6 +552,10 @@ struct dp_netdev_pmd_thread_ctx {
long long now;
/* Used to count cycles. See 'cycles_count_end()' */
unsigned long long last_cycles;
+ /* RX queue from which the last packet was received. */
+ struct dp_netdev_rxq *last_rxq;
+ /* Indicates how the last counted cycles should be treated. */
+ enum pmd_cycles_counter_type current_pmd_cycles_type;
};
/* PMD: Poll modes drivers. PMD accesses devices via polling to eliminate
@@ -3219,42 +3224,53 @@ cycles_counter(void)
/* Fake mutex to make sure that the calls to cycles_count_* are balanced */
extern struct ovs_mutex cycles_counter_fake_mutex;
-/* Start counting cycles. Must be followed by 'cycles_count_end()' */
+/* Start counting cycles. Must be followed by 'cycles_count_end()'.
+ * Counting starts in the idle state. */
static inline void
cycles_count_start(struct dp_netdev_pmd_thread *pmd)
OVS_ACQUIRES(&cycles_counter_fake_mutex)
OVS_NO_THREAD_SAFETY_ANALYSIS
{
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
pmd->ctx.last_cycles = cycles_counter();
}
-/* Stop counting cycles and add them to the counter 'type' */
+/* Stop counting cycles and add them to the counter of the current type. */
static inline void
-cycles_count_end(struct dp_netdev_pmd_thread *pmd,
- enum pmd_cycles_counter_type type)
+cycles_count_end(struct dp_netdev_pmd_thread *pmd)
OVS_RELEASES(&cycles_counter_fake_mutex)
OVS_NO_THREAD_SAFETY_ANALYSIS
{
unsigned long long interval = cycles_counter() - pmd->ctx.last_cycles;
+ enum pmd_cycles_counter_type type = pmd->ctx.current_pmd_cycles_type;
non_atomic_ullong_add(&pmd->cycles.n[type], interval);
}
-/* Calculate the intermediate cycle result and add to the counter 'type' */
+/* Calculate the intermediate cycle result and add it to the counter of
+ * the current type. */
static inline void
cycles_count_intermediate(struct dp_netdev_pmd_thread *pmd,
- struct dp_netdev_rxq *rxq,
- enum pmd_cycles_counter_type type)
+ struct dp_netdev_rxq **rxqs, int n_rxqs)
OVS_NO_THREAD_SAFETY_ANALYSIS
{
unsigned long long new_cycles = cycles_counter();
unsigned long long interval = new_cycles - pmd->ctx.last_cycles;
+ enum pmd_cycles_counter_type type = pmd->ctx.current_pmd_cycles_type;
+ int i;
+
pmd->ctx.last_cycles = new_cycles;
non_atomic_ullong_add(&pmd->cycles.n[type], interval);
- if (rxq && (type == PMD_CYCLES_PROCESSING)) {
+ if (n_rxqs && (type == PMD_CYCLES_PROCESSING)) {
/* Add to the amount of current processing cycles. */
- non_atomic_ullong_add(&rxq->cycles[RXQ_CYCLES_PROC_CURR], interval);
+ interval /= n_rxqs;
+ for (i = 0; i < n_rxqs; i++) {
+ if (rxqs[i]) {
+ non_atomic_ullong_add(&rxqs[i]->cycles[RXQ_CYCLES_PROC_CURR],
+ interval);
+ }
+ }
}
}
@@ -3299,6 +3315,16 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
int output_cnt;
bool dynamic_txqs;
uint32_t tx_flush_interval;
+ enum pmd_cycles_counter_type save_pmd_cycles_type;
+
+ /* In case we're in the PMD_CYCLES_PROCESSING state, we need to count
+ * cycles for the rxq we're processing now. */
+ cycles_count_intermediate(pmd, &pmd->ctx.last_rxq, 1);
+
+ /* Save the current cycle counting state so it can be restored after
+ * accounting the send cycles. */
+ save_pmd_cycles_type = pmd->ctx.current_pmd_cycles_type;
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_PROCESSING;
dynamic_txqs = p->port->dynamic_txqs;
if (dynamic_txqs) {
@@ -3323,6 +3349,9 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
dp_netdev_count_packet(pmd, DP_STAT_SENT_PKTS, output_cnt);
dp_netdev_count_packet(pmd, DP_STAT_SENT_BATCHES, 1);
+ /* Update send cycles for all the rx queues and restore previous state. */
+ cycles_count_intermediate(pmd, p->output_pkts_rxqs, output_cnt);
+ pmd->ctx.current_pmd_cycles_type = save_pmd_cycles_type;
return output_cnt;
}
@@ -3348,7 +3377,7 @@ dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
static int
dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
- struct netdev_rxq *rx,
+ struct dp_netdev_rxq *rxq,
odp_port_t port_no)
{
struct dp_packet_batch batch;
@@ -3356,20 +3385,30 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
int batch_cnt = 0, output_cnt = 0;
dp_packet_batch_init(&batch);
- error = netdev_rxq_recv(rx, &batch);
+
+ cycles_count_intermediate(pmd, NULL, 0);
+ pmd->ctx.last_rxq = rxq;
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_PROCESSING;
+ error = netdev_rxq_recv(rxq->rx, &batch);
+ cycles_count_intermediate(pmd, &rxq, 1);
+
if (!error) {
*recirc_depth_get() = 0;
pmd_thread_ctx_time_update(pmd);
batch_cnt = batch.count;
dp_netdev_input(pmd, &batch, port_no);
+ cycles_count_intermediate(pmd, &rxq, 1);
+
output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
} else if (error != EAGAIN && error != EOPNOTSUPP) {
static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
- netdev_rxq_get_name(rx), ovs_strerror(error));
+ netdev_rxq_get_name(rxq->rx), ovs_strerror(error));
}
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
+ pmd->ctx.last_rxq = NULL;
return batch_cnt + output_cnt;
}
@@ -3994,7 +4033,6 @@ dpif_netdev_run(struct dpif *dpif)
struct dp_netdev *dp = get_dp_netdev(dpif);
struct dp_netdev_pmd_thread *non_pmd;
uint64_t new_tnl_seq;
- int process_packets;
bool need_to_flush = true;
ovs_mutex_lock(&dp->port_mutex);
@@ -4007,15 +4045,9 @@ dpif_netdev_run(struct dpif *dpif)
int i;
for (i = 0; i < port->n_rxq; i++) {
- process_packets =
- dp_netdev_process_rxq_port(non_pmd,
- port->rxqs[i].rx,
- port->port_no);
- cycles_count_intermediate(non_pmd, NULL,
- process_packets
- ? PMD_CYCLES_PROCESSING
- : PMD_CYCLES_IDLE);
- if (process_packets) {
+ if (dp_netdev_process_rxq_port(non_pmd,
+ &port->rxqs[i],
+ port->port_no)) {
need_to_flush = false;
}
}
@@ -4026,14 +4058,10 @@ dpif_netdev_run(struct dpif *dpif)
* Check if we need to send something.
* There was no time updates on current iteration. */
pmd_thread_ctx_time_update(non_pmd);
- process_packets = dp_netdev_pmd_flush_output_packets(non_pmd,
- false);
- cycles_count_intermediate(non_pmd, NULL, process_packets
- ? PMD_CYCLES_PROCESSING
- : PMD_CYCLES_IDLE);
+ dp_netdev_pmd_flush_output_packets(non_pmd, false);
}
- cycles_count_end(non_pmd, PMD_CYCLES_IDLE);
+ cycles_count_end(non_pmd);
dpif_netdev_xps_revalidate_pmd(non_pmd, false);
ovs_mutex_unlock(&dp->non_pmd_mutex);
@@ -4213,17 +4241,11 @@ reload:
cycles_count_start(pmd);
for (;;) {
- int process_packets;
bool need_to_flush = true;
for (i = 0; i < poll_cnt; i++) {
- process_packets =
- dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
- poll_list[i].port_no);
- cycles_count_intermediate(pmd, poll_list[i].rxq,
- process_packets ? PMD_CYCLES_PROCESSING
- : PMD_CYCLES_IDLE);
- if (process_packets) {
+ if (dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
+ poll_list[i].port_no)) {
need_to_flush = false;
}
}
@@ -4233,10 +4255,7 @@ reload:
* Check if we need to send something.
* There was no time updates on current iteration. */
pmd_thread_ctx_time_update(pmd);
- process_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
- cycles_count_intermediate(pmd, NULL,
- process_packets ? PMD_CYCLES_PROCESSING
- : PMD_CYCLES_IDLE);
+ dp_netdev_pmd_flush_output_packets(pmd, false);
}
if (lc++ > 1024) {
@@ -4257,7 +4276,7 @@ reload:
}
}
- cycles_count_end(pmd, PMD_CYCLES_IDLE);
+ cycles_count_end(pmd);
poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
exiting = latch_is_set(&pmd->exit_latch);
@@ -4697,6 +4716,8 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
ovs_mutex_init(&pmd->port_mutex);
cmap_init(&pmd->flow_table);
cmap_init(&pmd->classifiers);
+ pmd->ctx.last_rxq = NULL;
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
pmd_thread_ctx_time_update(pmd);
pmd->next_optimization = pmd->ctx.now + DPCLS_OPTIMIZATION_INTERVAL;
pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN;
@@ -5575,6 +5596,8 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
}
DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
+ p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
+ pmd->ctx.last_rxq;
dp_packet_batch_add(&p->output_pkts, packet);
}
return;
-------------------------------------------------------------------
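As a side note, here is a minimal standalone sketch of the apportioning model
used in 'cycles_count_intermediate()' above: the shared send interval is split
evenly across the per-packet rxq pointers, so an rxq that contributed more
packets to the batch is charged a proportionally larger share. The names here
(fake_rxq, apportion_send_cycles) are invented purely for illustration and are
not OVS APIs:

#include <stdio.h>

#define MAX_BURST 32

struct fake_rxq {
    const char *name;
    unsigned long long proc_cycles;   /* Accumulated processing cycles. */
};

/* Split 'interval' cycles evenly across the (possibly repeated)
 * per-packet rxq pointers, mirroring 'interval /= n_rxqs' above. */
static void
apportion_send_cycles(struct fake_rxq **rxqs, int n_rxqs,
                      unsigned long long interval)
{
    if (!n_rxqs) {
        return;
    }
    interval /= n_rxqs;
    for (int i = 0; i < n_rxqs; i++) {
        if (rxqs[i]) {
            rxqs[i]->proc_cycles += interval;
        }
    }
}

int
main(void)
{
    struct fake_rxq a = { "rxq-a", 0 };
    struct fake_rxq b = { "rxq-b", 0 };
    /* A batch of 4 packets: 3 received from rxq-a, 1 from rxq-b,
     * so rxq-a is charged 3/4 of the shared send cost. */
    struct fake_rxq *per_pkt[MAX_BURST] = { &a, &a, &b, &a };

    apportion_send_cycles(per_pkt, 4, 1000);
    printf("%s: %llu cycles\n", a.name, a.proc_cycles);   /* 750 */
    printf("%s: %llu cycles\n", b.name, b.proc_cycles);   /* 250 */
    return 0;
}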
What do you think?
Best regards, Ilya Maximets.