[ovs-dev] 10-25 packet drops every few (10-50) seconds TCP (iperf3)

Flavio Leitner fbl at sysclose.org
Tue Jun 30 16:50:16 UTC 2020


Right, you might want to review Documentation/timers/no_hz.rst from
the kernel sources and look at the RCU implications section, which
explains how to move RCU callbacks off the isolated CPUs.
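
For example, a minimal sketch assuming CPUs 1-7 are the isolated PMD
cores and CPU 0 is left for housekeeping (adjust the CPU lists to your
setup):

    # kernel command line
    isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7

    # after boot, pin the rcuo offload kthreads to the housekeeping CPU
    for p in $(pgrep rcuo); do taskset -pc 0 "$p"; done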

fbl

On Tue, Jun 30, 2020 at 12:08:05PM -0400, Shahaji Bhosle wrote:
> Hi Flavio,
> I wrote a small program with a do_nothing for loop and measured the
> timestamps across that loop. About 3% of the time, around the 1 second
> mark when the arch_timer fires, the timestamps are off by 25% of the
> expected value. I ran trace-cmd to see what is going on and got the
> output below. It looks like some issue with *gic_handle_irq*(); I am not
> seeing this behaviour on an x86 host, so something is special with ARM v8.
> Thanks, Shahaji
> 
>   %21.77  (14181) arm_stb_user_lo                    rcu_dyntick #922
>          |
>          --- *rcu_dyntick*
>             |
>             |--%46.85-- gic_handle_irq  # 432
>             |
>             |--%23.32-- context_tracking_user_exit  # 215
>             |
>             |--%22.34-- context_tracking_user_enter  # 206
>             |
>             |--%2.60-- SyS_execve  # 24
>             |
>             |--%1.30-- do_page_fault  # 12
>             |
>             |--%0.65-- SyS_write  # 6
>             |
>             |--%0.65-- schedule  # 6
>             |
>             |--%0.65-- SyS_nanosleep  # 6
>             |
>             |--%0.65-- syscall_trace_enter  # 6
>             |
>             |--%0.65-- SyS_faccessat  # 6
> 
>   %5.01  (14181) arm_stb_user_lo                rcu_utilization #212
>          |
>          --- *rcu_utilization*
>             |
>             |--%96.23-- gic_handle_irq  # 204
>             |
>             |--%1.89-- SyS_nanosleep  # 4
>             |
>             |--%0.94-- SyS_exit_group  # 2
>             |
>             |--%0.94-- do_notify_resume  # 2
> 
>   %4.86  (14181) arm_stb_user_lo                      user_exit #206
>          |
>          --- *user_exit*
>           context_tracking_user_exit
> 
>   %4.86  (14181) arm_stb_user_lo     context_tracking_user_exit #206
>          |
>          --- context_tracking_user_exit
> 
>   %4.86  (14181) arm_stb_user_lo    context_tracking_user_enter #206
>          |
>          --- context_tracking_user_enter
> 
>   %4.86  (14181) arm_stb_user_lo                     user_enter #206
>          |
>          --- *user_enter*
>           context_tracking_user_enter
> 
>   %2.95  (14181) arm_stb_user_lo                 gic_handle_irq #125
>          |
>          --- gic_handle_irq
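> 
> (For reference, a report like the one above can be produced with
> something along these lines -- the exact event list and options are an
> assumption on my part:
> 
>   trace-cmd record --profile -e rcu -e irq -P <pid> sleep 10
>   trace-cmd report --profile
> )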
> 
> 
> On Tue, Jun 30, 2020 at 9:45 AM Flavio Leitner <fbl at sysclose.org> wrote:
> 
> > On Tue, Jun 02, 2020 at 12:56:51PM -0700, Vinay Gupta wrote:
> > > Hi Flavio,
> > >
> > > Thanks for your reply.
> > > I have captured the suggested information but do not see anything that
> > > could cause the packet drops.
> > > Can you please take a look at the data below and see if you can find
> > > something unusual?
> > > The PMDs are running on CPU 1,2,3,4 and CPU 1-7 are isolated cores.
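> > >
> > > (The isolation can be double-checked with something like the following;
> > > note that /sys/devices/system/cpu/isolated only exists on reasonably
> > > recent kernels:
> > >
> > >   cat /proc/cmdline
> > >   cat /sys/devices/system/cpu/isolated
> > > )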
> > >
> > >
> > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > > root at bcm958802a8046c:~# cstats ; sleep 10; cycles
> > > pmd thread numa_id 0 core_id 1:
> > >   idle cycles: 99140849 (7.93%)
> > >   processing cycles: 1151423715 (92.07%)
> > >   avg cycles per packet: 116.94 (1250564564/10693918)
> > >   avg processing cycles per packet: 107.67 (1151423715/10693918)
> > > pmd thread numa_id 0 core_id 2:
> > >   idle cycles: 118373662 (9.47%)
> > >   processing cycles: 1132193442 (90.53%)
> > >   avg cycles per packet: 124.39 (1250567104/10053309)
> > >   avg processing cycles per packet: 112.62 (1132193442/10053309)
> > > pmd thread numa_id 0 core_id 3:
> > >   idle cycles: 53805933 (4.30%)
> > >   processing cycles: 1196762002 (95.70%)
> > >   avg cycles per packet: 107.35 (1250567935/11649948)
> > >   avg processing cycles per packet: 102.73 (1196762002/11649948)
> > > pmd thread numa_id 0 core_id 4:
> > >   idle cycles: 189102938 (15.12%)
> > >   processing cycles: 1061463293 (84.88%)
> > >   avg cycles per packet: 143.47 (1250566231/8716828)
> > >   avg processing cycles per packet: 121.77 (1061463293/8716828)
> > > pmd thread numa_id 0 core_id 5:
> > > pmd thread numa_id 0 core_id 6:
> > > pmd thread numa_id 0 core_id 7:
> >
> >
> > The core_id 3 is highly loaded, so it is more likely to show
> > the drop issue when some other event happens.
> >
> > I think you need to run perf as I recommended before and see if
> > there are context switches happening and why they are happening.
> >
> > If a context switch happens, it's either because the core is not
> > well isolated or some other thing is going on. It will help to
> > understand why the queue wasn't serviced for a certain amount of
> > time.
> >
> > The issue is that running perf might introduce some load, so you
> > will need to adjust the traffic rate accordingly.
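> >
> > For example, a quick way to confirm whether the PMD threads are being
> > preempted at all (the thread IDs here are the pmd TIDs from your pidstat
> > output below, so adjust as needed):
> >
> > # perf stat -e context-switches -t 9330,9331,9334,9335 -- sleep 10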
> >
> > HTH,
> > fbl
> >
> >
> >
> > >
> > >
> > > Runtime summary
> > >                           comm  parent  sched-in  run-time  min-run  avg-run  max-run  stddev  migrations
> > >                                          (count)    (msec)   (msec)   (msec)   (msec)       %
> > > ---------------------------------------------------------------------------------------------------------------------
> > >                 ksoftirqd/0[7]       2         1     0.079    0.079    0.079    0.079    0.00         0
> > >                   rcu_sched[8]       2        14     0.067    0.002    0.004    0.009    9.96         0
> > >                    rcuos/4[38]       2         6     0.027    0.002    0.004    0.008   20.97         0
> > >                    rcuos/5[45]       2         4     0.018    0.004    0.004    0.005    6.63         0
> > >                kworker/0:1[71]       2        12     0.156    0.008    0.013    0.019    6.72         0
> > >                  mmcqd/0[1230]       2         3     0.054    0.001    0.018    0.031   47.29         0
> > >             kworker/0:1H[1248]       2         1     0.006    0.006    0.006    0.006    0.00         0
> > >            kworker/u16:2[1547]       2        16     0.045    0.001    0.002    0.012   26.19         0
> > >                     ntpd[5282]       1         1     0.063    0.063    0.063    0.063    0.00         0
> > >                 watchdog[6988]       1         2     0.089    0.012    0.044    0.076   72.26         0
> > >             ovs-vswitchd[9239]       1         2     0.326    0.152    0.163    0.173    6.45         0
> > >        revalidator8[9309/9239]    9239         2     1.260    0.607    0.630    0.652    3.58         0
> > >                    perf[27150]   27140         1     0.000    0.000    0.000    0.000    0.00         0
> > >
> > > Terminated tasks:
> > >                   sleep[27151]   27150         4     1.002    0.015    0.250    0.677   58.22         0
> > >
> > > Idle stats:
> > >     CPU  0 idle for    999.814  msec  ( 99.84%)
> > >     CPU  1 idle entire time window
> > >     CPU  2 idle entire time window
> > >     CPU  3 idle entire time window
> > >     CPU  4 idle entire time window
> > >     CPU  5 idle for    500.326  msec  ( 49.96%)
> > >     CPU  6 idle entire time window
> > >     CPU  7 idle entire time window
> > >
> > >     Total number of unique tasks: 14
> > > Total number of context switches: 115
> > >            Total run time (msec):  3.198
> > >     Total scheduling time (msec): 1001.425  (x 8)
> > > (END)
> > >
> > >
> > >
> > > 02:16:22   UID   TGID    TID    %usr  %system  %guest   %wait    %CPU  CPU  Command
> > > 02:16:23     0   9239      -  100.00     0.00    0.00    0.00  100.00    5  ovs-vswitchd
> > > 02:16:23     0      -   9239    2.00     0.00    0.00    0.00    2.00    5  |__ovs-vswitchd
> > > 02:16:23     0      -   9240    0.00     0.00    0.00    0.00    0.00    0  |__vfio-sync
> > > 02:16:23     0      -   9241    0.00     0.00    0.00    0.00    0.00    5  |__eal-intr-thread
> > > 02:16:23     0      -   9242    0.00     0.00    0.00    0.00    0.00    5  |__dpdk_watchdog1
> > > 02:16:23     0      -   9244    0.00     0.00    0.00    0.00    0.00    5  |__urcu2
> > > 02:16:23     0      -   9279    0.00     0.00    0.00    0.00    0.00    5  |__ct_clean3
> > > 02:16:23     0      -   9308    0.00     0.00    0.00    0.00    0.00    5  |__handler9
> > > 02:16:23     0      -   9309    0.00     0.00    0.00    0.00    0.00    5  |__revalidator8
> > > 02:16:23     0      -   9328    0.00     0.00    0.00    0.00    0.00    6  |__pmd13
> > > 02:16:23     0      -   9330  100.00     0.00    0.00    0.00  100.00    3  |__pmd12
> > > 02:16:23     0      -   9331  100.00     0.00    0.00    0.00  100.00    1  |__pmd11
> > > 02:16:23     0      -   9332    0.00     0.00    0.00    0.00    0.00    7  |__pmd10
> > > 02:16:23     0      -   9333    0.00     0.00    0.00    0.00    0.00    5  |__pmd16
> > > 02:16:23     0      -   9334  100.00     0.00    0.00    0.00  100.00    2  |__pmd15
> > > 02:16:23     0      -   9335  100.00     0.00    0.00    0.00  100.00    4  |__pmd14
> > >
> > ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > >
> > > Thanks
> > > Vinay
> > >
> > > On Tue, Jun 2, 2020 at 12:06 PM Flavio Leitner <fbl at sysclose.org> wrote:
> > >
> > > > On Mon, Jun 01, 2020 at 07:27:09PM -0400, Shahaji Bhosle via dev wrote:
> > > > > Hi Ben/Ilya,
> > > > > Hope you guys are doing well and staying safe. I have been chasing a
> > > > > weird problem with small drops and I think that is causing lots of
> > > > > TCP retransmissions.
> > > > >
> > > > > Setup details
> > > > > iPerf3(1k-5K Servers)<--DPDK2:OvS+DPDK(VxLAN:BOND)[DPDK0+DPDK1)<====2x25G<====[DPDK0+DPDK1)(VxLAN:BOND)OVS+DPDKDPDK2<---iPerf3(Clients)
> > > > >
> > > > > All the drops are ring drops on the BONDed functions on the server
> > > > > side. I have 4 CPUs each with 3 PMD threads, and DPDK0, DPDK1 and
> > > > > DPDK2 all running with 4 Rx rings each.
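> > > > >
> > > > > (The Rx ring to PMD thread assignment can be inspected with, for
> > > > > example:
> > > > >
> > > > >   ovs-appctl dpif-netdev/pmd-rxq-show
> > > > > )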
> > > > >
> > > > > What is interesting is that when I give each Rx ring its own CPU the
> > > > > drops go away. Or if I set other_config:emc-insert-inv-prob=1 the
> > > > > drops go away. But I need to scale up the number of flows, so I am
> > > > > trying to run this with EMC disabled.
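> > > > >
> > > > > (For reference, that EMC knob is set with something like the
> > > > > following -- a value of 1 forces an EMC insertion for every flow,
> > > > > while 0 disables insertions entirely:
> > > > >
> > > > >   ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=1
> > > > > )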
> > > > >
> > > > > I can tell that the rings are not getting serviced for 30-40 usec
> > > > > because of some kind of context switch or interrupts on these cores.
> > > > > I have tried to do the usual isolation, nohz_full, rcu_nocbs, etc.,
> > > > > and moved all the interrupts away from these cores. But nothing
> > > > > helps. I mean it improves, but the drops still happen.
> > > >
> > > > When you disable the EMC (or reduce its efficiency) the per-packet
> > > > cost increases, and then it becomes more sensitive to variations. If
> > > > you share a CPU with multiple queues, you decrease the amount of time
> > > > available to process each queue. In either case, there will be less
> > > > room to tolerate variations.
> > > >
> > > > Well, you might want to use 'perf' to monitor for scheduling events
> > > > and then, based on the stack trace, see what is causing them and try
> > > > to prevent it.
> > > >
> > > > For example:
> > > > # perf record -e sched:sched_switch -a -g sleep 1
> > > >
> > > > For instance, you might see that another NIC used for management has
> > > > IRQs assigned to one isolated CPU. You can move it to another CPU to
> > > > reduce the noise, etc...
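> > > >
> > > > Something along these lines (the interface name and IRQ number are
> > > > placeholders) shows where a NIC's interrupts land and repins them to
> > > > a housekeeping CPU:
> > > >
> > > > # grep eth0 /proc/interrupts
> > > > # echo 0 > /proc/irq/<irq>/smp_affinity_list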
> > > >
> > > > Another suggestion is to look at the PMD thread idle statistics,
> > > > because they will tell you how much "extra" room you have left. The
> > > > closer it gets to 0, the more finely tuned your setup needs to be to
> > > > avoid drops.
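> > > >
> > > > For example (the pmd-stats-clear step just resets the counters so the
> > > > sample covers a known window):
> > > >
> > > > # ovs-appctl dpif-netdev/pmd-stats-clear
> > > > # sleep 10
> > > > # ovs-appctl dpif-netdev/pmd-stats-show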
> > > >
> > > > HTH,
> > > > --
> > > > fbl
> > > >
> >
> > --
> > fbl
> >

-- 
fbl

