[ovs-dev] Could be a bug: extremely long latency for ovn-controller to get SB update in scale test

Han Zhou zhouhan at gmail.com
Thu Apr 19 18:11:07 UTC 2018


While working on ovn-controller incremental processing, I was testing the
end-to-end latency and comparing with master. I found the latency with
master branch is extremely long while changing the way of testing makes it
much shorter. The master branch was at commit 593e93e5f (ovsdb clustering
was not merged yet). I promised in last ovn meeting to send a report and
here are the details.

*Precondition: *
there are 1k sandbox HVs connected, and 10k lports already created and
bound on these HVs. There are 8 lrouters, each connected with 5 lswitches,
and each lswitch has 250 lports.

*Test: *
- on a sandbox (HV1), create a ovs-port:
# ovs-vsctl add-port br-int lsp1 -- set interface lsp1 type=internal -- set
interface lsp1 external-ids:iface-id=lsp1

- create the lport on NB on a lswitch that is not yet in local datapath of
HV1, wait for port up and then sync HVs
# time ovn-nbctl lsp-add lswitch_b9322b_ecpmKt lsp1 -- lsp-set-addresses
lsp1 "aa:aa:aa:aa:aa:aa 172.145.222.222"; time ovn-nbctl wait-until
Logical_Switch_Port lsp1 up=true; time ovn-nbctl --wait=hv sync

*Result:*
real    0m0.408s
user    0m0.316s
sys     0m0.008s

real    0m4.837s
user    0m0.348s
sys     0m0.012s

*real    7m48.373s*
user    0m0.348s
sys     0m0.016s

*Notes: *

- the time for the last --wait=hv sync is extremely long: 7m48s.

- the test is with the patch for nb_cfg separation (
https://patchwork.ozlabs.org/patch/899608/), so that --wait=hv sync is
optimized without notification flooding.

- the lswitch where the lsp1 is added doesn't belong to any existing local
datapath on HV1, i.e., there is no other existing lport on HV1 in this
lswitch or any other lswitch that is connected with this lswitch by the
same lrouter. This means when adding the lport the HV1 will change monitor
condition to watch on a new set of datapaths and port-bindings on them.
 (tested with adding the lsp1 on an existing local datapath, there is no
such extremely long latency.)

- adding a command to sync hvs before waiting for port up changes the
behavior: the total latency is only 20+ seconds then:
     # time ovn-nbctl lsp-add lswitch_b9322b_ecpmKt lsp1 --
lsp-set-addresses lsp1 "aa:aa:aa:aa:aa:aa 172.145.222.222"; *time ovn-nbctl
--wait=hv sync;* time ovn-nbctl wait-until Logical_Switch_Port lsp1
up=true; time ovn-nbctl --wait=hv sync

*Debug log analysis:*

With debug log enabled for ovn-controller, I found that the latency was
because ovn-controller didn't get any SB update during the 7+ minutes. Here
are the related logs and analysis:

// HV1 got notification of lsp1 port binding added by northd after the lsp1
is created in NB
2018-03-20T00:00:07.353Z|51461|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"Port_Binding":{"d6f145bf-ac8f-4107-a565-9ef1fcc9ec6c":{"insert":{"tunnel_key":252,"mac":"aa:aa:aa:aa:aa:aa
172.145.222.222","logical_port":"lsp1","datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"]}}}}]


// Claim the lsp1 on this chassis
2018-03-20T00:00:07.367Z|51465|binding|INFO|Claiming lport lsp1 for this
chassis.
2018-03-20T00:00:07.367Z|51466|binding|INFO|lsp1: Claiming
aa:aa:aa:aa:aa:aa 172.145.222.222

// Update port binding to this chassis
2018-03-20T00:00:11.381Z|51588|jsonrpc|DBG|tcp:192.168.10.10:6640: send
request, method="transact",
params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","d6f145bf-ac8f-4107-a565-9ef1fcc9ec6c"]]],"row":{"chassis":["uuid","b2f297f6-cd8b-4429-94f0-274a24d6640a"]},"op":"update","table":"Port_Binding"}],
id=*2169*

// Send request to change monitor conditions for new local datapaths
2018-03-20T00:00:11.382Z|51594|jsonrpc|DBG|tcp:192.168.10.10:6640: send
request, method=“monitor_cond_change” … id=*2170*

// Received notification for its own port-binding update
2018-03-20T00:00:11.383Z|51595|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"Port_Binding":{"d6f145bf-ac8f-4107-a565-9ef1fcc9ec6c":{"modify":{"chassis":["uuid","b2f297f6-cd8b-4429-94f0-274a24d6640a"]}}}}]

2018-03-20T00:00:11.383Z|51596|jsonrpc|DBG|tcp:192.168.10.10:6640: received
reply, result=[{"count":1}], id=*2169*

*///////////////// 7 minutes later ... but why???*

// Received the notification for the new monitor condition related data
(it's huge data)
2018-03-20T00:07:56.183Z|52643|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"Multicast_Group":{"944391c4-7bf9-4d46-9662-966f9a5450e0":{"insert":{"ports":["set",[["uuid","00d22282-ac38-4a4e-8f47-3684fa915051"],["uuid","03010cdb-647b-4a1d-b0ca-7f4b0e7a6862"],["uuid","03ba954f-5c96-47f0-b517-4f30b2e61907"],["uuid","0480cf96-5158-45aa-ba49-6053cdec5ce3"],["uuid","052397f5-ebb3-4e81-9970-...(huge
data)...

2018-03-20T00:07:56.386Z|52644|jsonrpc|DBG|tcp:192.168.10.10:6640: received
reply, result={}, id=

*2170*
// Received the nb_cfg change triggered by the --wait=hv sync command,
together with the lflow update caused by the port status change: up=true
2018-03-20T00:07:56.387Z|52645|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"SB_Global":{"555d0a36-82d4-45b4-88cc-c9164d913066":{"modify":{"nb_cfg":45}}},"Logical_Flow":{"3747dba9-9c5f-46bf-a410-4df2e842db58":{"insert":{"match":"arp.tpa
== 172.145.222.222 && arp.op ==
1","pipeline":"ingress","priority":50,"logical_datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"],"table_id":11,"external_ids":["map",[["source","ovn-northd.c:3707"],["stage-name","ls_in_arp_rsp"]]],"actions":"eth.dst
= eth.src; eth.src = aa:aa:aa:aa:aa:aa; arp.op = 2; /* ARP reply */ arp.tha
= arp.sha; arp.sha = aa:aa:aa:aa:aa:aa; arp.tpa = arp.spa; arp.spa =
172.145.222.222; outport = inport; flags.loopback = 1;
output;"}},"546d7f4f-f82c-4449-bf93-eaacfc35f929":{"insert":{"match":"arp.tpa
== 172.145.222.222 && arp.op == 1 && inport ==
\"lsp1\"","pipeline":"ingress","priority":100,"logical_datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"],"table_id":11,"external_ids":["map",[["source","ovn-northd.c:3723"],["stage-name","ls_in_arp_rsp"]]],"actions":"next;"}}}}]


// Respond to the nb_cfg change
2018-03-20T00:08:05.165Z|58829|jsonrpc|DBG|tcp:192.168.10.10:6640: send
request, method="transact",
params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","e2e7e93f-f640-4860-a3aa-fbd26e18ca97"]]],"row":{"nb_cfg":45},"op":"update","table":"Chassis_NB_Cfg"}],
id=2172

========= the end ========



*Compare:*
Below is the one without so long latency, due to an additional --wait=hv
sync in between.
(# time ovn-nbctl lsp-add lswitch_b9322b_ecpmKt lsp1 -- lsp-set-addresses
lsp1 "aa:aa:aa:aa:aa:aa 172.145.222.222"; *time ovn-nbctl --wait=hv sync;*
time ovn-nbctl wait-until Logical_Switch_Port lsp1 up=true; time ovn-nbctl
--wait=hv sync)

// HV1 got notification of lsp1 port binding added by northd after the lsp1
is created in NB
2018-03-20T00:19:32.350Z|65939|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"Port_Binding":{"57b18b07-c373-4feb-b539-b8ec20edd89e":{"insert":{"tunnel_key":252,"mac":"aa:aa:aa:aa:aa:aa
172.145.222.222","logical_port":"lsp1","datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"]}}}}]

// Claim the lsp1 on this chassis
2018-03-20T00:19:32.365Z|65940|binding|INFO|Claiming lport lsp1 for this
chassis.
2018-03-20T00:19:32.365Z|65941|binding|INFO|lsp1: Claiming
aa:aa:aa:aa:aa:aa 172.145.222.222

// Update port binding to this chassis
2018-03-20T00:19:36.807Z|66061|jsonrpc|DBG|tcp:192.168.10.10:6640: send
request, method="transact",
params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","57b18b07-c373-4feb-b539-b8ec20edd89e"]]],"row":{"chassis":["uuid","b2f297f6-cd8b-4429-94f0-274a24d6640a"]},"op":"update","table":"Port_Binding"}],
id=2177

// Send request to change monitor conditions for new local datapaths
2018-03-20T00:19:36.808Z|66067|jsonrpc|DBG|tcp:192.168.10.10:6640: send
request, method=“monitor_cond_change” … id=2178

// Received notification of nb_cfg update triggered by the first --wait=hv
sync command
2018-03-20T00:19:36.809Z|66068|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"SB_Global":{"555d0a36-82d4-45b4-88cc-c9164d913066":{"modify":{"nb_cfg":48}}}}]

// Received notification for its own port-binding update
2018-03-20T00:19:36.809Z|66069|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"Port_Binding":{"57b18b07-c373-4feb-b539-b8ec20edd89e":{"modify":{"chassis":["uuid","b2f297f6-cd8b-4429-94f0-274a24d6640a"]}}}}]
2018-03-20T00:19:36.809Z|66070|jsonrpc|DBG|tcp:192.168.10.10:6640: received
reply, result=[{"count":1}], id=2177

// Send update to respond the nb_cfg update
2018-03-20T00:19:41.315Z|66075|jsonrpc|DBG|tcp:192.168.10.10:6640: send
request, method="transact",
params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","e2e7e93f-f640-4860-a3aa-fbd26e18ca97"]]],"row":{"nb_cfg":48},"op":"update","table":"Chassis_NB_Cfg"}],
id=2179

// Received the notification for the new monitor condition related data
(it's huge data)
2018-03-20T00:19:42.249Z|66276|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"Multicast_Group":{"944391c4-7bf9-4d46-9662-966f9a5450e0":{"insert":{"ports":["set",[["uuid","00d22282-ac38-4a4e-8f47-3684fa915051"],["uuid","03010cdb-647b-4a1d-b0ca-7f4b0e7a6862"],["uuid","03ba954f-5c96-47f0-...(huge
data)...

2018-03-20T00:19:42.588Z|66277|jsonrpc|DBG|tcp:192.168.10.10:6640: received
reply, result={}, id=2178

// Received the lflow update caused by the port status change: up=true
2018-03-20T00:19:42.588Z|66278|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"Logical_Flow":{"fac832c4-461d-4030-8649-9fccca5d6a3b":{"insert":{"match":"arp.tpa
== 172.145.222.222 && arp.op == 1 && inport ==
\"lsp1\"","pipeline":"ingress","priority":100,"logical_datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"],"table_id":11,"external_ids":["map",[["source","ovn-northd.c:3723"],["stage-name","ls_in_arp_rsp"]]],"actions":"next;"}},"0b719145-7bf6-4f26-a2b5-a7a5c8a9494e":{"insert":{"match":"arp.tpa
== 172.145.222.222 && arp.op ==
1","pipeline":"ingress","priority":50,"logical_datapath":["uuid","bb979633-789e-4ef8-9663-51839f000c0f"],"table_id":11,"external_ids":["map",[["source","ovn-northd.c:3707"],["stage-name","ls_in_arp_rsp"]]],"actions":"eth.dst
= eth.src; eth.src = aa:aa:aa:aa:aa:aa; arp.op = 2; /* ARP reply */ arp.tha
= arp.sha; arp.sha = aa:aa:aa:aa:aa:aa; arp.tpa = arp.spa; arp.spa =
172.145.222.222; outport = inport; flags.loopback = 1; output;"}}}}]

// Received the change notification for its own update for the first nb_cfg
change
2018-03-20T00:19:42.588Z|66279|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"Chassis_NB_Cfg":{"e2e7e93f-f640-4860-a3aa-fbd26e18ca97":{"modify":{"nb_cfg":48}}}}]

2018-03-20T00:19:42.588Z|66280|jsonrpc|DBG|tcp:192.168.10.10:6640: received
reply, result=[{"count":1}], id=2179

// Received the nb_cfg change triggered by the second --wait=hv sync command
2018-03-20T00:19:54.736Z|72055|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"SB_Global":{"555d0a36-82d4-45b4-88cc-c9164d913066":{"modify":{"nb_cfg":49}}}}]


// ...
2018-03-20T00:19:59.693Z|72060|jsonrpc|DBG|tcp:192.168.10.10:6640: send
request, method="transact",
params=["OVN_Southbound",{"where":[["_uuid","==",["uuid","e2e7e93f-f640-4860-a3aa-fbd26e18ca97"]]],"row":{"nb_cfg":49},"op":"update","table":"Chassis_NB_Cfg"}],
id=2181

// ...
2018-03-20T00:19:59.694Z|72065|jsonrpc|DBG|tcp:192.168.10.10:6640: received
notification, method="update2",
params=["monid",{"Chassis_NB_Cfg":{"e2e7e93f-f640-4860-a3aa-fbd26e18ca97":{"modify":{"nb_cfg":49}}}}]

2018-03-20T00:19:59.694Z|72066|jsonrpc|DBG|tcp:192.168.10.10:6640: received
reply, result=[{"count":1}], id=2181

====== the end (the correct one) =====

Thanks,
Han


More information about the dev mailing list