[ovs-dev] ovs-vswitchd too large memory consumption with OVN stateless ACL

Vladislav Odintsov odivlad at gmail.com
Fri Nov 19 18:12:48 UTC 2021


Hi,

I’m testing OVN stateless ACL rules with `$port_group_ipVERSION` in match portion.
There’s a strange behaviour and sometimes I got configuration, which totally kills my transport nodes, where logical switch ports reside.
ovs-vswitchd and ovn-controller processes utilise 100% 1 core CPU each and ovs-vswitchd consumes all free memory and repeatedly got killed by OOM-killer. It consumes 5GB memory in 5-10 seconds!

I reproduced this with OVS 2.13.4 & OVN main, but also tried with actual OVS master branch and the problem still reproduces.

Below are steps to reproduce:

* It’s better to reproduce this problem with limited by memory ovs-vswitchd with cgroups or something similar *

1. ovn main branch; openvswitch master branch.
2. deploy one-node ovn installation with ovn-controller
3. create ovn topology:
ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 lsp1
ovn-nbctl lsp-add ls1 lsp2
ovn-nbctl lsp-add ls1 lsp3
ovn-nbctl lsp-add ls1 lsp4
ovn-nbctl lsp-set-addresses lsp1 '0a:01:bf:a1:81:e1 172.31.16.6'
ovn-nbctl lsp-set-addresses lsp2 '0a:01:e7:98:59:61 172.31.16.8'
ovn-nbctl lsp-set-addresses lsp3 '0a:01:85:bf:9d:a1 172.31.16.9'
ovn-nbctl lsp-set-addresses lsp4 '0a:01:a1:d7:e9:01 172.31.16.7'
ovn-nbctl pg-add pg1 lsp1 lsp2 lsp3 lsp4
ovn-nbctl acl-add pg1 from-lport 2000 'ip4.src != $pg1_ip4' allow-stateless

# if not limited ovs-vswitchd with cgroups, it’s recommended to do so or stop ovn-controller here not to get stuck with exhausted system resources.

4. then plug LSPs to ovs:
ovs-vsctl add-port br-int p1 -- set interface p1 external-ids:iface-id=lsp1
ovs-vsctl add-port br-int p2 -- set interface p2 external-ids:iface-id=lsp2
ovs-vsctl add-port br-int p3 -- set interface p3 external-ids:iface-id=lsp3
ovs-vsctl add-port br-int p4 -- set interface p4 external-ids:iface-id=lsp4

5. carefully start ovn-controller; check ovs-vswitchd memory consumption and ovn-controller & ovs-vswitchd CPU consumption; stop ovn-controller in order not to kill your machine.

In OVS logs there are no abnormal messages:
2021-11-19T18:09:36.876Z|00039|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.16.90
2021-11-19T18:10:07.423Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2021-11-19T18:10:07.429Z|00002|ovs_numa|INFO|Discovered 16 CPU cores on NUMA node 0
2021-11-19T18:10:07.429Z|00003|ovs_numa|INFO|Discovered 16 CPU cores on NUMA node 1
2021-11-19T18:10:07.429Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 32 CPU cores
2021-11-19T18:10:07.429Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2021-11-19T18:10:07.429Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2021-11-19T18:10:07.448Z|00007|ofproto_dpif|INFO|system at ovs-system: Datapath supports recirculation
2021-11-19T18:10:07.449Z|00008|ofproto_dpif|INFO|system at ovs-system: VLAN header stack length probed as 2
2021-11-19T18:10:07.449Z|00009|ofproto_dpif|INFO|system at ovs-system: MPLS label stack length probed as 3
2021-11-19T18:10:07.449Z|00010|ofproto_dpif|INFO|system at ovs-system: Datapath supports truncate action
2021-11-19T18:10:07.449Z|00011|ofproto_dpif|INFO|system at ovs-system: Datapath supports unique flow ids
2021-11-19T18:10:07.449Z|00012|ofproto_dpif|INFO|system at ovs-system: Datapath supports clone action
2021-11-19T18:10:07.449Z|00013|ofproto_dpif|INFO|system at ovs-system: Max sample nesting level probed as 10
2021-11-19T18:10:07.449Z|00014|ofproto_dpif|INFO|system at ovs-system: Datapath supports eventmask in conntrack action
2021-11-19T18:10:07.449Z|00015|ofproto_dpif|INFO|system at ovs-system: Datapath supports ct_clear action
2021-11-19T18:10:07.449Z|00016|ofproto_dpif|INFO|system at ovs-system: Max dp_hash algorithm probed to be 0
2021-11-19T18:10:07.449Z|00017|ofproto_dpif|INFO|system at ovs-system: Datapath supports check_pkt_len action
2021-11-19T18:10:07.449Z|00018|ofproto_dpif|INFO|system at ovs-system: Datapath supports timeout policy in conntrack action
2021-11-19T18:10:07.449Z|00019|ofproto_dpif|INFO|system at ovs-system: Datapath supports ct_zero_snat
2021-11-19T18:10:07.449Z|00020|ofproto_dpif|INFO|system at ovs-system: Datapath supports ct_state
2021-11-19T18:10:07.449Z|00021|ofproto_dpif|INFO|system at ovs-system: Datapath supports ct_zone
2021-11-19T18:10:07.449Z|00022|ofproto_dpif|INFO|system at ovs-system: Datapath supports ct_mark
2021-11-19T18:10:07.449Z|00023|ofproto_dpif|INFO|system at ovs-system: Datapath supports ct_label
2021-11-19T18:10:07.449Z|00024|ofproto_dpif|INFO|system at ovs-system: Datapath supports ct_state_nat
2021-11-19T18:10:07.449Z|00025|ofproto_dpif|INFO|system at ovs-system: Datapath supports ct_orig_tuple
2021-11-19T18:10:07.449Z|00026|ofproto_dpif|INFO|system at ovs-system: Datapath supports ct_orig_tuple6
2021-11-19T18:10:07.449Z|00027|ofproto_dpif|INFO|system at ovs-system: Datapath does not support IPv6 ND Extensions
2021-11-19T18:10:07.450Z|00028|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 32, setting n-revalidator-threads to 9
2021-11-19T18:10:07.450Z|00029|ofproto_dpif_upcall|INFO|Starting 41 threads
2021-11-19T18:10:07.508Z|00030|bridge|INFO|bridge br-int: added interface ovn-ai10-0 on port 4
2021-11-19T18:10:07.508Z|00031|bridge|INFO|bridge br-int: added interface br-int on port 65534
2021-11-19T18:10:07.509Z|00032|bridge|INFO|bridge br-int: added interface ovn-ai07-0 on port 5
2021-11-19T18:10:07.509Z|00033|bridge|INFO|bridge br-int: added interface ovn-ai05-0 on port 6
2021-11-19T18:10:07.509Z|00034|bridge|INFO|bridge br-int: added interface ovn-ai08-0 on port 7
2021-11-19T18:10:07.509Z|00035|bridge|INFO|bridge br-int: added interface vif0-DA9875C1 on port 3
2021-11-19T18:10:07.510Z|00036|bridge|INFO|bridge br-int: added interface ovn-ai09-0 on port 8
2021-11-19T18:10:07.511Z|00037|bridge|INFO|bridge br-int: using datapath ID 000026923f220147
2021-11-19T18:10:07.511Z|00038|connmgr|INFO|br-int: added service controller "punix:/var/run/openvswitch/br-int.mgmt"
2021-11-19T18:10:07.517Z|00039|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.16.90


in ovn-controller:

2021-11-19T18:10:07.597Z|00003|reconnect|INFO|unix:/run/openvswitch/db.sock: connected
2021-11-19T18:10:07.602Z|00004|main|INFO|OVN internal version is : [21.09.90-20.21.0-56.3]
2021-11-19T18:10:07.602Z|00005|main|INFO|OVS IDL reconnected, force recompute.
2021-11-19T18:10:07.606Z|00006|reconnect|INFO|ssl:172.20.34.104:6642: connecting...
2021-11-19T18:10:07.606Z|00007|main|INFO|OVNSB IDL reconnected, force recompute.
2021-11-19T18:10:07.606Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt: connecting to switch
2021-11-19T18:10:07.606Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt: connecting...
2021-11-19T18:10:07.606Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt: connected
2021-11-19T18:10:07.658Z|00008|reconnect|INFO|ssl:172.20.34.104:6642: connected
2021-11-19T18:10:07.666Z|00009|features|INFO|unix:/run/openvswitch/br-int.mgmt: connecting to switch
2021-11-19T18:10:07.666Z|00010|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connecting...
2021-11-19T18:10:07.666Z|00011|features|INFO|OVS Feature: ct_zero_snat, state: supported
2021-11-19T18:10:07.666Z|00012|main|INFO|OVS feature set changed, force recompute.
2021-11-19T18:10:07.666Z|00013|ofctrl|INFO|unix:/run/openvswitch/br-int.mgmt: connecting to switch
2021-11-19T18:10:07.666Z|00014|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connecting...
2021-11-19T18:10:07.667Z|00015|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connected
2021-11-19T18:10:07.682Z|00016|rconn|INFO|unix:/run/openvswitch/br-int.mgmt: connected
2021-11-19T18:10:07.683Z|00017|main|INFO|OVS feature set changed, force recompute.
2021-11-19T18:10:09.012Z|00018|timeval|WARN|Unreasonably long 1326ms poll interval (1140ms user, 185ms system)
2021-11-19T18:10:09.012Z|00019|timeval|WARN|faults: 138008 minor, 0 major
2021-11-19T18:10:09.012Z|00020|timeval|WARN|context switches: 1 voluntary, 2 involuntary
2021-11-19T18:10:09.013Z|00021|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=81c8f620:
2021-11-19T18:10:09.013Z|00022|coverage|INFO|lflow_run                  0.0/sec     0.000/sec        0.0000/sec   total: 2
2021-11-19T18:10:09.013Z|00023|coverage|INFO|lflow_cache_add_expr       0.0/sec     0.000/sec        0.0000/sec   total: 24
2021-11-19T18:10:09.013Z|00024|coverage|INFO|lflow_cache_add_matches    0.0/sec     0.000/sec        0.0000/sec   total: 86
2021-11-19T18:10:09.013Z|00025|coverage|INFO|lflow_cache_add            0.0/sec     0.000/sec        0.0000/sec   total: 110
2021-11-19T18:10:09.013Z|00026|coverage|INFO|lflow_cache_miss           0.0/sec     0.000/sec        0.0000/sec   total: 131
2021-11-19T18:10:09.013Z|00027|coverage|INFO|physical_run               0.0/sec     0.000/sec        0.0000/sec   total: 3
2021-11-19T18:10:09.013Z|00028|coverage|INFO|miniflow_malloc            0.0/sec     0.000/sec        0.0000/sec   total: 257413
2021-11-19T18:10:09.013Z|00029|coverage|INFO|hmap_pathological          0.0/sec     0.000/sec        0.0000/sec   total: 65
2021-11-19T18:10:09.013Z|00030|coverage|INFO|hmap_expand                0.0/sec     0.000/sec        0.0000/sec   total: 1515
2021-11-19T18:10:09.013Z|00031|coverage|INFO|txn_unchanged              0.0/sec     0.000/sec        0.0000/sec   total: 19
2021-11-19T18:10:09.013Z|00032|coverage|INFO|txn_incomplete             0.0/sec     0.000/sec        0.0000/sec   total: 13
2021-11-19T18:10:09.013Z|00033|coverage|INFO|txn_success                0.0/sec     0.000/sec        0.0000/sec   total: 2
2021-11-19T18:10:09.013Z|00034|coverage|INFO|poll_create_node           0.0/sec     0.000/sec        0.0000/sec   total: 173
2021-11-19T18:10:09.013Z|00035|coverage|INFO|poll_zero_timeout          0.0/sec     0.000/sec        0.0000/sec   total: 5
2021-11-19T18:10:09.013Z|00036|coverage|INFO|rconn_queued               0.0/sec     0.000/sec        0.0000/sec   total: 128717
2021-11-19T18:10:09.013Z|00037|coverage|INFO|rconn_sent                 0.0/sec     0.000/sec        0.0000/sec   total: 32380
2021-11-19T18:10:09.013Z|00038|coverage|INFO|seq_change                 0.0/sec     0.000/sec        0.0000/sec   total: 67
2021-11-19T18:10:09.013Z|00039|coverage|INFO|pstream_open               0.0/sec     0.000/sec        0.0000/sec   total: 1
2021-11-19T18:10:09.013Z|00040|coverage|INFO|stream_open                0.0/sec     0.000/sec        0.0000/sec   total: 5
2021-11-19T18:10:09.013Z|00041|coverage|INFO|util_xalloc                0.0/sec     0.000/sec        0.0000/sec   total: 2111348
2021-11-19T18:10:09.013Z|00042|coverage|INFO|vconn_open                 0.0/sec     0.000/sec        0.0000/sec   total: 3
2021-11-19T18:10:09.013Z|00043|coverage|INFO|vconn_received             0.0/sec     0.000/sec        0.0000/sec   total: 13
2021-11-19T18:10:09.013Z|00044|coverage|INFO|vconn_sent                 0.0/sec     0.000/sec        0.0000/sec   total: 32384
2021-11-19T18:10:09.013Z|00045|coverage|INFO|netlink_received           0.0/sec     0.000/sec        0.0000/sec   total: 16
2021-11-19T18:10:09.013Z|00046|coverage|INFO|netlink_recv_jumbo         0.0/sec     0.000/sec        0.0000/sec   total: 3
2021-11-19T18:10:09.013Z|00047|coverage|INFO|netlink_sent               0.0/sec     0.000/sec        0.0000/sec   total: 16
2021-11-19T18:10:09.013Z|00048|coverage|INFO|cmap_expand                0.0/sec     0.000/sec        0.0000/sec   total: 3
2021-11-19T18:10:09.013Z|00049|coverage|INFO|107 events never hit
2021-11-19T18:10:31.001Z|00004|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt: connection closed by peer
2021-11-19T18:10:32.039Z|00050|memory|INFO|293212 kB peak resident set size after 23.5 seconds
2021-11-19T18:10:32.059Z|00005|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt: connecting...
2021-11-19T18:10:32.059Z|00006|rconn(ovn_pinctrl0)|WARN|unix:/run/openvswitch/br-int.mgmt: connection failed (Connection refused)
2021-11-19T18:10:32.059Z|00007|rconn(ovn_pinctrl0)|INFO|unix:/run/openvswitch/br-int.mgmt: waiting 2 seconds before reconnect

I couldn’t get any source of the problem except to find the steps to reproduce.
Can somebody please take a look on this?
This looks like a potential serious problem for OVN transport nodes.

Thanks.

Regards,
Vladislav Odintsov



More information about the dev mailing list