[ovs-dev] Dealing with extreme kernel memory leaks

Stéphane Graber stgraber at ubuntu.com
Wed Mar 30 21:08:12 UTC 2022


Hello,

I'm using OVS 2.17.0 combined with OVN 22.03.0 on Ubuntu 20.04 and a
5.17.1 mainline kernel.

I'm trying to debug a severe kernel memory leak which is happening in
this environment.
Sadly, I don't have a clear reproducer or much idea of when it first
appeared. I went through about a year of metrics and some amount of
leakage may always have been present; its magnitude only changed
recently, to the point that my normal weekly server maintenance is no
longer frequent enough to take care of it.

Basically, I'm running an LXD + OVN setup on 3 servers; they all act
as OVN chassis with various priorities to spread the load.
All 3 servers also normally run a combination of containers and
virtual machines attached to about a dozen different OVN networks.

What I'm seeing is about 2MB/s of memory leakage (kmalloc-256 slub),
which after enabling slub debugging can be tracked down to
nf_ct_tmpl_alloc kernel calls, such as those made by the openvswitch
kernel module as part of its ovs_ct_copy_action function, which is
exposed as OVS_ACTION_ATTR_CT to userspace through the openvswitch
netlink API.
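(For reference, the per-call-site data below comes from SLUB user
tracking; a minimal sketch of how to enable it, assuming a kernel
built with CONFIG_SLUB_DEBUG — exact flag syntax may vary by version:)

```shell
# Sketch: enable SLUB allocation/free tracking for the suspect cache.
# Boot-time kernel parameter (the 'U' flag turns on alloc/free user tracking):
#   slub_debug=U,kmalloc-256
# Once booted, aggregated per-call-site traces appear in debugfs:
#   /sys/kernel/debug/slab/kmalloc-256/alloc_traces
#   /sys/kernel/debug/slab/kmalloc-256/free_traces
```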

This means that within a couple of days I'm dealing with just shy of
40GiB of those kmalloc-256 entries.
Here is one of the servers which has been running for just 6 hours:

```
root@abydos:~# uptime
 20:51:44 up  6:32,  1 user,  load average: 6.63, 5.95, 5.26

root@abydos:~# slabtop -o -s c | head -n10
 Active / Total Objects (% used)    : 24919212 / 25427299 (98.0%)
 Active / Total Slabs (% used)      : 541777 / 541777 (100.0%)
 Active / Total Caches (% used)     : 150 / 197 (76.1%)
 Active / Total Size (% used)       : 11576509.49K / 11680410.31K (99.1%)
 Minimum / Average / Maximum Object : 0.01K / 0.46K / 50.52K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
13048822 13048737  99%    0.75K 310688       42   9942016K kmalloc-256
2959671 2677549  90%    0.10K  75889       39    303556K buffer_head
460411 460411 100%    0.57K  16444       28    263104K radix_tree_node

root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/alloc_traces | sort -rn | head -n5
12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack] age=57/2975485/5871871 pid=1964-2048 cpus=0-31 nodes=0-1
 803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408 pid=0-683331 cpus=0-31 nodes=0-1
  32773 memcg_alloc_slab_cgroups+0x3d/0x90 age=386/4430883/5878515 pid=1-731302 cpus=0-31 nodes=0-1
   3861 do_seccomp+0xdb/0xb80 age=749613/4661870/5878386 pid=752-648826 cpus=0-31 nodes=0-1
   2314 device_add+0x504/0x920 age=751269/5665662/5883377 pid=1-648698 cpus=0-31 nodes=0-1

root@abydos:~# cat /sys/kernel/debug/slab/kmalloc-256/free_traces | sort -rn | head -n5
8129152 <not-available> age=4300785451 pid=0 cpus=0 nodes=0-1
2770915 reserve_sfa_size+0xdf/0x110 [openvswitch] age=1912/2970717/5881994 pid=1964-2069 cpus=0-31 nodes=0-1
1621182 dst_destroy+0x70/0xd0 age=4/3065853/5883592 pid=0-733033 cpus=0-31 nodes=0-1
 288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack] age=109/2985710/5879968 pid=0-733208 cpus=0-31 nodes=0-1
 134435 ovs_nla_free_flow_actions+0x68/0x90 [openvswitch] age=134/2955781/5883717 pid=0-733208 cpus=0-31 nodes=0-1
```

Here you can see 12M calls to nf_ct_tmpl_alloc but just 288k to nf_ct_tmpl_free.
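To put a number on the imbalance, one can subtract the free count from
the alloc count per call site. A small sketch using sample lines in the
same "count call-site ..." format as the traces above (the file names
here are illustrative, not real paths):

```shell
# Sketch: compute outstanding nf_ct templates from saved slub trace snippets.
# Sample data copied from the alloc_traces/free_traces output above.
cat > /tmp/alloc_traces.sample <<'EOF'
12203895 nf_ct_tmpl_alloc+0x55/0xb0 [nf_conntrack] age=57/2975485/5871871
803599 metadata_dst_alloc+0x25/0x50 age=2/2663973/5873408
EOF
cat > /tmp/free_traces.sample <<'EOF'
288686 nf_ct_tmpl_free+0x1b/0x30 [nf_conntrack] age=109/2985710/5879968
EOF

# The first whitespace-separated field is the cumulative object count
# for that call site.
allocs=$(awk '/nf_ct_tmpl_alloc/ {print $1}' /tmp/alloc_traces.sample)
frees=$(awk '/nf_ct_tmpl_free/ {print $1}' /tmp/free_traces.sample)
echo "outstanding templates: $((allocs - frees))"
```

With the counts above, that works out to roughly 11.9M conntrack
templates allocated and never freed.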

Things I've done so far to try to isolate the problem:

1) I've evacuated all workloads from the server, so the only thing
running on it is OVS vswitchd. This did not change anything.
2) I've added iptables/ip6tables raw-table rules marking all traffic
as NOTRACK. This did not change anything.
3) I've played with chassis assignment. This does change things: a
server with no active chassis shows no leakage (thankfully), and the
busier the network I move back to the host, the faster the leakage.
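For completeness, the raw-table rules in step 2 would have been along
these lines (a sketch; the actual rules may have been scoped
differently). Note that NOTRACK only applies to packets traversing the
host's netfilter hooks, not to conntrack lookups the openvswitch
module performs internally for its ct() action, which may be why this
step had no effect:

```shell
# Sketch: disable connection tracking for all traffic via the raw table,
# which is evaluated before conntrack. Requires root.
iptables  -t raw -A PREROUTING -j NOTRACK
iptables  -t raw -A OUTPUT     -j NOTRACK
ip6tables -t raw -A PREROUTING -j NOTRACK
ip6tables -t raw -A OUTPUT     -j NOTRACK
```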


I've had both Frode and Tom (CCed) assist with a variety of ideas and
questions, but while we have found some unrelated OVS and kernel
issues, we have yet to figure this one out. So I wanted to reach out
to the wider community to see if anyone has either seen something like
this before or has suggestions as to where to look next.

I can pretty easily rebuild the kernel, OVS, or OVN, and while this
cluster is a production environment, the fact that I can evacuate one
of the three servers with no user-visible impact makes it not too bad
to debug. Having to constantly reboot the entire setup to clear the
memory leak is the bigger annoyance right now :)

Stéphane

PS: The side OVS/kernel issue I'm referring to is
https://lore.kernel.org/netdev/20220330194244.3476544-1-stgraber@ubuntu.com/
That kernel change made it possible to track down an issue where OVN
logical routers respond properly to ICMP on their external address but
get into a recirculation loop when any other kind of traffic is thrown
at them (instead of immediately dropping or rejecting it).

