[ovs-dev] [PATCH] netlink-socket: Check for null sock in nl_sock_recv__()

David Christensen drc at linux.vnet.ibm.com
Thu Nov 18 21:19:49 UTC 2021



On 11/18/21 11:56 AM, Murilo Opsfelder Araújo wrote:
> On 11/16/21 19:31, Ilya Maximets wrote:
>> On 10/25/21 19:45, David Christensen wrote:
>>> In certain high load situations, such as when creating a large number of
>>> ports on a switch, the parameter 'sock' may be passed to 
>>> nl_sock_recv__()
>>> as null, resulting in a segmentation fault when 'sock' is later
>>> dereferenced, such as when calling recvmsg().
>>
>> Hi, David.  Thanks for the patch.
>>
>> It's OK to check for a NULL pointer there, I guess.  However,
>> do you know from where it was actually called?  This function,
>> in general, should not be called without the actual socket,
>> so we, probably, should fix the caller instead.
>>
>> Best regards, Ilya Maximets.
> 
> Hi, Ilya Maximets.
> 
> When I looked at the coredump file, ch->sock was nil and was passed to 
> nl_sock_recv():
> 
> (gdb) l
> 2701
> 2702        while (handler->event_offset < handler->n_events) {
> 2703            int idx = 
> handler->epoll_events[handler->event_offset].data.u32;
> 2704            struct dpif_channel *ch = &dpif->channels[idx];
> 
> (gdb) p idx
> $26 = 4
> (gdb) p *dpif->channels at 5
> $27 = {{sock = 0x1001ae88240, last_poll = -9223372036854775808}, {sock = 
> 0x1001aa9a8a0, last_poll = -9223372036854775808}, {sock = 0x1001ae09510, 
> last_poll = 60634070}, {sock = 0x1001a9dbb60, last_poll = 60756950}, 
> {sock = 0x0,
>      last_poll = 61340749}}
> 
> 
> The above snippet is from lib/dpif-netlink.c and the caller is 
> dpif_netlink_recv_vport_dispatch().
> 
> The channel at idx=4 had sock=0x0, which was passed to nl_sock_recv() 
> via ch->sock parameter.
> In that function, it tried to access sock->fd when calling recvmsg(), 
> causing the segfault.
> 
> I'm not enough experienced in Open vSwitch to explain why sock was nil 
> at that given index.
> The fix seems worth, though.

A few other points of note:

- Test system was a very large configuration (2K CPUs, > 1TB RAM)
- OVS Switch was configured with 6K ports as follows:

# b=br0; cmds=; for i in {1..6000}; do cmds+=" -- add-port $b p$i -- set 
interface p$i type=internal"; done
# time sudo ovs-vsctl $cmds

- OVS was installed from RHEL RPM.  Build from source did not exhibit 
the same problem.
- Unable to reproduce on a different system (128 CPUs, 256GB RAM), even 
with 10K ports.

Dave


More information about the dev mailing list