[ovs-dev] ovn-controller appears to be in a tight loop

Kyle Mestery mestery at mestery.com
Mon Jan 11 02:39:00 UTC 2016


On Sun, Jan 10, 2016 at 7:34 PM, Kyle Mestery <mestery at mestery.com> wrote:

> Greetings:
>
> I'm running the latest OVN/OVS code from last Friday along with OpenStack
> master from the same time period. I've set up a 5-node OpenStack cluster
> using the devstack scripts from the networking-ovn repository. This means
> that the ovsdb-server instance on the all-in-one OpenStack node is hosting
> the OVN databases as well. The nodes all have 56 CPU cores and 256 GB of
> RAM each.
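>
> For reference, the plugin setup in our devstack local.conf boils down to
> roughly the following (the real file has more in it, so treat this as a
> sketch rather than the exact configuration):
>
>   [[local|localrc]]
>   # The networking-ovn devstack plugin builds OVS/OVN from source and, on
>   # the all-in-one node, starts ovn-northd and ovn-controller and adds the
>   # OVN NB/SB databases to the local ovsdb-server.
>   enable_plugin networking-ovn http://git.openstack.org/openstack/networking-ovn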
>
> Last week as part of some scale testing, we did the following:
>
> 1. Create 100 logical networks in a single OpenStack tenant.
> 2. Boot 2 VMs per logical network.
>
> The system was in perfect working order and the VMs were spread across all
> 5 nodes.
>
> We next created an additional 100 logical networks in OpenStack, then
> created 100 routers, attaching 2 of the 200 networks to each router. At
> that point, we let the system sit all weekend.
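>
> For context, the networks, VMs, and routers were created with a script
> along these lines. This is a simplified sketch using python-neutronclient
> and python-novaclient rather than our exact script, and the credentials,
> image, and flavor below are placeholders:
>
> from neutronclient.v2_0 import client as neutron_client
> from novaclient import client as nova_client
>
> # Placeholder credentials and endpoints, not our real values.
> neutron = neutron_client.Client(username='admin', password='secret',
>                                 tenant_name='demo',
>                                 auth_url='http://127.0.0.1:5000/v2.0')
> nova = nova_client.Client('2', 'admin', 'secret', 'demo',
>                           'http://127.0.0.1:5000/v2.0')
>
> IMAGE = '<image-uuid>'     # placeholder
> FLAVOR = '<flavor-id>'     # placeholder
>
> def make_networks(count, prefix, base, vms_per_net=0):
>     """Create `count` networks with one /24 subnet each; optionally boot VMs."""
>     subnet_ids = []
>     for i in range(count):
>         net = neutron.create_network(
>             {'network': {'name': '%s-%d' % (prefix, i)}})['network']
>         sub = neutron.create_subnet(
>             {'subnet': {'network_id': net['id'], 'ip_version': 4,
>                         'cidr': '10.%d.%d.0/24' % (base, i)}})['subnet']
>         subnet_ids.append(sub['id'])
>         for j in range(vms_per_net):
>             nova.servers.create('vm-%s-%d-%d' % (prefix, i, j), IMAGE, FLAVOR,
>                                 nics=[{'net-id': net['id']}])
>     return subnet_ids
>
> # Phase 1: 100 networks with 2 VMs each.  Phase 2: 100 more networks.
> subnets = make_networks(100, 'net-a', 1, vms_per_net=2)
> subnets += make_networks(100, 'net-b', 2)
>
> # Phase 3: 100 routers, attaching 2 of the 200 networks (via their subnets)
> # to each router.
> for i in range(100):
>     router = neutron.create_router({'router': {'name': 'r-%d' % i}})['router']
>     for sub_id in subnets[2 * i:2 * i + 2]:
>         neutron.add_interface_router(router['id'], {'subnet_id': sub_id})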
>
> I looked at the system today and noticed that ovn-controller on the
> all-in-one OpenStack node was spinning at 100% CPU. In its screen session,
> I see the following being logged continuously:
>
> 2016-01-11T01:31:55Z|70989|jsonrpc|WARN|tcp:10.139.196.55:6640: send
> error: Broken pipe
> 2016-01-11T01:31:55Z|70990|reconnect|WARN|tcp:10.139.196.55:6640:
> connection dropped (Broken pipe)
> 2016-01-11T01:31:56Z|70991|reconnect|INFO|tcp:10.139.196.55:6640:
> connecting...
> 2016-01-11T01:31:57Z|70992|poll_loop|INFO|Dropped 29 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:31:57Z|70993|poll_loop|INFO|wakeup due to [POLLIN] on fd 8
> (<->/usr/local/var/run/openvswitch/db.sock) at lib/stream-fd.c:155 (101%
> CPU usage)
> 2016-01-11T01:31:57Z|70994|reconnect|INFO|tcp:10.139.196.55:6640:
> connected
> 2016-01-11T01:32:03Z|70995|poll_loop|INFO|Dropped 34 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:03Z|70996|poll_loop|INFO|wakeup due to [POLLIN] on fd 11
> (<->/usr/local/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:155
> (101% CPU usage)
> 2016-01-11T01:32:09Z|70997|poll_loop|INFO|Dropped 35 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:09Z|70998|poll_loop|INFO|wakeup due to [POLLIN] on fd 8
> (<->/usr/local/var/run/openvswitch/db.sock) at lib/stream-fd.c:155 (100%
> CPU usage)
> 2016-01-11T01:32:15Z|70999|poll_loop|INFO|Dropped 32 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:15Z|71000|poll_loop|INFO|wakeup due to [POLLIN] on fd 8
> (<->/usr/local/var/run/openvswitch/db.sock) at lib/stream-fd.c:155 (102%
> CPU usage)
> 2016-01-11T01:32:21Z|71001|poll_loop|INFO|Dropped 32 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:21Z|71002|poll_loop|INFO|wakeup due to [POLLIN] on fd 8
> (<->/usr/local/var/run/openvswitch/db.sock) at lib/stream-fd.c:155 (100%
> CPU usage)
> 2016-01-11T01:32:27Z|71003|poll_loop|INFO|Dropped 34 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:27Z|71004|poll_loop|INFO|wakeup due to [POLLIN] on fd 8
> (<->/usr/local/var/run/openvswitch/db.sock) at lib/stream-fd.c:155 (102%
> CPU usage)
> 2016-01-11T01:32:33Z|71005|poll_loop|INFO|Dropped 32 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:33Z|71006|poll_loop|INFO|wakeup due to [POLLIN] on fd 11
> (<->/usr/local/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:155
> (102% CPU usage)
> 2016-01-11T01:32:39Z|71007|poll_loop|INFO|Dropped 36 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:39Z|71008|poll_loop|INFO|wakeup due to [POLLIN] on fd 9 (
> 10.139.196.55:34440<->10.139.196.55:6640) at lib/stream-fd.c:155 (103%
> CPU usage)
> 2016-01-11T01:32:45Z|71009|poll_loop|INFO|Dropped 32 log messages in last
> 5 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:45Z|71010|poll_loop|INFO|wakeup due to [POLLIN] on fd 8
> (<->/usr/local/var/run/openvswitch/db.sock) at lib/stream-fd.c:155 (100%
> CPU usage)
> 2016-01-11T01:32:51Z|71011|poll_loop|INFO|Dropped 30 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:51Z|71012|poll_loop|INFO|wakeup due to [POLLIN] on fd 8
> (<->/usr/local/var/run/openvswitch/db.sock) at lib/stream-fd.c:155 (100%
> CPU usage)
> 2016-01-11T01:32:57Z|71013|poll_loop|INFO|Dropped 32 log messages in last
> 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2016-01-11T01:32:57Z|71014|poll_loop|INFO|wakeup due to [POLLIN] on fd 8
> (<->/usr/local/var/run/openvswitch/db.sock) at lib/stream-fd.c:155 (101%
> CPU usage)
>
> Looking at strace, I see the following:
>
> brk(0)                                  = 0x98cf000
> brk(0x98f0000)                          = 0x98f0000
> brk(0)                                  = 0x98f0000
> brk(0)                                  = 0x98f0000
> brk(0x98da000)                          = 0x98da000
> brk(0)                                  = 0x98da000
> brk(0)                                  = 0x98da000
> brk(0)                                  = 0x98da000
> brk(0x98cf000)                          = 0x98cf000
> brk(0)                                  = 0x98cf000
> accept(5, 0x7ffc1a792360, [128])        = -1 EAGAIN (Resource temporarily
> unavailable)
> poll([{fd=11, events=POLLIN}, {fd=6, events=POLLIN}, {fd=10,
> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=5,
> events=POLLIN}], 6, 1997) = 2 ([{fd=8, revents=POLLIN}, {fd=9,
> revents=POLLIN}])
> getrusage(0x1 /* RUSAGE_??? */, {ru_utime={199581, 410889}, ru_stime={454,
> 342812}, ...}) = 0
> recvfrom(9, " !ct.new && !ct.inv\",\"pipeline\":"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "\":\"ct_commit; next;\",\"external_i"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "7 && udp.dst == 68\",\"pipeline\":\""..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "-7eb25a6317b0\\\"; output;\",\"exter"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "ess\",\"logical_datapath\":[\"uuid\","..., 512, 0, NULL,
> NULL) = 512
> recvfrom(9, "1-8604-e46a009b0cdb\\\" && ip4 && "..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "fc-954b-4df0e97196f2\\\",\\\"3a13f6f"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "\\\",\\\"7a3d8dae-a089-4ec3-ad68-7ee"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "e9c-810f-5e8a995a3166\\\",\\\"c470e3"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "ath\":[\"uuid\",\"4459e36d-254f-4378"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "(inport == \\\"3c6971b4-fd63-48d1-"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "\"priority\":50,\"actions\":\"next;\","..., 512, 0, NULL,
> NULL) = 512
> recvfrom(9, "b59943c1\\\",\\\"0e82f9cd-4427-4138-"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "c-fda7-4fcd-b028-7b86acad7d7c\\\","..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "1880eb668\\\",\\\"3c6971b4-fd63-48d1"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "95-9345-4f58-87f5-3aaa6af88408\\\""..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "e79b865a08\\\",\\\"6ce53eec-2a7b-4fa"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "e50-3723-4df3-ba22-204c9e82703a\\"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "46a009b0cdb\\\",\\\"8f3634ed-525e-42"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "004c-0ddf-4024-b8c5-dbced31accda"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "05805c27d270\\\",\\\"d3a46ef8-ce4d-4"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "c1540-bcb9-4a94-9ce1-73b322a8b11"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "-19497dffcd4f\\\",\\\"fc77809a-3313-"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "s\":\"drop;\",\"external_ids\":[\"map\""..., 512, 0, NULL,
> NULL) = 512
> recvfrom(9, "ine\":\"egress\",\"logical_datapath\""..., 512, 0, NULL,
> NULL) = 512
> recvfrom(9, "\"pipeline\":\"ingress\",\"logical_da"..., 512, 0, NULL,
> NULL) = 512
> recvfrom(9, ":\"egress\",\"logical_datapath\":[\"u"..., 512, 0, NULL,
> NULL) = 512
> recvfrom(9, "1b:9b:a9; arp.tpa = arp.spa; arp"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, ":{\"match\":\"ct.new && (outport =="..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "9cd-4427-4138-97ad-b9e78924ad9d\\"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "548b0e39611\\\",\\\"21567b0a-3b52-49"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "bbeb-caa8-42e6-ba44-64ad9eaa6f0a"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "b1bc941a4c95\\\",\\\"45600170-9efc-4"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "4f521-3b42-49e0-8006-d00210f44d3"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "-416a9f8218a0\\\",\\\"5f9a58de-f858-"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "28258f-aace-4945-a8c9-c15f8f1ca1"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "4-f658960f1c60\\\",\\\"7a3d8dae-a089"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "614ec4d-5331-4aa7-96b3-c439cb1e3"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "7a-c9e60461580e\\\",\\\"96ff7a31-832"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "b114d781-6047-4121-b46d-5d8bc948"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "ce5-13612c793cc9\\\",\\\"c716c791-75"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "\"d512ca88-ab67-44a8-a988-d800ef9"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "9ce1-73b322a8b113\\\",\\\"e8a030c1-4"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "\\\"f8f360fc-2701-4cde-bdcf-cbffbb"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "port == \\\"3739db2d-6b23-44eb-b44"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(9, "gical_datapath\":[\"uuid\",\"0c529bb"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, ":[\"map\",[[\"stage-name\",\"lr_in_ad"..., 512, 0, NULL,
> NULL) = 512
> recvfrom(9, "ap\",[[\"stage-name\",\"ls_out_pre_a"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, "gress\",\"logical_datapath\":[\"uuid"..., 512, 0, NULL, NULL)
> = 512
> recvfrom(9, " == \\\"a5ac0686-bf49-40c2-888a-4c"..., 512, 0, NULL, NULL) =
> 512
> recvfrom(8, "{\"id\":null,\"method\"", 19, 0, NULL, NULL) = 19
> recvfrom(8, ":\"update2\",\"params\":[null,{\"Port"..., 512, 0, NULL,
> NULL) = 512
> recvfrom(8, "odify\":{\"ports\":[\"uuid\",\"44948ac"..., 512, 0, NULL,
> NULL) = 369
> recvfrom(8, 0x1bf18b1, 143, 0, 0, 0)    = -1 EAGAIN (Resource temporarily
> unavailable)
> recvfrom(10, 0x449e370, 8, 0, 0, 0)     = -1 EAGAIN (Resource temporarily
> unavailable)
> recvfrom(11, 0x6ea6680, 8, 0, 0, 0)     = -1 EAGAIN (Resource temporarily
> unavailable)
>
> The above logs loop over and over.
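>
> For anyone not steeped in these poll_loop messages: ovn-controller's main
> loop is essentially poll() over its sockets followed by processing, so if
> one of those fds stays readable (or the handler never drains it), poll()
> returns immediately on every iteration and the process pegs a core. A toy
> Python illustration of that failure mode, not ovn-controller's actual code:
>
> import select
> import socket
>
> # Leave one byte sitting unread in a socket's receive buffer. With a
> # level-triggered poll(), every call then returns POLLIN immediately,
> # ignoring the timeout, which is the same shape as the constant fd 8/9/11
> # wakeups in the logs above.
> a, b = socket.socketpair()
> b.send(b'x')                       # data arrives on `a`...
>
> poller = select.poll()
> poller.register(a.fileno(), select.POLLIN)
>
> for _ in range(5):                 # bounded here; a real daemon would spin
>     poller.poll(2000)              # returns at once despite the 2s timeout
>     # ...but nothing ever reads from `a`, so POLLIN stays asserted.
> print('5 wakeups without ever blocking')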
>
> I'm curious whether anyone has ideas about what might be going wrong here,
> or whether there is any other information I should collect at this point. I
> plan to do even heavier scale testing of OVN in the coming weeks, hopefully
> scaling to 10x the number of compute nodes.
>
>
One other data point:

The system logs on the compute nodes are filled with these ovs-vswitchd errors:

Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086376|netlink_socket|ERR|fcntl: Too many open files
Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086377|netlink_socket|ERR|fcntl: Too many open files
Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086378|netlink_socket|ERR|fcntl: Too many open files
Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086379|netlink_socket|ERR|fcntl: Too many open files
Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086380|netlink_socket|ERR|fcntl: Too many open files
Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086381|netlink_socket|ERR|fcntl: Too many open files
Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086382|netlink_socket|ERR|fcntl: Too many open files
Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086383|netlink_socket|ERR|fcntl: Too many open files
Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086384|netlink_socket|ERR|fcntl: Too many open files
Jan 10 20:38:00 oc-syd01-prod-compute-191 ovs-vswitchd:
ovs|110086385|netlink_socket|ERR|fcntl: Too many open files

It looks like something may be leaking file descriptors.
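
As a quick next step, I'll probably compare ovs-vswitchd's open fd count with
its limit. Something like this sketch, assuming the usual pidfile location
for a from-source (/usr/local) install:

import os

# Assumed pidfile path for an OVS built into /usr/local (matches the
# /usr/local/var/run/openvswitch paths in the logs above).
PIDFILE = '/usr/local/var/run/openvswitch/ovs-vswitchd.pid'
pid = int(open(PIDFILE).read())

# Count the fds the daemon currently has open...
nfds = len(os.listdir('/proc/%d/fd' % pid))

# ...and read its "Max open files" rlimit from /proc (getrlimit() would only
# report the limit of this script, not the daemon's).
limit = [line for line in open('/proc/%d/limits' % pid)
         if line.startswith('Max open files')][0]

print('ovs-vswitchd open fds: %d' % nfds)
print(limit.strip())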


> Thanks!
> Kyle
>


