[ovs-dev] [OVS 1.4.6] UNIX domain socket problems

Zoltan Kiss zoltan.kiss at citrix.com
Tue May 27 20:09:10 UTC 2014


Hi,

Recently we saw a pool-wide storage fence in a XenServer 6.2 SP1 
environment. OVS 1.4.6+ was in use, built from that branch up to 
commit e58f54a4aa1.
Within a 20-minute period all 4 hosts in the pool lost network 
connectivity and HA restarted them automatically. First ovsdb-server 
logged heartbeat problems on the UNIX domain socket, shortly 
afterwards the iSCSI connections started to fail, and finally 
ovs-vswitchd reported that the UNIX domain socket was broken, 
consistently about 30 seconds after ovsdb-server.
It happened the same way on each host, close together in time. After 
the restart the hosts were stable for an hour, but then the problems 
started again, so the network backend was reverted to bridge.
The OOM killer did not run, so there was probably enough memory, and 
there is no sign that the disk filled up either.
Prior to this issue I often saw messages like the following in the 
log, but I'm not sure whether they are related:

May 15 16:22:47 host ovs-vswitchd: 21539|dpif|WARN|system at xapi3: recv 
failed (No buffer space available)
May 15 16:22:47 host ovs-vswitchd: 21540|dpif|WARN|system at xapi3: recv 
failed (No buffer space available)
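For what it's worth, that recv error is ENOBUFS. On a Netlink socket (which the dpif layer uses to receive upcalls from the kernel datapath), ENOBUFS on recv means the kernel had to drop messages because the socket's receive buffer overflowed, i.e. ovs-vswitchd wasn't draining upcalls fast enough. A quick, OVS-independent way to confirm the errno mapping:

```python
import errno
import os

# ENOBUFS is the errno behind "No buffer space available".
# Per netlink(7), recv() failing with ENOBUFS means the kernel
# discarded messages because the socket receive buffer was full.
print(errno.ENOBUFS, os.strerror(errno.ENOBUFS))
```

So those warnings indicate dropped upcalls under load, which fits the picture of a daemon that later stalls outright.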

I'm including some log excerpts from one server:

May 15 16:22:58 host ovsdb-server: 
00023|reconnect|ERR|unix:/tmp/stream-unix.5393.0: no response to 
inactivity probe after 5.01 seconds, disconnecting
May 15 16:23:11 host kernel: [3328870.137299]  connection3:0: ping 
timeout of 15 secs expired, recv timeout 10, last rx 332854265, last 
ping 332855265, now 332856765
May 15 16:23:11 host kernel: [3328870.137323]  connection3:0: detected 
conn error (1011)
May 15 16:23:12 host iscsid: Kernel reported iSCSI connection 3:0 error 
(1011) state (3)
May 15 16:23:12 host kernel: [3328871.019145]  connection2:0: ping 
timeout of 15 secs expired, recv timeout 10, last rx 332854353, last 
ping 332855353, now 332856853
May 15 16:23:12 host kernel: [3328871.019167]  connection2:0: detected 
conn error (1011)
May 15 16:23:13 host iscsid: Kernel reported iSCSI connection 2:0 error 
(1011) state (3)
...
May 15 16:23:27 host ovs-vswitchd: 
21541|jsonrpc|WARN|unix:/var/run/openvswitch/db.sock: send error: Broken 
pipe
May 15 16:23:27 host ovs-vswitchd: 21542|timeval|WARN|40509 ms poll 
interval (0 ms user, 32590 ms system) is over 4324 times the weighted 
mean interval 2 ms (52404048 samples)
May 15 16:23:27 host ovs-vswitchd: 21543|timeval|WARN|context switches: 
0 voluntary, 879 involuntary
May 15 16:23:27 host ovs-vswitchd: 21544|coverage|INFO|Event coverage 
(epoch 52404048/entire run), hash=37fe45f6:
May 15 16:23:27 host ovs-vswitchd: 
21545|coverage|INFO|ofproto_dpif_xlate        4617 / 493556267
May 15 16:23:27 host ovs-vswitchd: 21546|coverage|INFO|facet_revalidate 
             1 /   3332257
May 15 16:23:27 host ovs-vswitchd: 21547|coverage|INFO|dpif_flow_del 
          1786 /  91737202
May 15 16:23:27 host ovs-vswitchd: 21548|coverage|INFO|flow_extract 
           150 /  99810556
May 15 16:23:27 host ovs-vswitchd: 21549|coverage|INFO|hmap_expand 
            53 / 828100019
May 15 16:23:27 host ovs-vswitchd: 
21550|coverage|INFO|mac_learning_learned         1 /   1950471
May 15 16:23:27 host ovs-vswitchd: 
21551|coverage|INFO|mac_learning_expired        33 /   1949928
May 15 16:23:27 host ovs-vswitchd: 21552|coverage|INFO|netdev_get_stats 
            15 /   7983394
May 15 16:23:27 host ovs-vswitchd: 21553|coverage|INFO|poll_fd_wait 
            10 / 577246726
May 15 16:23:27 host ovs-vswitchd: 21554|coverage|INFO|poll_zero_timeout 
            1 /    537392
May 15 16:23:27 host ovs-vswitchd: 21555|coverage|INFO|unixctl_received 
             1 /    663108
May 15 16:23:27 host ovs-vswitchd: 21556|coverage|INFO|unixctl_replied 
             1 /    663108
May 15 16:23:27 host ovs-vswitchd: 21557|coverage|INFO|util_xalloc 
         23796 / 22430418786
May 15 16:23:27 host ovs-vswitchd: 21558|coverage|INFO|netdev_ethtool 
            30 /  15967005
May 15 16:23:27 host ovs-vswitchd: 21559|coverage|INFO|netlink_received 
          3888 / 401972766
May 15 16:23:27 host ovs-vswitchd: 
21560|coverage|INFO|netlink_recv_jumbo          92 /  31719860
May 15 16:23:27 host ovs-vswitchd: 21561|coverage|INFO|netlink_sent 
          1983 / 303430371
May 15 16:23:27 host ovs-vswitchd: 
21562|coverage|INFO|bridge_reconfigure           0 /        57
May 15 16:23:27 host ovs-vswitchd: 21563|coverage|INFO|ofproto_flush 
             0 /         1
May 15 16:23:27 host ovs-vswitchd: 
21564|coverage|INFO|ofproto_recv_openflow        0 /        82
May 15 16:23:27 host ovs-vswitchd: 
21565|coverage|INFO|ofproto_update_port          0 /       175
May 15 16:23:27 host ovs-vswitchd: 21566|coverage|INFO|facet_unexpected 
             0 /      9856
May 15 16:23:27 host ovs-vswitchd: 21567|coverage|INFO|dpif_port_add 
             0 /        34
May 15 16:23:27 host ovs-vswitchd: 21568|coverage|INFO|dpif_flow_flush 
             0 /         2
May 15 16:23:27 host ovs-vswitchd: 21569|coverage|INFO|dpif_flow_put 
             0 /    302480
May 15 16:23:27 host ovs-vswitchd: 21570|coverage|INFO|dpif_execute 
             0 /         2
May 15 16:23:27 host ovs-vswitchd: 21571|coverage|INFO|dpif_purge 
             0 /         1
May 15 16:23:27 host ovs-vswitchd: 21572|coverage|INFO|hmap_pathological 
            0 /      2439
May 15 16:23:27 host ovs-vswitchd: 21573|coverage|INFO|rconn_queued 
             0 /        41
May 15 16:23:27 host ovs-vswitchd: 21574|coverage|INFO|rconn_sent 
             0 /        41
May 15 16:23:27 host ovs-vswitchd: 21575|coverage|INFO|pstream_open 
             0 /         2
May 15 16:23:27 host ovs-vswitchd: 21576|coverage|INFO|stream_open 
             0 /         1
May 15 16:23:27 host ovs-vswitchd: 21577|coverage|INFO|vconn_received 
             0 /       123
May 15 16:23:27 host ovs-vswitchd: 21578|coverage|INFO|vconn_sent 
             0 /        82
May 15 16:23:27 host ovs-vswitchd: 
21579|coverage|INFO|netdev_set_policing          0 /       601
May 15 16:23:27 host ovs-vswitchd: 
21580|coverage|INFO|netdev_get_ifindex           0 /        71
May 15 16:23:27 host ovs-vswitchd: 21581|coverage|INFO|netdev_get_hwaddr 
            0 /       155
May 15 16:23:27 host ovs-vswitchd: 21582|coverage|INFO|netdev_set_hwaddr 
            0 /         3
May 15 16:23:27 host ovs-vswitchd: 21583|coverage|INFO|nln_changed 
             0 /       312
May 15 16:23:27 host ovs-vswitchd: 21584|coverage|INFO|netlink_overflow 
             0 /       127
May 15 16:23:27 host ovs-vswitchd: 21585|coverage|INFO|38 events never hit
May 15 16:23:27 host ovs-vswitchd: 21586|poll_loop|WARN|wakeup due to 
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at 
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 
21587|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection 
dropped (Broken pipe)
May 15 16:23:27 host ovs-vswitchd: 21588|poll_loop|WARN|wakeup due to 
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at 
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21589|poll_loop|WARN|wakeup due to 
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at 
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21590|poll_loop|WARN|wakeup due to 
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at 
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21591|poll_loop|WARN|wakeup due to 
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at 
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21592|poll_loop|WARN|wakeup due to 
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at 
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21593|poll_loop|WARN|wakeup due to 
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at 
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21594|poll_loop|WARN|wakeup due to 
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at 
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21595|poll_loop|WARN|wakeup due to 
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at 
lib/dpif-linux.c:1197 (76% CPU usage)
...
[these kept coming for a while before restart]
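For readers unfamiliar with the first log line: the "no response to inactivity probe after 5.01 seconds, disconnecting" message is the JSON-RPC keepalive giving up. If no traffic arrives within the probe interval, an echo request is sent, and if that also goes unanswered within the interval, the connection is dropped. A minimal sketch of that idea (illustrative only, not OVS code; the interval is shortened from the 5 s seen in the log):

```python
import socket

PROBE_INTERVAL = 0.2  # shortened for the demo; the log shows ~5 s

def probe(sock):
    """One round of the inactivity-probe idea: idle -> echo -> give up."""
    sock.settimeout(PROBE_INTERVAL)
    try:
        sock.recv(4096)          # any incoming traffic counts as liveness
        return "alive"
    except socket.timeout:
        pass                     # peer was idle: send an echo request
    sock.sendall(b'{"method":"echo","params":[],"id":"probe"}')
    try:
        sock.recv(4096)          # expect the echo reply
        return "alive"
    except socket.timeout:
        return "disconnect"      # peer is wedged, as in the log above

a, b = socket.socketpair()       # 'b' plays a wedged peer: it never replies
print(probe(a))                  # -> disconnect
```

Given the 40509 ms poll interval ovs-vswitchd reported, the daemon was blocked far longer than any such probe interval, so ovsdb-server dropping the connection looks like a symptom of the stall, not its cause.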

Does anyone have an idea where to look further?

Regards,

Zoli
