[ovs-discuss] ovsdb-server unkillable, need some help
jbachtel at bericotechnologies.com
Mon Mar 3 20:56:33 UTC 2014
Just to give a solution for the archives, we ended up upgrading the
kernel from the 2.6.32 EL6 kernel to the 3.10 kernel-lt elrepo kernel,
and have not seen a recurrence of the ovs lockup.
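
For anyone landing here from a search: the symptom described below is a
daemon stuck in process state R that ignores signals. A rough way to watch
for it is to read the state letter straight from /proc. This is only a
sketch, assuming Linux procfs; the helper names are hypothetical and are
not part of Open vSwitch.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen].strip())
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def find_by_name(name, proc_root="/proc"):
    """Yield (pid, state) for every process whose comm equals name."""
    if not os.path.isdir(proc_root):
        return  # no procfs here (not Linux), nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # process exited while we were scanning
        if comm == name:
            yield pid, state


if __name__ == "__main__":
    for daemon in ("ovsdb-server", "ovs-vswitchd"):
        for pid, state in find_by_name(daemon):
            print(daemon, pid, state)
```

Note that /proc/<pid>/stat carries only the base state letter (R, S, D,
...). The extra "<" and "L" characters in the ps output quoted below are
flags that ps adds for high priority and locked pages.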
On 02/28/2014 03:00 PM, Jeff Bachtel wrote:
> Does anyone have any insight into this? For further datapoints, I
> built the 2.0 release and much more current openvswitch snapshots
> (most recently to commit bdeadfdd) which exhibited the same problems.
> The CentOS 6 kernel is 2.6.32. Because of presumed incompatibility
> with the Linux bridge module, I made sure bridge.o wasn't being
> loaded. On a host where ovsdb-server had not yet become unresponsive,
> ovs-vswitchd was unkillable, in state R<L. Could my problem be related
> to vswitchd becoming unresponsive under load, taking ovsdb-server with
> it?
>
> I've received further confirmation that this is involved in some way
> with load, as a node inadvertently disconnected from the rest of the
> Ceph cluster had a record uptime with openvswitch. If anyone can give
> me pointers on getting a backtrace I'm happy to run things until
> failure and get better data. I've had trouble with this at least as
> far as using strace is concerned. As it is, I've cron'd a restart of
> openvswitch every minute - obviously far from an ideal situation.
> Thanks for any help,
> On 02/20/2014 12:54 AM, Jeff Bachtel wrote:
>> I'm running Open vSwitch 1.11 from the RDO Havana repository. In
>> addition, I'm running OpenStack Havana, Neutron, and Ceph Emperor,
>> all on some CentOS 6.5 machines.
>> After installing Bacula on the previous OpenStack version (Grizzly),
>> I noticed the networking had become somewhat load sensitive.
>> ovsdb-server was freezing - not responding to queries on its unix
>> socket and becoming unkillable in process state R<. Believing that
>> it was probably due to being behind in ovs version, I pushed ahead
>> with an upgrade only to find my stability problems become much much
>> worse. Every 20-30 minutes I can count on an ovsdb-server process
>> please find a folder with shared copies of diagnostic files from a
>> machine with a hung ovsdb-server. There is a process list (.ps;
>> apologies, I forgot that extension means PostScript until the upload
>> was done), strace, dmesg, and /var/log/messages.
>> The strace didn't reveal anything suspicious to me. To mitigate, I
>> tried lowering log verbosity, completely recreating conf.db,
>> compacting frequently (every minute), and putting the db on a
>> ramdisk. None of these fixed the problem.
>> The ovsdb-server processes most likely to succumb to locking run on
>> ceph hosts running osd - meaning they can see a lot of network
>> traffic, as well as disk i/o.
>> I don't understand what a simple database RPC server could be doing
>> that would cause it to become unkillable, especially with the attempt
>> at minimizing disk i/o by putting the db file on a ramdisk.
>> I hope someone has some ideas of what I might do to test or mitigate
>> the situation. Not running ceph osd on the hosts is, unfortunately,
>> not a solution I can use.
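
The original report describes ovsdb-server "not responding to queries on
its unix socket". One way to detect that condition from a watchdog is to
send the JSON-RPC "echo" request defined by the OVSDB protocol (RFC 7047)
and wait for any reply within a timeout. This is a minimal sketch, not a
tested monitoring tool: the function names are hypothetical, and the
socket path in the example is an assumption about a typical install.

```python
import json
import socket


def echo_request(request_id="probe"):
    """Build an OVSDB JSON-RPC "echo" request as bytes."""
    return json.dumps(
        {"method": "echo", "params": [], "id": request_id}
    ).encode()


def is_responsive(sock_path, timeout=5.0):
    """Return True if anything answers an echo on sock_path within timeout.

    Any reply bytes count as alive; the reply is not parsed or validated.
    """
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(sock_path)
            s.sendall(echo_request())
            return len(s.recv(4096)) > 0
    except OSError:
        # Covers a missing socket, a refused connection, and a timeout.
        return False


if __name__ == "__main__":
    # Assumed default path; adjust for your distribution.
    print(is_responsive("/var/run/openvswitch/db.sock"))
```

A hung server may still accept the connection, since the kernel queues it,
so the timeout on the reply is what actually catches the hang.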