Just to give a solution for the archives, we ended up upgrading the 
kernel from the 2.6.32 EL6 kernel to the 3.10 kernel-lt elrepo kernel, 
and have not seen a recurrence of the ovs lockup.
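
For anyone who hits the same lockup and wants to collect data before changing kernels, here is a minimal sketch (not something we ran at the time) that reports the kernel's state letter for the OVS daemons by reading /proc/<pid>/stat. The function names and the default process names are only illustrative assumptions. Note that the "<" and "L" suffixes printed by ps are derived from priority and locked memory, so this field shows only the bare state letter (R, S, D, ...).

```python
import os
import sys


def parse_stat_state(stat_line):
    """Return (comm, state) parsed from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    left, _, right = stat_line.rpartition(")")
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def find_processes(names):
    """Yield (pid, comm, state) for every process whose comm is in names."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/stat" % entry) as f:
                comm, state = parse_stat_state(f.read())
        except (IOError, OSError):
            continue  # process exited while we were scanning
        if comm in names:
            yield int(entry), comm, state


if __name__ == "__main__":
    # Default names are an assumption; pass others on the command line.
    targets = set(sys.argv[1:]) or set(["ovsdb-server", "ovs-vswitchd"])
    for pid, comm, state in find_processes(targets):
        print("%d %s state=%s" % (pid, comm, state))
```

Run from cron or a loop and logged with a timestamp, something like this could show roughly when a daemon stops leaving state R, which may help line the hang up with load on the host.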


On 02/28/2014 03:00 PM, Jeff Bachtel wrote:
> Does anyone have any insight into this? For further datapoints, I 
> built the 2.0 release and much more current openvswitch snapshots 
> (most recently to commit bdeadfdd) which exhibited the same problems. 
> The CentOS 6 kernel is 2.6.32. Because of presumed incompatibility 
> with the Linux bridge module, I made sure bridge.o wasn't being 
> loaded. On a host where ovsdb-server had not yet become unresponsive, 
> ovs-vswitchd was unkillable, in state R<L. Could my problem be related 
> to vswitchd becoming unresponsive under load, taking ovsdb-server with 
> it?
> I've received further confirmation that this is involved in some way 
> with load, as a node inadvertently disconnected from the rest of the 
> Ceph cluster had a record uptime with openvswitch. If anyone can give 
> me pointers on getting a backtrace I'm happy to run things until 
> failure and get better data. I've had trouble with this at least as 
> far as using strace is concerned. As it is, I've cron'd a restart of 
> openvswitch every minute - obviously a far from ideal situation.
> Thanks for any help,
> Jeff
> On 02/20/2014 12:54 AM, Jeff Bachtel wrote:
>> I'm running OpenVSwitch 1.11 from the RDO Havana repository. In 
>> addition, I'm running OpenStack Havana, Neutron, and Ceph Emperor, 
>> all on some CentOS 6.5 machines.
>> After installing Bacula on the previous openstack version (grizzly), 
>> I noticed the networking had become somewhat load sensitive. 
>> ovsdb-server was freezing - not responding to queries on its unix 
>> socket and becoming unkillable in process state R< . Believing that 
>> it was probably due to being behind in ovs version, I pushed ahead 
>> with an upgrade, only to find my stability problems become much, much 
>> worse. Every 20-30 minutes I can count on an ovsdb-server process 
>> freezing.
>> At 
>> https://drive.google.com/folderview?id=0B-wx2_T_hW-_OXZJWGJNc0l0MzQ&usp=sharing 
>> please find a folder with shared copies of diagnostic files from a 
>> machine with a hung ovsdb-server. There is a process list (.ps; 
>> apologies, I forgot that is the PostScript extension until the upload 
>> was done), strace, dmesg, and /var/log/messages.
>> The strace didn't reveal anything suspicious to me. To mitigate I 
>> tried lowering log verbosity, completely recreating conf.db, as well 
>> as frequent compacting (every minute) and putting the db on a 
>> ramdisk. None of these worked as a solution.
>> The ovsdb-server processes most likely to succumb to locking run on 
>> ceph hosts running osd - meaning they can see a lot of network 
>> traffic, as well as disk i/o.
>> I don't understand what a simple database RPC server could be doing 
>> that would cause it to become unkillable, especially with the attempt 
>> at minimizing disk i/o by putting the db file on a ramdisk.
>> I hope someone has some ideas of what I might do to test or mitigate 
>> the situation. Not running ceph osd on the hosts is, unfortunately, 
>> not a solution I can use.
>> Thanks,
>> Jeff

