[ovs-discuss] Restarting network kills ovs-vswitchd (and network)... ?

SCHAER Frederic frederic.schaer at cea.fr
Fri May 17 09:45:36 UTC 2019


Hi

Thank you for your answer.
I actually forgot to say that I had already checked the syslogs and the OVS and network journals/logs... no core dump reference anywhere.

To me, a core dump or a crash would not return an exit code of 0, which seems to be what systemd saw :/
I even ran strace -f on the ovs-vswitchd process and made it stop/crash with an ifdown/ifup, but looking at the trace, this seems to be a deliberate abort rather than a clean exit...

(I can retry and save the strace output if necessary or useful.)
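As an aside, here is a quick way to sanity-check whether a core could even have been written on this host (a sketch assuming a RHEL 7-style box; the systemd-coredump/abrt parts are assumptions and may not apply everywhere):

```shell
# Sanity-check whether core dumps can land anywhere at all.
ulimit -c                            # per-process core size limit ("0" means no cores)
cat /proc/sys/kernel/core_pattern    # where the kernel writes cores

# If core_pattern pipes to a helper (|/usr/lib/systemd/systemd-coredump,
# abrt-hook-ccpp, ...), look in that tool's store instead of the cwd:
coredumpctl list ovs-vswitchd 2>/dev/null || true
```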

[pid 175068] sendmsg(18, {msg_name(0)=NULL, msg_iov(1)=[{",\0\0\0\22\0\1\0\223\6\0\0!\353\377\377\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\3\0brflat\0\0", 44}], msg_
controllen=0, msg_flags=0}, 0 <unfinished ...>
[pid 175233] <... futex resumed> )      = 0
[pid 175068] <... sendmsg resumed> )    = 44
[pid 175068] recvmsg(18,  <unfinished ...>
[pid 175234] futex(0x55b8aaa19128, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
...skipping...
[pid 175233] futex(0x7f7f226b9140, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 175068] <... sendmsg resumed> )    = 44
[pid 175234] <... futex resumed> )      = 0
[pid 175233] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
[pid 175068] recvmsg(18,  <unfinished ...>
[pid 175234] futex(0x7f7f226b9140, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 175233] futex(0x7f7f226b9140, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 175068] <... recvmsg resumed> {msg_name(0)=NULL, msg_iov(2)=[{"\360\4\0\0\20\0\0\0\224\6\0\0!\353\377\377\0\0\1\0\36\0\0\0C\20\1\0\0\0\0\0\v\0\3\0brflat\0\0\10\0\r\0\350\3\0\0\5\0\20\0\0\0\0\0\5\0\21\0\0\0\0\0\10\0\4\0\334\5\0\0\10\0\33\0\0\0\0\0\10\0\36\0\1\0\0\0\10\0\37\0\1\0\0\0\10\0(\0\377\377\0\0\10\0)\0\0\0\1\0\10\0 \0\1\0\0\0\5\0!\0\1\0\0\0\f\0\6\0noqueue\0\10\0#\0\0\0\0\0\5\0'\0\0\0\0\0$\0\16\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0H\234\377\377\n\0\1\0\276J\307\307\207I\0\0\n\0\2\0\377\377\377\377\377\377\0\0\304\0\27\0Y\22\5\0\0\0\0\0Uf\0\0\0\0\0\0^0j\1\0\0\0\0\372\371k\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0d\0\7\0Y\22\5\0Uf\0\0^0j\1\372\371k\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0y0\2\0\0\0\0\0\256\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\214\254\235\0\0\0\0\0 z\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\211\223\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0004\0\6\0\6\0\0\0\0\0\0\0r\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0A\7\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\7\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\5\0\10\0\0\0\0\0", 65536}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 1264
[pid 175234] <... futex resumed> )      = -1 EAGAIN (Resource temporarily unavailable)
[pid 175233] <... futex resumed> )      = 0
[pid 175234] futex(0x7f7f226b9140, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 175068] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
[pid 175234] <... futex resumed> )      = 0
[pid 175068] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid 175068] tgkill(175068, 175068, SIGABRT <unfinished ...>
[pid 175233] futex(0x7f7f226b9140, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 175068] <... tgkill resumed> )     = 0
[pid 175233] <... futex resumed> )      = 0
[pid 175068] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=175068, si_uid=393} ---
[pid 189862] +++ killed by SIGABRT +++
[pid 175237] +++ killed by SIGABRT +++
[pid 175236] +++ killed by SIGABRT +++
[pid 175235] +++ killed by SIGABRT +++
[pid 175234] +++ killed by SIGABRT +++
[pid 175233] +++ killed by SIGABRT +++
[pid 175232] +++ killed by SIGABRT +++
[pid 175231] +++ killed by SIGABRT +++
[pid 175230] +++ killed by SIGABRT +++
[pid 175229] +++ killed by SIGABRT +++
[pid 175228] +++ killed by SIGABRT +++
[pid 175227] +++ killed by SIGABRT +++
[pid 175226] +++ killed by SIGABRT +++
[pid 175225] +++ killed by SIGABRT +++
[pid 175224] +++ killed by SIGABRT +++
[pid 175223] +++ killed by SIGABRT +++
[pid 175222] +++ killed by SIGABRT +++
[pid 175085] +++ killed by SIGABRT +++
+++ killed by SIGABRT +++


Regards

> -----Message d'origine-----
> De : Flavio Leitner <fbl at sysclose.org>
> Envoyé : vendredi 17 mai 2019 10:29
> À : SCHAER Frederic <frederic.schaer at cea.fr>
> Cc : bugs at openvswitch.org
> Objet : Re: [ovs-discuss] Restarting network kills ovs-vswitchd (and
> network)... ?
> 
> On Thu, May 16, 2019 at 09:34:28AM +0000, SCHAER Frederic wrote:
> > Hi,
> > I'm facing an issue with openvswitch, which I think is new (not even sure).
> > here is the description :
> >
> > * What you did that make the problem appear.
> >
> > I am configuring OpenStack (compute, network) nodes using OVS bridges
> > for the main interfaces and RHEL network scripts: basically, I use
> > openvswitch to create the bridges, set the bridge IPs, and add the real
> > Ethernet devices to the bridges.
> > On a compute machine (not in production, so not using 3 or more
> > interfaces), I have for instance brflat -> em1.
> > brflat has multiple IPs defined using IPADDR1, IPADDR2, etc.
> > Now: at boot, the machine has network. But if I ever change anything in
> > the network scripts and issue either a network restart, an ifup or an
> > ifdown, the network breaks and connectivity is lost.
> >
> > Also, on network restarts, I'm getting these logs in the network journal:
> > May 16 10:26:41 cloud1 ovs-vsctl[1766678]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --may-exist add-br brflat
> > May 16 10:26:51 cloud1 ovs-vsctl[1766678]: ovs|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
> > May 16 10:26:51 cloud1 network[1766482]: Bringing up interface brflat:
> > 2019-05-16T08:26:51Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
> >
> > * What you expected to happen.
> >
> > On network restart... to get back a working network, and not be forced
> > to log in via the IPMI console and fix the network manually.
> >
> > * What actually happened.
> >
> > What actually happens is that on ifup/ifdown/network restart, the
> > ovs-vswitchd daemon stops working. According to systemctl, it actually
> > exits with code 0.
> > If I do an ifdown on one interface, then ovs-vswitchd goes down.
> > After restarting ovs-vswitchd, I can then ifup that interface; the
> > network is still down (no ping, nothing).
> > ovs-vswitchd is again dead/stopped/exited 0.
> > Then, manually starting ovs-vswitchd again restores connectivity.
> >
> > Please also include the following information:
> > * The Open vSwitch version number (as output by ovs-vswitchd --version).
> > ovs-vswitchd (Open vSwitch) 2.10.1
> 
> Sounds like OVS is crashing. Please check 'dmesg' to see whether there
> are segmentation fault messages in there, or check the journal logs, or
> the systemd service status.
> 
> If it is, then the next step is to enable coredumps to grab a core, then
> install the openvswitch-debuginfo package to get the stack trace.
> 
> You're right that ifdown should not put the service down.
> 
> fbl
