[ovs-discuss] Fwd: OVS segfault in recirculation

Mon Feb 16 10:23:07 UTC 2015

Hi Andy,

I can tell you that this morning the test is still running without any 
segfault. The patch seems to fix the problem!

Thanks for your help,

Salvatore

On 13/02/15 17:18, Andy Zhou wrote:
> Great! looking forward to the test results.
>
> On Fri, Feb 13, 2015 at 3:56 AM, Salvatore Cambria
> <salvatore.cambria at citrix.com> wrote:
>> Hi Andy,
>>
>> Yes, it makes a lot of sense!
>>
>> I applied the patch and I'm running my script again in order to test it. It
>> looks good so far!
>> I will leave the script running all across the weekend and I'll come back to
>> you with my results on Monday.
>>
>> Thank you for your very fast response and for your effort,
>>
>> Salvatore
>>
>> On 13/02/15 02:10, Andy Zhou wrote:
>>> Does not mean to drop the list.
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Andy Zhou <azhou at nicira.com>
>>> Date: Thu, Feb 12, 2015 at 3:42 PM
>>> Subject: Re: [ovs-discuss] OVS segfault in recirculation
>>> To: Salvatore Cambria <salvatore.cambria at citrix.com>
>>>
>>>
>>> Hi, Salvatore,
>>>
>>> I think I found a bug: hmap_remove needs to be protected by a lock. As
>>> you pointed out, the output_normal() path is not properly protected.
>>> Would you please
>>> try the attached patch to see if it helps?
>>>
>>> Bond object is protected by reference counting.  Assuming reference
>>> counter is valid, it should be O.K. for a bundle to hold on to the
>>> object. Notice
>>> bundle_destroy() calls unref_bond().  At any rate, I did not spot
>>> anything obvious there.
>>>
>>> Thanks for providing such a detailed description. It is very helpful.
>>>
>>> Andy
>>>
>>> ================================
>>>
>>> iff --git a/ofproto/bond.c b/ofproto/bond.c
>>> index c4b3a3a..a9b7a83 100644
>>> --- a/ofproto/bond.c
>>> +++ b/ofproto/bond.c
>>> @@ -923,8 +923,8 @@ bond_may_recirc(const struct bond *bond, uint32_t
>>> *recirc_id,
>>>        }
>>>    }
>>>
>>> -void
>>> -bond_update_post_recirc_rules(struct bond* bond, const bool force)
>>> +static void
>>> +bond_update_post_recirc_rules__(struct bond* bond, const bool force)
>>>    {
>>>       struct bond_entry *e;
>>>       bool update_rules = force;  /* Always update rules if caller forces
>>> it. */
>>> @@ -945,6 +945,14 @@ bond_update_post_recirc_rules(struct bond* bond,
>>> const bool force)
>>>            update_recirc_rules(bond);
>>>       }
>>>    }
>>> +
>>> +void
>>> +bond_update_post_recirc_rules(struct bond* bond, const bool force)
>>> +{
>>> +    ovs_rwlock_wrlock(&rwlock);
>>> +    bond_update_post_recirc_rules__(bond, force);
>>> +    ovs_rwlock_unlock(&rwlock);
>>> +}
>>>    ^L
>>>    /* Rebalancing. */
>>>
>>> @@ -1203,7 +1211,7 @@ bond_rebalance(struct bond *bond)
>>>        }
>>>
>>>        if (use_recirc && rebalanced) {
>>> -        bond_update_post_recirc_rules(bond,true);
>>> +        bond_update_post_recirc_rules__(bond,true);
>>>        }
>>>
>>>    done:
>>>
>>>
>>> On Tue, Feb 10, 2015 at 7:57 AM, Salvatore Cambria
>>> <salvatore.cambria at citrix.com> wrote:
>>>> Hello,
>>>>
>>>> I am a Software Engineer from Citrix, Cambridge UK office.
>>>>
>>>> We are having a problem when running our nightly test on a XenServer host
>>>> with lacp bonds with OVS. Indeed, when deleting the bond the following
>>>> segfault error appears (from GDB):
>>>>
>>>> Program terminated with signal 11, Segmentation fault.
>>>>
>>>> #0  0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730,
>>>> hmap=0x1c3e9e8)
>>>>
>>>>       at lib/hmap.h:236
>>>>
>>>> 236         while (*bucket != node) {
>>>>
>>>> (gdb) bt
>>>>
>>>> #0  0x0000000000411843 in hmap_remove (node=0x7fd9d402e730,
>>>> hmap=0x1c3e9e8)
>>>>
>>>>       at lib/hmap.h:236
>>>>
>>>> #1  update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382
>>>>
>>>> #2  0x0000000000001ff5 in ?? ()
>>>>
>>>> (gdb) p *hmap
>>>>
>>>> $1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147}
>>>>
>>>> (gdb) p *node
>>>>
>>>> $2 = {hash = 2450845263, next = 0x0}
>>>>
>>>> (gdb) p *bucket
>>>>
>>>> Cannot access memory at address 0x8
>>>>
>>>> (gdb) p *(hmap->buckets)
>>>>
>>>> $3 = (struct hmap_node *) 0x0
>>>>
>>>> (gdb) p node->hash & hmap->mask
>>>>
>>>> $4 = 79
>>>>
>>>> (gdb) p *(&hmap->buckets[79])
>>>>
>>>> $5 = (struct hmap_node *) 0x0
>>>>
>>>> I managed to reproduce the error using a while loop in which I repeatedly
>>>> create and destroy a lacp bond, waiting for 15 seconds after each of
>>>> these
>>>> operations.
>>>>
>>>> The error is not constant in timing and I think that it can be a race
>>>> condition due to the recirculation process. It happens indeed that when
>>>> calling hmap_remove() from update_recirc_rules() (in case of DEL), the
>>>> bucket points to NULL. My idea is the following:
>>>>
>>>> hmap_remove() is called from two different places in ofproto/bond.c:
>>>>
>>>> 1. from update_recirc_rules() (in which it raises the segfault)
>>>>
>>>> hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>> *pr_op->pr_rule = NULL;
>>>> free(pr_op);
>>>>
>>>> 2. and from bond_unref() (called when I try to delete the bond)
>>>>
>>>> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>>>>       hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>>       free(pr_op);
>>>> }
>>>>
>>>> Now, if these two calls happen at the same time, a conflict on the bond
>>>> pr_rule may happen. Looking at the code I can see that the recirculation
>>>> hmap_remove() is executed inside an external lock
>>>> (ovs_rwlock_wrlock(&rwlock)) if called by bond_rebalance()
>>>>
>>>> void bond_rebalance(struct bond *bond) {
>>>> ...
>>>> ovs_rwlock_wrlock(&rwlock);
>>>> ...
>>>> ...
>>>> if (use_recirc && rebalanced) {
>>>>       bond_update_post_recirc_rules(bond,true); <---- hmap_remove()
>>>> }
>>>>
>>>> done:
>>>>       ovs_rwlock_unlock(&rwlock);
>>>> }
>>>>
>>>> but I can't see any lock when called by output_normal()
>>>>
>>>> static void output_normal(struct xlate_ctx *ctx, const struct xbundle
>>>> *out_xbundle, uint16_t vlan) {
>>>> ...
>>>> if (ctx->use_recirc) {
>>>>       ...
>>>>       bond_update_post_recirc_rules(out_xbundle->bond, false); <----
>>>> hmap_remove()
>>>>       ...
>>>> }
>>>> ...
>>>>
>>>> The same lock is used in bond_unref() but only for hmap_remove(all_bonds,
>>>> &bond->hmap_node) and not for hmap_remove(&bond->pr_rule_ops,
>>>> &pr_op->hmap_node).
>>>>
>>>> ...
>>>> ovs_rwlock_wrlock(&rwlock);
>>>> hmap_remove(all_bonds, &bond->hmap_node);
>>>> ovs_rwlock_unlock(&rwlock);
>>>> ...
>>>> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>>>>       hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>>       free(pr_op);
>>>> }
>>>> ...
>>>>
>>>> I suppose that the goal is to remove any pointer to the bond, from
>>>> all_bonds, inside the lock and to work on it locally. My doubt is: can
>>>> the
>>>> bond be still accessible from somewhere else (let's say from a bundle)?
>>>> If
>>>> yes, what happen if a bundle tries to access a bond which was previously
>>>> removed from all_bonds (let's say from bundle_destroy())?
>>>>
>>>> The version of OVS that I am using is openvswitch-2.3.0-7.8312.x86_64.
>>>>
>>>> Could you please help me to find the problem?
>>>>
>>>>
>>>> Thank you for the help & kind regards,
>>>>
>>>> Salvatore
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> discuss mailing list
>>>> discuss at openvswitch.org
>>>> http://openvswitch.org/mailman/listinfo/discuss
>>>>