[ovs-dev] Fwd: [ovs-discuss] OVS segfault in recirculation

Andy Zhou azhou at nicira.com
Fri Feb 13 17:18:16 UTC 2015


Great! Looking forward to the test results.

On Fri, Feb 13, 2015 at 3:56 AM, Salvatore Cambria
<salvatore.cambria at citrix.com> wrote:
> Hi Andy,
>
> Yes, it makes a lot of sense!
>
> I applied the patch and I'm running my script again to test it. It looks
> good so far!
> I will leave the script running over the weekend and will get back to you
> with my results on Monday.
>
> Thank you for your very fast response and for your effort,
>
> Salvatore
>
> On 13/02/15 02:10, Andy Zhou wrote:
>>
>> Did not mean to drop the list.
>>
>>
>> ---------- Forwarded message ----------
>> From: Andy Zhou <azhou at nicira.com>
>> Date: Thu, Feb 12, 2015 at 3:42 PM
>> Subject: Re: [ovs-discuss] OVS segfault in recirculation
>> To: Salvatore Cambria <salvatore.cambria at citrix.com>
>>
>>
>> Hi, Salvatore,
>>
>> I think I found a bug: hmap_remove needs to be protected by a lock. As
>> you pointed out, the output_normal() path is not properly protected.
>> Would you please
>> try the attached patch to see if it helps?
>>
>> The bond object is protected by reference counting.  Assuming the
>> reference count is valid, it should be O.K. for a bundle to hold on to the
>> object. Notice that bundle_destroy() calls bond_unref().  At any rate, I
>> did not spot anything obvious there.
>>
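>> (To make the reference-counting point concrete, the pattern is roughly the
>> following.  This is a simplified sketch, not the exact code in
>> ofproto/bond.c, which keeps the count in a struct ovs_refcount and
>> manipulates it atomically.)
>>
>> #include <stdlib.h>
>>
>> struct bond {
>>     int ref_cnt;              /* Simplified; OVS uses struct ovs_refcount. */
>>     /* ... hmap of pr_rule_ops, list of slaves, etc. ... */
>> };
>>
>> /* Takes an extra reference, e.g. when a bundle starts pointing at the bond. */
>> struct bond *
>> bond_ref(struct bond *bond)
>> {
>>     if (bond) {
>>         bond->ref_cnt++;      /* Atomic in the real implementation. */
>>     }
>>     return bond;
>> }
>>
>> /* Drops a reference; the bond is torn down only when the last holder lets go. */
>> void
>> bond_unref(struct bond *bond)
>> {
>>     if (bond && --bond->ref_cnt == 0) {
>>         /* Last reference dropped: only now is it safe to drain
>>          * bond->pr_rule_ops and free the bond, since no bundle can
>>          * still reach it. */
>>         free(bond);
>>     }
>> }
>>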
>> Thanks for providing such a detailed description. It is very helpful.
>>
>> Andy
>>
>> ================================
>>
>> diff --git a/ofproto/bond.c b/ofproto/bond.c
>> index c4b3a3a..a9b7a83 100644
>> --- a/ofproto/bond.c
>> +++ b/ofproto/bond.c
>> @@ -923,8 +923,8 @@ bond_may_recirc(const struct bond *bond, uint32_t
>> *recirc_id,
>>       }
>>   }
>>
>> -void
>> -bond_update_post_recirc_rules(struct bond* bond, const bool force)
>> +static void
>> +bond_update_post_recirc_rules__(struct bond* bond, const bool force)
>>   {
>>      struct bond_entry *e;
>>      bool update_rules = force;  /* Always update rules if caller forces
>> it. */
>> @@ -945,6 +945,14 @@ bond_update_post_recirc_rules(struct bond* bond,
>> const bool force)
>>           update_recirc_rules(bond);
>>      }
>>   }
>> +
>> +void
>> +bond_update_post_recirc_rules(struct bond* bond, const bool force)
>> +{
>> +    ovs_rwlock_wrlock(&rwlock);
>> +    bond_update_post_recirc_rules__(bond, force);
>> +    ovs_rwlock_unlock(&rwlock);
>> +}
>>   ^L
>>   /* Rebalancing. */
>>
>> @@ -1203,7 +1211,7 @@ bond_rebalance(struct bond *bond)
>>       }
>>
>>       if (use_recirc && rebalanced) {
>> -        bond_update_post_recirc_rules(bond,true);
>> +        bond_update_post_recirc_rules__(bond,true);
>>       }
>>
>>   done:
>>
>>
>> On Tue, Feb 10, 2015 at 7:57 AM, Salvatore Cambria
>> <salvatore.cambria at citrix.com> wrote:
>>>
>>> Hello,
>>>
>>> I am a software engineer in Citrix's Cambridge, UK office.
>>>
>>> We are having a problem when running our nightly tests on a XenServer host
>>> with LACP bonds on OVS. When deleting the bond, the following segfault
>>> appears (from GDB):
>>>
>>> Program terminated with signal 11, Segmentation fault.
>>>
>>> #0  0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730,
>>> hmap=0x1c3e9e8)
>>>
>>>      at lib/hmap.h:236
>>>
>>> 236         while (*bucket != node) {
>>>
>>> (gdb) bt
>>>
>>> #0  0x0000000000411843 in hmap_remove (node=0x7fd9d402e730,
>>> hmap=0x1c3e9e8)
>>>
>>>      at lib/hmap.h:236
>>>
>>> #1  update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382
>>>
>>> #2  0x0000000000001ff5 in ?? ()
>>>
>>> (gdb) p *hmap
>>>
>>> $1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147}
>>>
>>> (gdb) p *node
>>>
>>> $2 = {hash = 2450845263, next = 0x0}
>>>
>>> (gdb) p *bucket
>>>
>>> Cannot access memory at address 0x8
>>>
>>> (gdb) p *(hmap->buckets)
>>>
>>> $3 = (struct hmap_node *) 0x0
>>>
>>> (gdb) p node->hash & hmap->mask
>>>
>>> $4 = 79
>>>
>>> (gdb) p *(&hmap->buckets[79])
>>>
>>> $5 = (struct hmap_node *) 0x0
>>>
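>>> (For reference, hmap_remove() in lib/hmap.h looks roughly like this; I am
>>> paraphrasing from the sources, so treat it as a sketch rather than the
>>> exact code:)
>>>
>>> static inline void
>>> hmap_remove(struct hmap *hmap, struct hmap_node *node)
>>> {
>>>     struct hmap_node **bucket = &hmap->buckets[node->hash & hmap->mask];
>>>
>>>     /* Walk the singly linked bucket chain until 'node' is found.  If a
>>>      * concurrent remover has already emptied the bucket, the first
>>>      * *bucket is NULL, the loop then steps to
>>>      * &((struct hmap_node *) NULL)->next (i.e. address 0x8 on x86_64),
>>>      * and the next read of *bucket faults, which matches the
>>>      * "Cannot access memory at address 0x8" above. */
>>>     while (*bucket != node) {          /* lib/hmap.h:236 in the backtrace. */
>>>         bucket = &(*bucket)->next;
>>>     }
>>>     *bucket = node->next;
>>>     hmap->n--;
>>> }
>>>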
>>> I managed to reproduce the error using a while loop in which I repeatedly
>>> create and destroy an LACP bond, waiting 15 seconds after each of these
>>> operations.
>>>
>>> The timing of the error is not constant, and I think it may be a race
>>> condition related to the recirculation process. Indeed, when hmap_remove()
>>> is called from update_recirc_rules() (in the DEL case), the bucket points
>>> to NULL. My idea is the following:
>>>
>>> hmap_remove() is called from two different places in ofproto/bond.c:
>>>
>>> 1. from update_recirc_rules() (in which it raises the segfault)
>>>
>>> hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>> *pr_op->pr_rule = NULL;
>>> free(pr_op);
>>>
>>> 2. and from bond_unref() (called when I try to delete the bond)
>>>
>>> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>>>      hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>      free(pr_op);
>>> }
>>>
>>> Now, if these two calls happen at the same time, they may conflict on the
>>> bond's pr_rule_ops hmap. Looking at the code, I can see that the
>>> recirculation hmap_remove() is executed under an external lock
>>> (ovs_rwlock_wrlock(&rwlock)) when called from bond_rebalance():
>>>
>>> void bond_rebalance(struct bond *bond) {
>>> ...
>>> ovs_rwlock_wrlock(&rwlock);
>>> ...
>>> ...
>>> if (use_recirc && rebalanced) {
>>>      bond_update_post_recirc_rules(bond,true); <---- hmap_remove()
>>> }
>>>
>>> done:
>>>      ovs_rwlock_unlock(&rwlock);
>>> }
>>>
>>> but I can't see any lock when it is called from output_normal():
>>>
>>> static void output_normal(struct xlate_ctx *ctx, const struct xbundle
>>> *out_xbundle, uint16_t vlan) {
>>> ...
>>> if (ctx->use_recirc) {
>>>      ...
>>>      bond_update_post_recirc_rules(out_xbundle->bond, false); <----
>>> hmap_remove()
>>>      ...
>>> }
>>> ...
>>>
>>> The same lock is used in bond_unref() but only for hmap_remove(all_bonds,
>>> &bond->hmap_node) and not for hmap_remove(&bond->pr_rule_ops,
>>> &pr_op->hmap_node).
>>>
>>> ...
>>> ovs_rwlock_wrlock(&rwlock);
>>> hmap_remove(all_bonds, &bond->hmap_node);
>>> ovs_rwlock_unlock(&rwlock);
>>> ...
>>> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>>>      hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>>>      free(pr_op);
>>> }
>>> ...
>>>
>>> I suppose that the goal is to remove any pointer to the bond from
>>> all_bonds inside the lock and then to work on it locally. My doubt is: can
>>> the bond still be accessible from somewhere else (say, from a bundle)? If
>>> so, what happens if a bundle tries to access a bond which was previously
>>> removed from all_bonds (say, from bundle_destroy())?
>>>
>>> The version of OVS that I am using is openvswitch-2.3.0-7.8312.x86_64.
>>>
>>> Could you please help me find the problem?
>>>
>>>
>>> Thank you for the help & kind regards,
>>>
>>> Salvatore
>>>
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss at openvswitch.org
>>> http://openvswitch.org/mailman/listinfo/discuss
>>>
>


