[ovs-discuss] Fwd: OVS segfault in recirculation

Andy Zhou azhou at nicira.com
Fri Feb 13 02:10:03 UTC 2015


Does not mean to drop the list.


---------- Forwarded message ----------
From: Andy Zhou <azhou at nicira.com>
Date: Thu, Feb 12, 2015 at 3:42 PM
Subject: Re: [ovs-discuss] OVS segfault in recirculation
To: Salvatore Cambria <salvatore.cambria at citrix.com>


Hi, Salvatore,

I think I found a bug: hmap_remove needs to be protected by a lock. As
you pointed out, the output_normal() path is not properly protected.
Would you please
try the attached patch to see if it helps?

Bond object is protected by reference counting.  Assuming reference
counter is valid, it should be O.K. for a bundle to hold on to the
object. Notice
bundle_destroy() calls unref_bond().  At any rate, I did not spot
anything obvious there.

Thanks for providing such a detailed description. It is very helpful.

Andy

================================

iff --git a/ofproto/bond.c b/ofproto/bond.c
index c4b3a3a..a9b7a83 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -923,8 +923,8 @@ bond_may_recirc(const struct bond *bond, uint32_t
*recirc_id,
     }
 }

-void
-bond_update_post_recirc_rules(struct bond* bond, const bool force)
+static void
+bond_update_post_recirc_rules__(struct bond* bond, const bool force)
 {
    struct bond_entry *e;
    bool update_rules = force;  /* Always update rules if caller forces it. */
@@ -945,6 +945,14 @@ bond_update_post_recirc_rules(struct bond* bond,
const bool force)
         update_recirc_rules(bond);
    }
 }
+
+void
+bond_update_post_recirc_rules(struct bond* bond, const bool force)
+{
+    ovs_rwlock_wrlock(&rwlock);
+    bond_update_post_recirc_rules__(bond, force);
+    ovs_rwlock_unlock(&rwlock);
+}
 ^L
 /* Rebalancing. */

@@ -1203,7 +1211,7 @@ bond_rebalance(struct bond *bond)
     }

     if (use_recirc && rebalanced) {
-        bond_update_post_recirc_rules(bond,true);
+        bond_update_post_recirc_rules__(bond,true);
     }

 done:


On Tue, Feb 10, 2015 at 7:57 AM, Salvatore Cambria
<salvatore.cambria at citrix.com> wrote:
> Hello,
>
> I am a Software Engineer from Citrix, Cambridge UK office.
>
> We are having a problem when running our nightly test on a XenServer host
> with lacp bonds with OVS. Indeed, when deleting the bond the following
> segfault error appears (from GDB):
>
> Program terminated with signal 11, Segmentation fault.
>
> #0  0x0000000000411d2f in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)
>
>     at lib/hmap.h:236
>
> 236         while (*bucket != node) {
>
> (gdb) bt
>
> #0  0x0000000000411843 in hmap_remove (node=0x7fd9d402e730, hmap=0x1c3e9e8)
>
>     at lib/hmap.h:236
>
> #1  update_recirc_rules (bond=0x1c3e920) at ofproto/bond.c:382
>
> #2  0x0000000000001ff5 in ?? ()
>
> (gdb) p *hmap
>
> $1 = {buckets = 0x7fd9d4039b70, one = 0x0, mask = 255, n = 147}
>
> (gdb) p *node
>
> $2 = {hash = 2450845263, next = 0x0}
>
> (gdb) p *bucket
>
> Cannot access memory at address 0x8
>
> (gdb) p *(hmap->buckets)
>
> $3 = (struct hmap_node *) 0x0
>
> (gdb) p node->hash & hmap->mask
>
> $4 = 79
>
> (gdb) p *(&hmap->buckets[79])
>
> $5 = (struct hmap_node *) 0x0
>
> I managed to reproduce the error using a while loop in which I repeatedly
> create and destroy a lacp bond, waiting for 15 seconds after each of these
> operations.
>
> The error is not constant in timing and I think that it can be a race
> condition due to the recirculation process. It happens indeed that when
> calling hmap_remove() from update_recirc_rules() (in case of DEL), the
> bucket points to NULL. My idea is the following:
>
> hmap_remove() is called from two different places in ofproto/bond.c:
>
> 1. from update_recirc_rules() (in which it raises the segfault)
>
> hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
> *pr_op->pr_rule = NULL;
> free(pr_op);
>
> 2. and from bond_unref() (called when I try to delete the bond)
>
> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>     hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>     free(pr_op);
> }
>
> Now, if these two calls happen at the same time, a conflict on the bond
> pr_rule may happen. Looking at the code I can see that the recirculation
> hmap_remove() is executed inside an external lock
> (ovs_rwlock_wrlock(&rwlock)) if called by bond_rebalance()
>
> void bond_rebalance(struct bond *bond) {
> ...
> ovs_rwlock_wrlock(&rwlock);
> ...
> ...
> if (use_recirc && rebalanced) {
>     bond_update_post_recirc_rules(bond,true); <---- hmap_remove()
> }
>
> done:
>     ovs_rwlock_unlock(&rwlock);
> }
>
> but I can't see any lock when called by output_normal()
>
> static void output_normal(struct xlate_ctx *ctx, const struct xbundle
> *out_xbundle, uint16_t vlan) {
> ...
> if (ctx->use_recirc) {
>     ...
>     bond_update_post_recirc_rules(out_xbundle->bond, false); <----
> hmap_remove()
>     ...
> }
> ...
>
> The same lock is used in bond_unref() but only for hmap_remove(all_bonds,
> &bond->hmap_node) and not for hmap_remove(&bond->pr_rule_ops,
> &pr_op->hmap_node).
>
> ...
> ovs_rwlock_wrlock(&rwlock);
> hmap_remove(all_bonds, &bond->hmap_node);
> ovs_rwlock_unlock(&rwlock);
> ...
> HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
>     hmap_remove(&bond->pr_rule_ops, &pr_op->hmap_node);
>     free(pr_op);
> }
> ...
>
> I suppose that the goal is to remove any pointer to the bond, from
> all_bonds, inside the lock and to work on it locally. My doubt is: can the
> bond be still accessible from somewhere else (let's say from a bundle)? If
> yes, what happen if a bundle tries to access a bond which was previously
> removed from all_bonds (let's say from bundle_destroy())?
>
> The version of OVS that I am using is openvswitch-2.3.0-7.8312.x86_64.
>
> Could you please help me to find the problem?
>
>
> Thank you for the help & kind regards,
>
> Salvatore
>
>
>
> _______________________________________________
> discuss mailing list
> discuss at openvswitch.org
> http://openvswitch.org/mailman/listinfo/discuss
>



More information about the discuss mailing list