[ovs-discuss] Thread Safety in rconn

Dirk van der Merwe dirk.vandermerwe at netronome.com
Thu Aug 29 14:35:05 UTC 2013


Hi Ben

I am seeing both threads touching the rconn. The stack traces for the two
threads are below.

Main Thread Stack Trace:
--------------------------------------

#0  0x00000036faee983c in send () from /lib64/libc.so.6
No symbol table info available.
#1  0x00000036faee4e60 in __vsyslog_chk () from /lib64/libc.so.6
No symbol table info available.
#2  0x00000036faee5210 in syslog () from /lib64/libc.so.6
No symbol table info available.
#3  0x00000000004ca21c in vlog_valist (module=0x7d1940, level=VLL_ERR,
    message=0x53e3f0 "%s: no response to inactivity probe after %u seconds, disconnecting",
    args=0x7fff250b9fb0) at lib/vlog.c:815
        syslog_level = 3
        save_ptr = 0x219c28b ""
        line = 0x219c220 "ovs|00390|rconn|ERR|NFE<->tcp:xxx.xxx.xxx.xxx: no response to inactivity probe after 24 seconds, disconnecting"
        save_errno = 11
        s = {string = 0x219c220 "ovs|00390|rconn|ERR|NFE<->tcp:xxx.xxx.xxx.xxx: no response to inactivity probe after 24 seconds, disconnecting", length = 107, allocated = 1024}
        log_to_console = false
        log_to_syslog = true
        log_to_file = true
#4  0x00000000004ca3c6 in vlog (module=0x7d1940, level=VLL_ERR,
    message=0x53e3f0 "%s: no response to inactivity probe after %u seconds, disconnecting")
    at lib/vlog.c:846
        args = {{gp_offset = 24, fp_offset = 48, overflow_arg_area = 0x7fff250ba0a0, reg_save_area = 0x7fff250b9fd0}}
#5  0x00000000004b43ba in run_IDLE (rc=0x21675b0) at lib/rconn.c:499
        level__ = VLL_ERR
#6  0x00000000004b44ff in rconn_run (rc=0x21675b0) at lib/rconn.c:541
        old_state = 16
        i = 0
#7  0x000000000043bdc2 in ofconn_run (ofconn=0x2167290,
    handle_openflow=0x41afad <handle_openflow>) at ofproto/connmgr.c:1248
        mgr = 0x213a100
        i = 2
#8  0x0000000000439bfb in connmgr_run (mgr=0x213a100,
    handle_openflow=0x41afad <handle_openflow>) at ofproto/connmgr.c:287
        ofconn = 0x2167290
        next_ofconn = 0x213a138
        ofservice = 0x4bd452
        i = 16
#9  0x00000000004138c1 in ofproto_run (p=0x213b950) at ofproto/ofproto.c:1254
        changed_netdevs = {map = {buckets = 0x7fff250ba228, one = 0x0, mask = 0, n = 0}}
        changed_netdev = 0x0
        ofport = 0x0
        error = 11
#10 0x000000000040a2a6 in bridge_run () at vswitchd/bridge.c:2367
        null_cfg = {header_ = {hmap_node = {hash = 0, next = 0x0}, uuid = {parts = {0, 0, 0, 0}}, src_arcs = {prev = 0x0, next = 0x0}, dst_arcs = {prev = 0x0, next = 0x0}, table = 0x0, old = 0x0, new = 0x0, prereqs = 0x0, written = 0x0, txn_node = {hash = 0, next = 0x0}}, bridges = 0x0, n_bridges = 0, cur_cfg = 0, db_version = 0x0, external_ids = {map = {buckets = 0x80b660, one = 0x0, mask = 0, n = 0}}, manager_options = 0x0, n_manager_options = 0, next_cfg = 0, other_config = {map = {buckets = 0x80b698, one = 0x0, mask = 0, n = 0}}, ovs_version = 0x0, ssl = 0x0, statistics = {map = {buckets = 0x80b6c8, one = 0x0, mask = 0, n = 0}}, system_type = 0x0, system_version = 0x0}
        cfg = 0x2128f70
        reconf_txn = 0x0
        types = {map = {buckets = 0x2178640, one = 0x0, mask = 3, n = 0}}
        type = 0x0
        vlan_splinters_changed = false
        br = 0x213b470
#11 0x000000000040f427 in main (argc=4, argv=0x7fff250ba4a8) at vswitchd/ovs-vswitchd.c:118
        unixctl_path = 0x0
        unixctl = 0x211cd10
        sighup = 0x80c268
        remote = 0x211c1c0 "\240\240\026\002"
        exiting = false
        retval = 0


Additional Thread Stack Trace:
---------------------------------------------

#0  rconn_packet_counter_dec (c=0x2167700, n_bytes=164) at lib/rconn.c:914
        level__ = VLL_OFF
        tid = 3996
        __func__ = "rconn_packet_counter_dec"
#1  0x00000000004b4fcd in try_send (rc=0x21675b0) at lib/rconn.c:988
        msg = 0x7f181c02f780
        n_bytes = 164
        counter = 0x2167700
        retval = 0
#2  0x00000000004b4800 in rconn_send (rc=0x21675b0, b=0x7f181c02f780,
    counter=0x2167700) at lib/rconn.c:646
No locals.
#3  0x00000000004b4855 in rconn_send_with_limit (rc=0x21675b0, b=0x7f181c02f780,
    counter=0x2167700, queue_limit=100) at lib/rconn.c:672
No locals.
#4  0x000000000043c41a in do_send_packet_in (ofp_packet_in=0x7f181c02f780,
    ofconn_=0x2167290) at ofproto/connmgr.c:1448
        ofconn = 0x2167290
#5  0x0000000000440403 in pinsched_send (ps=0x0, port_no=23, packet=0x7f181c02f780,
    cb=0x43c3dd <do_send_packet_in>, aux=0x2167290) at ofproto/pinsched.c:192
No locals.
#6  0x000000000043c54a in schedule_packet_in (ofconn=0x2167290, pin=...)
    at ofproto/connmgr.c:1487
        mgr = 0x213a100
#7  0x000000000043c3ba in connmgr_send_packet_in (mgr=0x213a100, pin=0x7f1822424d50)
    at ofproto/connmgr.c:1437
        ofconn = 0x2167290
#8  0x000000000041daec in ofproto_dpif_send_packet_in (ofproto=0x213b940,
    pin=0x7f1822424d50) at ofproto/ofproto-sdn.c:578
No locals.

The second thread shown above is the one that generated the packet_in
message, but in this case it's also the one trying to send it on.
In other cases, the main thread has tried to send the packet.

Regards
Dirk vd Merwe



On Wed, Aug 28, 2013 at 10:06 PM, Ben Pfaff <blp at nicira.com> wrote:

> On Wed, Aug 28, 2013 at 01:25:01PM +0200, Dirk van der Merwe wrote:
> > We have implemented another datapath protocol provider (replaced
> > ofproto-dpif module) specifically interfacing with our hardware, and then
> > slotting into OVS.
> >
> > When I send a couple of thousand packets (without any flow configured),
> > thus generating packet_in messages to the controller, at some point I get
> > an assertion error in 'rconn_packet_counter_dec' function within the
> > 'rconn' source file. Note that the rconn queue needs to get backed up for
> > this to happen.
> >
> > The assertion failure hits one of these two statements each time:
> >     ovs_assert(c->n_packets > 0);
> >     ovs_assert(c->n_bytes >= n_bytes);
> >
> > I have seen that there are two threads operating on the rconn queue, the
> > one thread doing the maintenance of the rconn (when the queue starts to
> > back up) and the other when a packet_in is actually generated. And it
> seems
> > that there is a race condition occurring here.
>
> If two threads are actually touching the rconn directly, that's a bug.
> There's supposed to be only a single thread (the "main" thread) touching
> the rconn queue.  The threads that generate packet_ins are supposed to
> queue packets to that main thread using ofproto_dpif_send_packet_in().
>
> Do you see where the unexpected access to the rconn is happening?
>
> Thanks,
>
> Ben.
>