[ovs-dev] [PATCH] Add some developer documentation on the bonding implementation.

Ben Pfaff blp at nicira.com
Tue Sep 8 20:04:02 UTC 2009

CC: Justin Pettit <jpettit at nicira.com>
+                       ========================
+                        ovs-vswitchd Internals
+                       ========================
+This document describes some of the internals of the ovs-vswitchd
+process.  It is not complete.  It tends to be updated on demand, so if
+you have questions about the vswitchd implementation, ask them and
+perhaps we'll add some appropriate documentation here.
+
+Most of the ovs-vswitchd implementation is in vswitchd/bridge.c, so
+code references below should be assumed to refer to that file except
+as otherwise specified.
+
+Bonding allows two or more interfaces (the "slaves") to share network
+traffic.  From a high-level point of view, bonded interfaces act like
+a single port, but they have the bandwidth of multiple network
+devices, e.g. two 1 Gbps physical interfaces act like a single 2 Gbps
+interface.  Bonds also increase robustness: the bonded port does not
+go down as long as at least one of its slaves is up.
+
+In vswitchd, a bond always has at least two slaves (and may have
+more).  If a configuration error, etc. would cause a bond to have only
+one slave, the port becomes an ordinary port, not a bonded port, and
+none of the special features of bonded ports described in this section
+apply.
+
+There are many forms of bonding, but ovs-vswitchd currently implements
+only a single kind, called "source load balancing" or SLB bonding.
+SLB bonding divides traffic among the slaves based on the Ethernet
+source address.  This is useful only if the traffic over the bond has
+multiple Ethernet source addresses, for example if network traffic
+from multiple VMs is multiplexed over the bond.
+
+Enabling and Disabling Slaves
+-----------------------------
+
+When a bond is created, a slave is initially enabled or disabled based
+on whether carrier is detected on the NIC (see iface_create()).  After
+that, a slave is disabled if its carrier goes down for a period of
+time longer than the downdelay, and it is enabled if carrier comes up
+for longer than the updelay (see bond_link_status_update()).  There is
+one exception where the updelay is skipped: if no slaves at all are
+currently enabled, then the first slave on which carrier comes up is
+enabled immediately.
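The enable/disable rules above can be sketched as a small state update. This is an illustrative model only, not the code in bond_link_status_update(); the `Slave` class, its fields, and the function signature are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    """Hypothetical per-slave state; not the struct used in bridge.c."""
    name: str
    enabled: bool
    delay_expires: Optional[float] = None  # deadline of a pending change


def link_status_update(slaves: List[Slave], slave: Slave, carrier: bool,
                       now: float, updelay: float, downdelay: float) -> None:
    """Apply one carrier reading for `slave` at time `now` (seconds)."""
    if carrier == slave.enabled:
        # Carrier agrees with the current state: cancel any pending change.
        slave.delay_expires = None
        return
    if carrier and not any(s.enabled for s in slaves):
        # Exception: no slave is enabled at all, so skip the updelay.
        slave.enabled = True
        slave.delay_expires = None
        return
    if slave.delay_expires is None:
        # Start the updelay or downdelay timer.
        slave.delay_expires = now + (updelay if carrier else downdelay)
    if now >= slave.delay_expires:
        # Carrier has stayed in the new state for the whole delay.
        slave.enabled = carrier
        slave.delay_expires = None
```

Calling the function on every poll of the carrier state is enough: a reading that matches the current state cancels a pending timer, so a brief carrier flap never changes the slave's state.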
+The updelay should be set to a time longer than the STP forwarding
+delay of the physical switch to which the bond port is connected (if
+STP is enabled on that switch).  Otherwise, the slave will be enabled,
+and load may be shifted to it, before the physical switch starts
+forwarding packets on that port, which can cause some data to be
+"blackholed" for a time.  The exception for a single enabled slave
+does not cause any problem in this regard because when no slaves are
+enabled all output packets are blackholed anyway.
+
+When a slave becomes disabled, the vswitch immediately chooses a new
+output port for traffic that was destined for that slave (see
+bond_enable_slave()).  It also sends a "gratuitous learning packet" on
+the bond port (on the newly chosen slave) for each MAC address that
+the vswitch has learned on a port other than the bond (see
+bond_send_learning_packets()).  These packets teach the physical
+switch the new slave to use for packets destined for the vswitch's
+other, non-bonded ports.  (This behavior probably makes sense only for
+a vswitch that has only a single physical port (the bond); vswitchd
+should probably provide a way to disable or configure it in other
+scenarios.)
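As a rough sketch of the learning-packet step described above: for every MAC learned on a port other than the bond, one packet is emitted on whichever slave now carries that MAC. The function name, the dictionary-based MAC table, and the `choose_slave` callback are assumptions made for illustration; they are not taken from bond_send_learning_packets().

```python
from typing import Callable, Dict, List, Tuple


def learning_packets(mac_table: Dict[str, str], bond_port: str,
                     choose_slave: Callable[[str], str]
                     ) -> List[Tuple[str, str]]:
    """Return (source MAC, output slave) pairs for the packets to send.

    mac_table maps each learned MAC to the name of the port it was
    learned on (a hypothetical stand-in for the MAC learning table).
    """
    packets = []
    for mac, port in sorted(mac_table.items()):
        if port != bond_port:
            # Learned on a non-bond port: tell the physical switch which
            # slave now carries traffic for this MAC.
            packets.append((mac, choose_slave(mac)))
    return packets
```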
+
+Bond Packet Input
+-----------------
+
+Bond packet input processing takes place in process_flow().
+
+Bonding accepts unicast packets on any bond slave.  This can
+occasionally cause packet duplication for the first few packets sent
+to a given MAC, if the physical switch attached to the bond is
+flooding packets to that MAC because it has not yet learned the
+correct slave for that MAC.
+
+Bonding only accepts multicast (and broadcast) packets on a single
+bond slave (the "active slave") at any given time.  Multicast packets
+received on other slaves are dropped.  Otherwise, every multicast
+packet would be duplicated, once for every bond slave, because the
+physical switch attached to the bond will flood those packets.
+
+Bonding also drops some multicast packets received on the active
+slave: those for which the vswitch has learned that the packet's source
+MAC is on a port other than the bond port itself.  This is because it is
+likely
+that the vswitch itself sent the multicast packet out the bond port,
+on a slave other than the active slave, and is now receiving the
+packet back on the active slave.  However, the vswitch makes an
+exception to this rule for broadcast ARP replies, which indicate that
+the MAC has moved to another switch, probably due to VM migration.
+(ARP replies are normally unicast, so this exception does not match
+normal ARP replies.  It will match the learning packets sent on bond
+fail-over.)
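The input rules above can be summarized as a single accept/drop decision. The sketch below is a simplified model, not the logic in process_flow(): the function and parameter names are hypothetical, the MAC table is a plain dictionary, and it assumes that "the packet's MAC" means the Ethernet source address.

```python
from typing import Dict

BROADCAST = "ff:ff:ff:ff:ff:ff"


def is_multicast(mac: str) -> bool:
    """True if the group bit (lowest bit of the first octet) is set."""
    return int(mac.split(":")[0], 16) & 1 == 1


def accept_on_bond(dst_mac: str, src_mac: str, in_slave: str,
                   active_slave: str, learned_port: Dict[str, str],
                   bond_port: str, is_arp_reply: bool = False) -> bool:
    """Decide whether a packet received on a bond slave is accepted."""
    if not is_multicast(dst_mac):
        return True            # unicast is accepted on any slave
    if in_slave != active_slave:
        return False           # multicast only on the active slave
    port = learned_port.get(src_mac)
    if port is not None and port != bond_port:
        # Source MAC is known on another port, so this is probably our
        # own packet reflected back.  Broadcast ARP replies are exempt.
        return is_arp_reply and dst_mac == BROADCAST
    return True
```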
+The active slave is simply the first slave to be enabled after the
+bond is created (see bond_choose_active_iface()).  If the active slave
+is disabled, then a new active slave is chosen among the slaves that
+remain enabled.  Currently, due to the way that configuration works,
+this tends to be the remaining slave whose interface name is first
+alphabetically, but this is by no means guaranteed.
+
+Bond Packet Output
+------------------
+
+When a packet is sent out a bond port, the bond slave actually used is
+selected based on the packet's source MAC (see choose_output_iface()).
+In particular, the source MAC is hashed into one of 256 values, and
+that value is looked up in a hash table (the "bond hash") kept in the
+"bond_hash" member of struct port.  The hash table entry identifies a
+bond slave.  If no bond slave has yet been chosen for that hash table
+entry, vswitchd chooses one arbitrarily.
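A minimal sketch of the lookup described above, under stated assumptions: CRC32 stands in for whatever hash function vswitchd actually applies to the source MAC, "arbitrarily" is modeled as "the first enabled slave", and the function names are invented for the example rather than taken from choose_output_iface().

```python
import zlib
from typing import List, Optional

N_HASHES = 256  # size of the bond hash table, as described in the text


def mac_hash(mac: str) -> int:
    """Map a MAC to 0..255.  CRC32 is a stand-in, not vswitchd's hash."""
    raw = bytes(int(octet, 16) for octet in mac.split(":"))
    return zlib.crc32(raw) % N_HASHES


def choose_output_slave(bond_hash: List[Optional[str]], src_mac: str,
                        enabled_slaves: List[str]) -> Optional[str]:
    """Return the slave for src_mac, assigning one if needed."""
    if not enabled_slaves:
        return None                       # nothing to send on
    h = mac_hash(src_mac)
    if bond_hash[h] not in enabled_slaves:
        # Unassigned entry, or its slave was disabled: pick a new slave.
        # Taking the first enabled slave is an arbitrary choice made
        # for this sketch.
        bond_hash[h] = enabled_slaves[0]
    return bond_hash[h]
```

Because the table entry is reused once assigned, all packets with the same source MAC keep leaving on the same slave until a rebalance or fail-over moves the hash.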
+Every 10 seconds, vswitchd rebalances the bond slaves (see
+bond_rebalance_port()).  To rebalance, vswitchd examines the
+statistics for the number of bytes transmitted by each slave over
+approximately the past minute, with data sent more recently weighted
+more heavily than data sent less recently.  It considers each of the
+slaves in order from most-loaded to least-loaded.  If highly loaded
+slave H is significantly more heavily loaded than the least-loaded
+slave L, and slave H carries at least two hashes, then vswitchd shifts
+one of H's hashes to L.  However, vswitchd will not shift a hash from
+H to L if that will cause L's load to exceed H's load.
+
+Currently, "significantly more loaded" means that H must carry at
+least 1 Mbps more traffic, and that traffic must be at least 3%
+greater than L's.
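The rebalancing test can be sketched as follows. This is one reading of the rules above, not the code in bond_rebalance_port(): loads are taken in bits per second, "exceed H's load" is interpreted as comparing the two loads after the shift, and choosing the largest hash that fits is an assumption, since the text does not say which of H's hashes is moved.

```python
from typing import Dict, Optional


def significantly_more_loaded(h_bps: float, l_bps: float) -> bool:
    """H exceeds L by at least 1 Mbps and by at least 3% of L's load."""
    diff = h_bps - l_bps
    return diff >= 1_000_000 and diff >= 0.03 * l_bps


def pick_hash_to_shift(h_hashes: Dict[int, float], h_bps: float,
                       l_bps: float) -> Optional[int]:
    """Return the hash to move from slave H to slave L, or None.

    h_hashes maps each hash carried by H to the load it contributes.
    """
    if len(h_hashes) < 2 or not significantly_more_loaded(h_bps, l_bps):
        return None
    # Keep only hashes whose move would not leave L more loaded than H.
    fits = [(load, h) for h, load in h_hashes.items()
            if l_bps + load <= h_bps - load]
    if not fits:
        return None
    return max(fits)[1]  # assumption: move the largest hash that fits
```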
+
+Management Commands
+-------------------
+
+vswitchd provides the following bond-related commands.  These commands
+may be invoked using "ovs-appctl -t <target> -e '<command>'":
+bond/list
+    Lists all of the bonds, and their slaves, on each bridge.
+bond/show <port>
+    Lists all of the bond-specific information about the bond on the
+    given <port>: updelay, downdelay, time until the next rebalance.
+    Also lists information about each slave: whether it is enabled or
+    disabled, the time to completion of an updelay or downdelay if one
+    is in progress, whether it is the active slave, the MAC hashes
+    assigned to the slave, and the MAC learning table entries that
+    hash to each MAC hash.
+bond/migrate <port> <hash> <slave>
+    Assigns a given MAC hash to a new slave.  <port> specifies the
+    bond port, <hash> either the MAC hash to be migrated (as a decimal
+    number between 0 and 255) or an Ethernet address to be hashed, and
+    <slave> the new slave to be assigned.
+    The reassignment is not permanent: rebalancing or fail-over will
+    cause the MAC hash to be shifted to a new slave in the usual
+    manner.
+    A MAC hash cannot be migrated to a disabled slave.
+bond/set-active-slave <port> <slave>
+    Sets <slave> as the active slave on <port>.  <slave> must
+    currently be enabled.
+    The setting is not permanent: a new active slave will be selected
+    if <slave> becomes disabled.
+bond/enable-slave <port> <slave>
+bond/disable-slave <port> <slave>
+    Enables (or disables) <slave> on the given bond <port>.
+    This setting is not permanent: it persists only until the carrier
+    status of <slave> changes.
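The <hash> argument of bond/migrate takes two forms, a decimal hash or an Ethernet address. A hedged sketch of how such an argument could be interpreted is below; the function name is hypothetical, CRC32 is only a stand-in for vswitchd's real MAC hash, and the range check assumes the 256-entry bond hash described under Bond Packet Output.

```python
import re
import zlib

N_HASHES = 256  # the bond hash has 256 entries

_MAC_RE = r"(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}"


def parse_hash_arg(arg: str) -> int:
    """Turn a bond/migrate <hash> argument into a table index."""
    if re.fullmatch(r"[0-9]+", arg):
        value = int(arg)
        if value >= N_HASHES:
            raise ValueError("hash out of range: %s" % arg)
        return value
    if re.fullmatch(_MAC_RE, arg):
        raw = bytes(int(octet, 16) for octet in arg.split(":"))
        # Stand-in hash; vswitchd's actual function is not shown here.
        return zlib.crc32(raw) % N_HASHES
    raise ValueError("not a hash or Ethernet address: %s" % arg)
```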

