[ovs-dev] [PATCH] Add some developer documentation on the bonding implementation.
blp at nicira.com
Tue Sep 8 20:04:02 UTC 2009
CC: Justin Pettit <jpettit at nicira.com>
vswitchd/INTERNALS | 174 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 174 insertions(+), 0 deletions(-)
create mode 100644 vswitchd/INTERNALS
diff --git a/vswitchd/INTERNALS b/vswitchd/INTERNALS
new file mode 100644
@@ -0,0 +1,174 @@
+ ovs-vswitchd Internals
+This document describes some of the internals of the ovs-vswitchd
+process. It is not complete. It tends to be updated on demand, so if
+you have questions about the vswitchd implementation, ask them and
+perhaps we'll add some appropriate documentation here.
+Most of the ovs-vswitchd implementation is in vswitchd/bridge.c, so
+code references below should be assumed to refer to that file except
+as otherwise specified.
+Bonding allows two or more interfaces (the "slaves") to share network
+traffic. From a high-level point of view, bonded interfaces act like
+a single port, but they have the bandwidth of multiple network
+devices, e.g. two 1 Gbps physical interfaces act like a single 2 Gbps
+interface. Bonds also increase robustness: the bonded port does not
+go down as long as at least one of its slaves is up.
+In vswitchd, a bond always has at least two slaves (and may have
+more). If a configuration error, etc. would cause a bond to have only
+one slave, the port becomes an ordinary port, not a bonded port, and
+none of the special features of bonded ports described in this section
+apply.
+There are many forms of bonding, but ovs-vswitchd currently implements
+only a single kind, called "source load balancing" or SLB bonding.
+SLB bonding divides traffic among the slaves based on the Ethernet
+source address. This is useful only if the traffic over the bond has
+multiple Ethernet source addresses, for example if network traffic
+from multiple VMs is multiplexed over the bond.
+Enabling and Disabling Slaves
+When a bond is created, a slave is initially enabled or disabled based
+on whether carrier is detected on the NIC (see iface_create()). After
+that, a slave is disabled if its carrier goes down for a period of
+time longer than the downdelay, and it is enabled if carrier comes up
+for longer than the updelay (see bond_link_status_update()). There is
+one exception where the updelay is skipped: if no slaves at all are
+currently enabled, then the first slave on which carrier comes up is
+enabled immediately.
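The enable/disable rules in this paragraph can be sketched as a small state update. This is an illustrative Python model only, not the code in bond_link_status_update(): the Slave class, the link_status_update() helper, and all of its parameters are hypothetical names, and times are plain seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    name: str
    enabled: bool
    # Time at which a pending enable/disable takes effect, if any.
    delay_expires: Optional[float] = None


def link_status_update(slaves: List[Slave], slave: Slave, carrier: bool,
                       now: float, updelay: float, downdelay: float) -> None:
    """Apply one carrier reading for `slave` taken at time `now`."""
    if carrier == slave.enabled:
        # Carrier agrees with the current state: cancel any pending change.
        slave.delay_expires = None
        return
    if carrier and not any(s.enabled for s in slaves):
        # Exception: no slave is enabled at all, so the updelay is skipped.
        slave.enabled = True
        slave.delay_expires = None
        return
    if slave.delay_expires is None:
        # Start the updelay or downdelay timer.
        slave.delay_expires = now + (updelay if carrier else downdelay)
    if now >= slave.delay_expires:
        slave.enabled = carrier
        slave.delay_expires = None
```

Calling the helper repeatedly with the latest carrier reading is enough to drive the model: a carrier change that reverts before its delay expires simply cancels the pending transition.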
+The updelay should be set to a time longer than the STP forwarding
+delay of the physical switch to which the bond port is connected (if
+STP is enabled on that switch). Otherwise, the slave will be enabled,
+and load may be shifted to it, before the physical switch starts
+forwarding packets on that port, which can cause some data to be
+"blackholed" for a time. The exception for a single enabled slave
+does not cause any problem in this regard because when no slaves are
+enabled all output packets are blackholed anyway.
+When a slave becomes disabled, the vswitch immediately chooses a new
+output port for traffic that was destined for that slave (see
+bond_enable_slave()). It also sends a "gratuitous learning packet" on
+the bond port (on the newly chosen slave) for each MAC address that
+the vswitch has learned on a port other than the bond (see
+bond_send_learning_packets()). These packets teach the physical
+switch the new slave to use for packets destined for the vswitch's
+other, non-bonded ports. (This behavior probably makes sense only for
+a vswitch that has only a single physical port (the bond); vswitchd
+should probably provide a way to disable or configure it in other
+scenarios.)
+Bond Packet Input
+Bond packet input processing takes place in process_flow().
+Bonding accepts unicast packets on any bond slave. This can
+occasionally cause packet duplication for the first few packets sent
+to a given MAC, if the physical switch attached to the bond is
+flooding packets to that MAC because it has not yet learned the
+correct slave for that MAC.
+Bonding only accepts multicast (and broadcast) packets on a single
+bond slave (the "active slave") at any given time. Multicast packets
+received on other slaves are dropped. Otherwise, every multicast
+packet would be duplicated, once for every bond slave, because the
+physical switch attached to the bond will flood those packets.
+Bonding also drops some multicast packets received on the active
+slave: those for which the vswitch has learned that the packet's MAC
+is on a port other than the bond port itself. This is because it is
+likely that the vswitch itself sent the multicast packet out the bond
+port, on a slave other than the active slave, and is now receiving the
+packet back on the active slave. However, the vswitch makes an
+exception to this rule for broadcast ARP replies, which indicate that
+the MAC has moved to another switch, probably due to VM migration.
+(ARP replies are normally unicast, so this exception does not match
+normal ARP replies. It will match the learning packets sent on bond
+fail-over.)
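The input rules in this section can be summarized as one decision function. The sketch below is a Python illustration, not the logic in process_flow(): accept_on_bond() and its arguments are hypothetical, the learning table is modeled as a plain dict from MAC to port name, and it assumes that "the packet's MAC" refers to the Ethernet source address.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


def is_multicast(mac: str) -> bool:
    """True if the group bit (low bit of the first octet) is set."""
    return int(mac.split(":")[0], 16) & 1 == 1


def accept_on_bond(dst_mac: str, src_mac: str, in_slave: str,
                   active_slave: str, learned_port: dict,
                   bond_port: str, is_arp_reply: bool = False) -> bool:
    """Decide whether a packet received on bond slave `in_slave` is accepted."""
    if not is_multicast(dst_mac):
        # Unicast packets are accepted on any slave.
        return True
    if in_slave != active_slave:
        # Multicast and broadcast are accepted only on the active slave.
        return False
    src_port = learned_port.get(src_mac)
    if src_port is not None and src_port != bond_port:
        # Probably our own packet reflected back by the physical switch,
        # unless it is a broadcast ARP reply (e.g. after VM migration).
        return is_arp_reply and dst_mac == BROADCAST
    return True
```

For example, a broadcast whose source MAC was learned on a local VM port is dropped even on the active slave, while the same packet flagged as a broadcast ARP reply is accepted.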
+The active slave is simply the first slave to be enabled after the
+bond is created (see bond_choose_active_iface()). If the active slave
+is disabled, then a new active slave is chosen among the slaves that
+remain enabled. Currently, due to the way that configuration works,
+this tends to be the remaining slave whose interface name is first
+alphabetically, but this is by no means guaranteed.
+Bond Packet Output
+When a packet is sent out a bond port, the bond slave actually used is
+selected based on the packet's source MAC (see choose_output_iface()).
+In particular, the source MAC is hashed into one of 256 values, and
+that value is looked up in a hash table (the "bond hash") kept in the
+"bond_hash" member of struct port. The hash table entry identifies a
+bond slave. If no bond slave has yet been chosen for that hash table
+entry, vswitchd chooses one arbitrarily.
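As a rough illustration of the lookup described above, the following Python sketch keeps a 256-entry table and fills entries lazily. It is not the code in choose_output_iface(): the hash function is a stand-in (CRC32 folded to 8 bits), since the actual hash is not described here, and "arbitrarily" is modeled as picking the first enabled slave. All names are hypothetical.

```python
import zlib
from typing import List, Optional

BOND_HASH_SIZE = 256


def hash_mac(mac: bytes) -> int:
    """Stand-in hash: map a 6-byte MAC to a value in 0..255."""
    return zlib.crc32(mac) & 0xFF


def choose_output_slave(bond_hash: List[Optional[str]], src_mac: bytes,
                        enabled_slaves: List[str]) -> Optional[str]:
    """Return the slave used for packets with source address `src_mac`."""
    idx = hash_mac(src_mac)
    if bond_hash[idx] is None and enabled_slaves:
        # No slave chosen yet for this entry: pick one arbitrarily
        # (here, simply the first enabled slave).
        bond_hash[idx] = enabled_slaves[0]
    return bond_hash[idx]


# One table per bond port, initially with no assignments.
bond_hash = [None] * BOND_HASH_SIZE
```

Because the table is indexed by hash rather than by MAC, all source addresses that fall into the same entry share a slave and move together when that entry is reassigned.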
+Every 10 seconds, vswitchd rebalances the bond slaves (see
+bond_rebalance_port()). To rebalance, vswitchd examines the
+statistics for the number of bytes transmitted by each slave over
+approximately the past minute, with data sent more recently weighted
+more heavily than data sent less recently. It considers each of the
+slaves in order from most-loaded to least-loaded. If highly loaded
+slave H is significantly more heavily loaded than the least-loaded
+slave L, and slave H carries at least two hashes, then vswitchd shifts
+one of H's hashes to L. However, vswitchd will not shift a hash from
+H to L if that will cause L's load to exceed H's load.
+Currently, "significantly more loaded" means that H must carry at
+least 1 Mbps more traffic, and that traffic must be at least 3%
+greater than L's.
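The rebalancing rule can be sketched in Python as follows. This is a simplified model under stated assumptions, not the code in bond_rebalance_port(): loads are given directly in bits per second rather than derived from weighted byte counters, "3% greater" is read as H's load being at least 1.03 times L's, the "will not exceed" check compares the loads after the move, and the choice of which hash to move (the largest that fits) is a guess. All names are hypothetical.

```python
MIN_DIFF_BPS = 1_000_000   # H must carry at least 1 Mbps more than L
MIN_RATIO = 1.03           # and at least 3% more than L


def rebalance_once(hash_load: dict, hash_slave: dict, slaves: list):
    """Try to shift one hash from a heavily loaded slave to the least-loaded one.

    hash_load maps hash -> load in bps; hash_slave maps hash -> slave name.
    Returns (hash, from_slave, to_slave) if a hash was moved, else None.
    """
    def load(slave):
        return sum(hash_load[h] for h, s in hash_slave.items() if s == slave)

    order = sorted(slaves, key=load, reverse=True)   # most-loaded first
    low = order[-1]
    for high in order[:-1]:
        h_load, l_load = load(high), load(low)
        hashes = [h for h, s in hash_slave.items() if s == high]
        if len(hashes) < 2:
            continue   # H must keep at least one hash
        if h_load - l_load < MIN_DIFF_BPS or h_load < l_load * MIN_RATIO:
            continue   # not "significantly more loaded"
        # Assumption: prefer the largest hash that does not overshoot.
        for h in sorted(hashes, key=lambda k: hash_load[k], reverse=True):
            moved = hash_load[h]
            if l_load + moved <= h_load - moved:
                hash_slave[h] = low
                return (h, high, low)
    return None
```

With eth0 carrying hashes of 6 Mbps and 3 Mbps and eth1 carrying 1 Mbps, one call moves the 3 Mbps hash to eth1; moving the 6 Mbps hash instead would leave eth1 more heavily loaded than eth0, so it is skipped.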
+vswitchd provides the following bond-related commands. These commands
+may be invoked using "ovs-appctl -t <target> -e '<command>'":
+ Lists all of the bonds, and their slaves, on each bridge.
+ Lists all of the bond-specific information about the bond on the
+ given <port>: updelay, downdelay, time until the next rebalance.
+ Also lists information about each slave: whether it is enabled or
+ disabled, the time to completion of an updelay or downdelay if one
+ is in progress, whether it is the active slave, the MAC hashes
+ assigned to the slave, and the MAC learning table entries that
+ hash to each MAC hash.
+bond/migrate <port> <hash> <slave>
+ Assigns a given MAC hash to a new slave. <port> specifies the
+ bond port, <hash> either the MAC hash to be migrated (as a decimal
+ number between 0 and 255) or an Ethernet address to be hashed, and
+ <slave> the new slave to be assigned.
+ The reassignment is not permanent: rebalancing or fail-over will
+ cause the MAC hash to be shifted to a new slave in the usual
+ way.
+ A MAC hash cannot be migrated to a disabled slave.
+bond/set-active-slave <port> <slave>
+ Sets <slave> as the active slave on <port>. <slave> must
+ currently be enabled.
+ The setting is not permanent: a new active slave will be selected
+ if <slave> becomes disabled.
+bond/enable-slave <port> <slave>
+bond/disable-slave <port> <slave>
+ Enables (or disables) <slave> on the given bond <port>.
+ This setting is not permanent: it persists only until the carrier
+ status of <slave> changes.