[ovs-dev] [PATCH 14/14] ovn: Implement basic ARP support for L3 logical routers.

Ben Pfaff blp at ovn.org
Wed Dec 9 01:08:17 UTC 2015


This is sufficient support that an L3 logical router can now transmit
packets to VMs (and other destinations) without having to know the
IP-to-MAC binding in advance.  The details are carefully documented in all
of the appropriate places.

There are several important caveats that need to be fixed before this can
be taken seriously in production.  These are documented in ovn/TODO.  The
most important of these are renewal, expiration, and limiting the size of
the ARP table.

Signed-off-by: Ben Pfaff <blp at ovn.org>
---
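Reviewer note (not part of the commit): the put_arp handling in pinctrl.c
buffers bindings because the southbound database can only be written when a
transaction is available.  The following is a minimal standalone sketch of
that buffering policy, not the code in the patch.  The names
pending_binding, queue_binding, and run_bindings are hypothetical, the
buffer is shrunk from 1024 entries to 4, and the database write is reduced
to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 4        /* The patch uses 1024; small here for clarity. */
#define MAX_AGE_MSEC 1000    /* Entries older than this are discarded. */

/* Hypothetical stand-in for the patch's "struct put_arp". */
struct pending_binding {
    long long int timestamp;    /* In milliseconds. */
    uint32_t ip;
};

static struct pending_binding pending[MAX_PENDING];
static size_t n_pending;

/* Queues a binding.  Returns false if the buffer is full, in which case the
 * binding is silently dropped. */
static bool
queue_binding(uint32_t ip, long long int now)
{
    if (n_pending >= MAX_PENDING) {
        return false;
    }
    pending[n_pending].timestamp = now;
    pending[n_pending].ip = ip;
    n_pending++;
    return true;
}

/* Applies the queued bindings if a transaction is available, skipping any
 * that have waited longer than MAX_AGE_MSEC, and then empties the buffer.
 * Returns the number of bindings applied.  Without a transaction, nothing
 * is applied and the buffer is left alone. */
static size_t
run_bindings(bool txn_available, long long int now)
{
    assert(n_pending <= MAX_PENDING);
    if (!txn_available) {
        return 0;
    }

    size_t n_applied = 0;
    for (size_t i = 0; i < n_pending; i++) {
        if (now > pending[i].timestamp + MAX_AGE_MSEC) {
            continue;           /* Too old to trust. */
        }
        n_applied++;            /* A real caller would update a row here. */
    }
    n_pending = 0;
    return n_applied;
}
```

In this sketch, a binding queued at t=0 and another at t=900 are both kept
while no transaction is available; once one is available at t=1500, only
the second is applied and the buffer is emptied.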
 ovn/TODO                        |  55 ++---------
 ovn/controller/lflow.c          |  82 ++++++++++++++--
 ovn/controller/lflow.h          |   1 +
 ovn/controller/ovn-controller.c |   9 +-
 ovn/controller/pinctrl.c        | 206 ++++++++++++++++++++++++++++++++++------
 ovn/controller/pinctrl.h        |   5 +-
 ovn/lib/actions.c               | 165 ++++++++++++++++++++++++++++++++
 ovn/lib/actions.h               |  11 +++
 ovn/lib/expr.c                  |  53 +++++++++++
 ovn/lib/expr.h                  |   3 +
 ovn/northd/ovn-northd.8.xml     | 112 +++++++++++++---------
 ovn/northd/ovn-northd.c         | 105 +++++++++++++++-----
 ovn/ovn-architecture.7.xml      |  78 ++++++++++-----
 ovn/ovn-sb.ovsschema            |  15 ++-
 ovn/ovn-sb.xml                  | 137 +++++++++++++++++++++++++-
 ovn/utilities/ovn-sbctl.c       |   4 +
 tests/ovn.at                    | 174 ++++++++++++++++++++++++++++++---
 tests/test-ovn.c                |   1 +
 18 files changed, 1015 insertions(+), 201 deletions(-)
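
Reviewer note (not part of the commit): setup_args() and restore_args() in
ovn/lib/actions.c move action arguments into fixed fields through the
OpenFlow stack, so that sources and destinations may overlap.  The
following is a minimal standalone sketch of that push/pop ordering on a
plain array of registers.  The names push, pop, struct move, setup_moves,
and restore_moves are hypothetical and model only the ordering, not the
ofpacts the patch emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_MAX 16

static uint32_t stack[STACK_MAX];
static size_t depth;

static void
push(uint32_t value)
{
    assert(depth < STACK_MAX);
    stack[depth++] = value;
}

static uint32_t
pop(void)
{
    assert(depth > 0);
    return stack[--depth];
}

/* Copy regs[src] into regs[dst]; both are indexes into the register array. */
struct move {
    size_t src;
    size_t dst;
};

static void
setup_moves(uint32_t regs[], const struct move moves[], size_t n)
{
    /* 1. Save all of the destinations that will be modified. */
    for (size_t i = 0; i < n; i++) {
        if (moves[i].src != moves[i].dst) {
            push(regs[moves[i].dst]);
        }
    }

    /* 2. Push the sources, in reverse order, while they are still intact. */
    for (size_t i = n; i-- > 0; ) {
        if (moves[i].src != moves[i].dst) {
            push(regs[moves[i].src]);
        }
    }

    /* 3. Pop the sources into the destinations, in forward order. */
    for (size_t i = 0; i < n; i++) {
        if (moves[i].src != moves[i].dst) {
            regs[moves[i].dst] = pop();
        }
    }
}

/* Undoes setup_moves() by popping the saved destinations in reverse order. */
static void
restore_moves(uint32_t regs[], const struct move moves[], size_t n)
{
    for (size_t i = n; i-- > 0; ) {
        if (moves[i].src != moves[i].dst) {
            regs[moves[i].dst] = pop();
        }
    }
}
```

Because every source is pushed before any destination is written, the moves
behave like a parallel assignment: a pair of moves that exchanges two
registers swaps them correctly, and restore_moves() puts both back.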

diff --git a/ovn/TODO b/ovn/TODO
index a827421..bdbf86f 100644
--- a/ovn/TODO
+++ b/ovn/TODO
@@ -4,18 +4,11 @@
 
 ** New OVN logical actions
 
-*** arp
-
-Generates an ARP packet based on the current IPv4 packet and allows it
-to be processed as part of the current pipeline (and then pop back to
-processing the original IPv4 packet).
+*** rate_limit
 
 TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
 one per second for a given target.  We might need to do this too.
 
-We probably need to buffer the packet that generated the ARP.  I don't
-know where to do that.
-
 *** icmp4 { action... }
 
 Generates an ICMPv4 packet based on the current IPv4 packet and
@@ -117,37 +110,13 @@ userspace-only and no one has complained yet.)
 
 ** Dynamic IP to MAC bindings
 
-Some bindings from IP address to MAC will undoubtedly need to be
-discovered dynamically through ARP requests.  It's straightforward
-enough for a logical L3 router to generate ARP requests and forward
-them to the appropriate switch.
-
-It's more difficult to figure out where the reply should be processed
-and stored.  It might seem at first that a first-cut implementation
-could just keep track of the binding on the hypervisor that needs to
-know, but that can't happen easily because the VM that sends the reply
-might not be on the same HV as the VM that needs the answer (that is,
-the VM that sent the packet that needs the binding to be resolved) and
-there isn't an easy way for it to know which HV needs the answer.
-
-Thus, the HV that processes the ARP reply (which is unknown when the
-ARP is sent) has to tell all the HVs the binding.  The most obvious
-place for this in the OVN_Southbound database.
-
-Details need to be worked out, including:
-
-*** OVN_Southbound schema changes.
+OVN has basic support for establishing IP to MAC bindings dynamically,
+using ARP.
 
-Possibly bindings could be added to the Port_Binding table by adding
-or modifying columns.  Another possibility is that another table
-should be added.
+*** Ratelimiting.
 
-*** Logical_Flow representation
-
-It would be really nice to maintain the general-purpose nature of
-logical flows, but these bindings might have to include some
-hard-coded special cases, especially when it comes to the relationship
-with populating the bindings into the OVN_Southbound table.
+From casual observation, Linux appears to generate at most one ARP per
+second per destination.
 
 *** Tracking queries
 
@@ -161,16 +130,12 @@ into the database.
 Something needs to make sure that bindings remain valid and expire
 those that become stale.
 
-** MTU handling (fragmentation on output)
-
-** Ratelimiting.
+*** Table size limiting.
 
-*** ARP.
+The table of MAC bindings must not be allowed to grow unreasonably
+large.
 
-*** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ...
-
-As a point of comparison, Linux doesn't ratelimit TCP resets but I
-think it does everything else.
+** MTU handling (fragmentation on output)
 
 * ovn-controller
 
diff --git a/ovn/controller/lflow.c b/ovn/controller/lflow.c
index 2bfc9e1..f6803df 100644
--- a/ovn/controller/lflow.c
+++ b/ovn/controller/lflow.c
@@ -178,14 +178,12 @@ lookup_port_cb(const void *aux_, const char *port_name, unsigned int *portp)
     return false;
 }
 
-/* Translates logical flows in the Logical_Flow table in the OVN_SB database
- * into OpenFlow flows.  See ovn-architecture(7) for more information. */
-void
-lflow_run(struct controller_ctx *ctx, const struct lport_index *lports,
-          const struct mcgroup_index *mcgroups,
-          const struct simap *ct_zones, struct hmap *flow_table)
+/* Adds the logical flows from the Logical_Flow table to 'flow_table'. */
+static void
+add_logical_flows(struct controller_ctx *ctx, const struct lport_index *lports,
+                  const struct mcgroup_index *mcgroups,
+                  const struct simap *ct_zones, struct hmap *flow_table)
 {
-    struct hmap flows = HMAP_INITIALIZER(&flows);
     uint32_t conj_id_ofs = 1;
 
     const struct sbrec_logical_flow *lflow;
@@ -224,6 +222,7 @@ lflow_run(struct controller_ctx *ctx, const struct lport_index *lports,
             .first_ptable = first_ptable,
             .cur_ltable = lflow->table_id,
             .output_ptable = output_ptable,
+            .arp_ptable = OFTABLE_MAC_BINDING,
         };
         error = actions_parse_string(lflow->actions, &ap, &ofpacts, &prereqs);
         if (error) {
@@ -300,6 +299,75 @@ lflow_run(struct controller_ctx *ctx, const struct lport_index *lports,
     }
 }
 
+static void
+put_load(const uint8_t *data, size_t len,
+         enum mf_field_id dst, int ofs, int n_bits,
+         struct ofpbuf *ofpacts)
+{
+    struct ofpact_set_field *sf = ofpact_put_SET_FIELD(ofpacts);
+    sf->field = mf_from_id(dst);
+    sf->flow_has_vlan = false;
+
+    bitwise_copy(data, len, 0, &sf->value, sf->field->n_bytes, ofs, n_bits);
+    bitwise_one(&sf->mask, sf->field->n_bytes, ofs, n_bits);
+}
+
+/* Adds an OFTABLE_MAC_BINDING flow for each row in the MAC_Binding table. */
+static void
+add_neighbor_flows(struct controller_ctx *ctx,
+                   const struct lport_index *lports, struct hmap *flow_table)
+{
+    struct ofpbuf ofpacts;
+    struct match match;
+    match_init_catchall(&match);
+    ofpbuf_init(&ofpacts, 0);
+
+    const struct sbrec_mac_binding *b;
+    SBREC_MAC_BINDING_FOR_EACH (b, ctx->ovnsb_idl) {
+        const struct sbrec_port_binding *pb
+            = lport_lookup_by_name(lports, b->logical_port);
+        if (!pb) {
+            continue;
+        }
+
+        struct eth_addr mac;
+        if (!eth_addr_from_string(b->mac, &mac)) {
+            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
+            VLOG_WARN_RL(&rl, "bad 'mac' %s", b->mac);
+            continue;
+        }
+
+        ovs_be32 ip;
+        if (!ip_parse(b->ip, &ip)) {
+            static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
+            VLOG_WARN_RL(&rl, "bad 'ip' %s", b->ip);
+            continue;
+        }
+
+        match_set_metadata(&match, htonll(pb->datapath->tunnel_key));
+        match_set_reg(&match, MFF_LOG_OUTPORT - MFF_REG0, pb->tunnel_key);
+        match_set_reg(&match, 0, ntohl(ip));
+
+        ofpbuf_clear(&ofpacts);
+        put_load(mac.ea, sizeof mac.ea, MFF_ETH_DST, 0, 48, &ofpacts);
+
+        ofctrl_add_flow(flow_table, OFTABLE_MAC_BINDING, 100,
+                        &match, &ofpacts);
+    }
+    ofpbuf_uninit(&ofpacts);
+}
+
+/* Translates logical flows in the Logical_Flow table in the OVN_SB database
+ * into OpenFlow flows.  See ovn-architecture(7) for more information. */
+void
+lflow_run(struct controller_ctx *ctx, const struct lport_index *lports,
+          const struct mcgroup_index *mcgroups,
+          const struct simap *ct_zones, struct hmap *flow_table)
+{
+    add_logical_flows(ctx, lports, mcgroups, ct_zones, flow_table);
+    add_neighbor_flows(ctx, lports, flow_table);
+}
+
 void
 lflow_destroy(void)
 {
diff --git a/ovn/controller/lflow.h b/ovn/controller/lflow.h
index b8c53ce..33dd9ff 100644
--- a/ovn/controller/lflow.h
+++ b/ovn/controller/lflow.h
@@ -53,6 +53,7 @@ struct uuid;
 #define OFTABLE_DROP_LOOPBACK        34
 #define OFTABLE_LOG_EGRESS_PIPELINE  48 /* First of LOG_PIPELINE_LEN tables. */
 #define OFTABLE_LOG_TO_PHY           64
+#define OFTABLE_MAC_BINDING          65
 
 /* The number of tables for the ingress and egress pipelines. */
 #define LOG_PIPELINE_LEN 16
diff --git a/ovn/controller/ovn-controller.c b/ovn/controller/ovn-controller.c
index f5dbecc..b2a2218 100644
--- a/ovn/controller/ovn-controller.c
+++ b/ovn/controller/ovn-controller.c
@@ -299,7 +299,7 @@ main(int argc, char *argv[])
 
             enum mf_field_id mff_ovn_geneve = ofctrl_run(br_int);
 
-            pinctrl_run(&ctx, br_int);
+            pinctrl_run(&ctx, &lports, br_int);
 
             struct hmap flow_table = HMAP_INITIALIZER(&flow_table);
             lflow_run(&ctx, &lports, &mcgroups, &ct_zones, &flow_table);
@@ -320,13 +320,12 @@ main(int argc, char *argv[])
             poll_immediate_wake();
         }
 
-        ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop);
-        ovsdb_idl_loop_commit_and_wait(&ovs_idl_loop);
-
         if (br_int) {
             ofctrl_wait();
-            pinctrl_wait();
+            pinctrl_wait(&ctx);
         }
+        ovsdb_idl_loop_commit_and_wait(&ovnsb_idl_loop);
+        ovsdb_idl_loop_commit_and_wait(&ovs_idl_loop);
         poll_block();
         if (should_service_stop()) {
             exiting = true;
diff --git a/ovn/controller/pinctrl.c b/ovn/controller/pinctrl.c
index 8c53c19..1ccdbf3 100644
--- a/ovn/controller/pinctrl.c
+++ b/ovn/controller/pinctrl.c
@@ -15,14 +15,23 @@
  */
 
 #include <config.h>
-#include "dirs.h"
+
 #include "pinctrl.h"
+
+#include "dirs.h"
+#include "dp-packet.h"
+#include "lport.h"
+#include "ovn/lib/actions.h"
+#include "ovn/lib/logical-fields.h"
 #include "ofp-msgs.h"
 #include "ofp-print.h"
 #include "ofp-util.h"
 #include "rconn.h"
 #include "openvswitch/vlog.h"
+#include "ovn-controller.h"
+#include "poll-loop.h"
 #include "socket-util.h"
+#include "timeval.h"
 #include "vswitch-idl.h"
 
 VLOG_DEFINE_THIS_MODULE(pinctrl);
@@ -34,6 +43,12 @@ static struct rconn *swconn;
  * rconn_get_connection_seqno(rconn), 'swconn' has reconnected. */
 static unsigned int conn_seq_no;
 
+static void process_put_arp(const struct lport_index *,
+                            const struct flow *md, const struct flow *headers);
+static void flush_put_arps(void);
+static void run_put_arps(struct controller_ctx *);
+static void wait_put_arps(struct controller_ctx *);
+
 void
 pinctrl_init(void)
 {
@@ -74,24 +89,40 @@ set_switch_config(struct rconn *swconn, const struct ofp_switch_config *config)
 }
 
 static void
-process_packet_in(struct controller_ctx *ctx OVS_UNUSED,
+process_packet_in(const struct lport_index *lports,
                   const struct ofp_header *msg)
 {
-    struct ofputil_packet_in pin;
+    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
 
-    if (ofputil_decode_packet_in(&pin, msg) != 0) {
-        return;
-    }
-    if (pin.reason != OFPR_ACTION) {
+    struct ofputil_packet_in pin;
+    enum ofperr error = ofputil_decode_packet_in(&pin, msg);
+    if (error) {
+        VLOG_WARN_RL(&rl, "error decoding packet-in: %s",
+                     ofperr_to_string(error));
         return;
     }
 
-    /* XXX : process the received packet */
+    struct dp_packet packet;
+    dp_packet_use_const(&packet, pin.packet, pin.packet_len);
+    struct flow headers;
+    flow_extract(&packet, &headers);
+
+    const struct flow *md = &pin.flow_metadata.flow;
+    switch (md->regs[0]) {
+    case ACTION_OPCODE_PUT_ARP:
+        process_put_arp(lports, md, &headers);
+        break;
+
+    default:
+        VLOG_WARN_RL(&rl, "unrecognized packet-in command %#"PRIx32,
+                     md->regs[0]);
+        break;
+    }
 }
 
 static void
-pinctrl_recv(struct controller_ctx *ctx, const struct ofp_header *oh,
-             enum ofptype type)
+pinctrl_recv(const struct lport_index *lports,
+             const struct ofp_header *oh, enum ofptype type)
 {
     if (type == OFPTYPE_ECHO_REQUEST) {
         queue_msg(make_echo_reply(oh));
@@ -105,7 +136,7 @@ pinctrl_recv(struct controller_ctx *ctx, const struct ofp_header *oh,
         config.miss_send_len = htons(UINT16_MAX);
         set_switch_config(swconn, &config);
     } else if (type == OFPTYPE_PACKET_IN) {
-        process_packet_in(ctx, oh);
+        process_packet_in(lports, oh);
     } else if (type != OFPTYPE_ECHO_REPLY && type != OFPTYPE_BARRIER_REPLY) {
         if (VLOG_IS_DBG_ENABLED()) {
             static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(30, 300);
@@ -119,7 +150,8 @@ pinctrl_recv(struct controller_ctx *ctx, const struct ofp_header *oh,
 }
 
 void
-pinctrl_run(struct controller_ctx *ctx, const struct ovsrec_bridge *br_int)
+pinctrl_run(struct controller_ctx *ctx, const struct lport_index *lports,
+            const struct ovsrec_bridge *br_int)
 {
     if (br_int) {
         char *target;
@@ -136,32 +168,36 @@ pinctrl_run(struct controller_ctx *ctx, const struct ovsrec_bridge *br_int)
 
     rconn_run(swconn);
 
-    if (!rconn_is_connected(swconn)) {
-        return;
-    }
+    if (rconn_is_connected(swconn)) {
+        if (conn_seq_no != rconn_get_connection_seqno(swconn)) {
+            get_switch_config(swconn);
+            conn_seq_no = rconn_get_connection_seqno(swconn);
+            flush_put_arps();
+        }
 
-    if (conn_seq_no != rconn_get_connection_seqno(swconn)) {
-        get_switch_config(swconn);
-        conn_seq_no = rconn_get_connection_seqno(swconn);
-    }
+        /* Process a limited number of messages per call. */
+        for (int i = 0; i < 50; i++) {
+            struct ofpbuf *msg = rconn_recv(swconn);
+            if (!msg) {
+                break;
+            }
 
-    struct ofpbuf *msg = rconn_recv(swconn);
+            const struct ofp_header *oh = msg->data;
+            enum ofptype type;
 
-    if (!msg) {
-        return;
+            ofptype_decode(&type, oh);
+            pinctrl_recv(lports, oh, type);
+            ofpbuf_delete(msg);
+        }
     }
 
-    const struct ofp_header *oh = msg->data;
-    enum ofptype type;
-
-    ofptype_decode(&type, oh);
-    pinctrl_recv(ctx, oh, type);
-    ofpbuf_delete(msg);
+    run_put_arps(ctx);
 }
 
 void
-pinctrl_wait(void)
+pinctrl_wait(struct controller_ctx *ctx)
 {
+    wait_put_arps(ctx);
     rconn_run_wait(swconn);
     rconn_recv_wait(swconn);
 }
@@ -170,4 +206,116 @@ void
 pinctrl_destroy(void)
 {
     rconn_destroy(swconn);
+    flush_put_arps();
+}
+
+/* Implementation of the "put_arp" OVN action.  This action sends a packet to
+ * ovn-controller, using the flow as an API (see actions.h for details).  This
+ * code implements the action by updating the MAC_Binding table in the
+ * southbound database.
+ *
+ * This code could be a lot simpler if the database could always be updated,
+ * but in fact we can only update it when ctx->ovnsb_idl_txn is nonnull.  Thus,
+ * we buffer up a few put_arps (but we don't keep them longer than 1 second)
+ * and apply them whenever a database transaction is available. */
+
+/* Buffered "put_arp" operations. */
+struct put_arp {
+    long long int timestamp;    /* In milliseconds. */
+    char *logical_port;
+    ovs_be32 ip;
+    struct eth_addr mac;
+};
+static struct put_arp put_arps[1024];
+static size_t n_put_arps;
+
+static void
+process_put_arp(const struct lport_index *lports,
+                const struct flow *md, const struct flow *headers)
+{
+    if (n_put_arps >= ARRAY_SIZE(put_arps)) {
+        return;
+    }
+
+    /* Convert logical datapath and logical port key into lport. */
+    uint32_t dp_key = ntohll(md->metadata);
+    uint32_t port_key = md->regs[MFF_LOG_INPORT - MFF_REG0];
+    const struct sbrec_port_binding *pb
+        = lport_lookup_by_key(lports, dp_key, port_key);
+    if (!pb) {
+        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
+
+        VLOG_WARN_RL(&rl, "unknown logical port with datapath %"PRIu32" and "
+                     "port %"PRIu32, dp_key, port_key);
+        return;
+    }
+
+    struct put_arp *pa = &put_arps[n_put_arps++];
+    pa->timestamp = time_msec();
+    pa->logical_port = xstrdup(pb->logical_port);
+    pa->ip = htonl(md->regs[1]);
+    pa->mac = headers->dl_src;
+}
+
+static void
+flush_put_arps(void)
+{
+    for (struct put_arp *pa = put_arps; pa < &put_arps[n_put_arps]; pa++) {
+        free(pa->logical_port);
+    }
+    n_put_arps = 0;
+}
+
+static void
+run_put_arps(struct controller_ctx *ctx)
+{
+    if (!ctx->ovnsb_idl_txn) {
+        return;
+    }
+
+    for (const struct put_arp *pa = put_arps; pa < &put_arps[n_put_arps];
+         pa++) {
+        if (time_msec() > pa->timestamp + 1000) {
+            continue;
+        }
+
+        /* Convert arguments to string form for database. */
+        char ip_string[INET_ADDRSTRLEN + 1];
+        snprintf(ip_string, sizeof ip_string, IP_FMT, IP_ARGS(pa->ip));
+        char mac_string[ETH_ADDR_STRLEN + 1];
+        snprintf(mac_string, sizeof mac_string,
+                 ETH_ADDR_FMT, ETH_ADDR_ARGS(pa->mac));
+
+        /* Check for and update an existing IP-MAC binding for this logical
+         * port.
+         *
+         * XXX This is not very efficient. */
+        const struct sbrec_mac_binding *b;
+        SBREC_MAC_BINDING_FOR_EACH (b, ctx->ovnsb_idl) {
+            if (!strcmp(b->logical_port, pa->logical_port)
+                && !strcmp(b->ip, ip_string)) {
+                if (strcmp(b->mac, mac_string)) {
+                    sbrec_mac_binding_set_mac(b, mac_string);
+                }
+                goto next;
+            }
+        }
+
+        /* Add new IP-MAC binding for this logical port. */
+        b = sbrec_mac_binding_insert(ctx->ovnsb_idl_txn);
+        sbrec_mac_binding_set_logical_port(b, pa->logical_port);
+        sbrec_mac_binding_set_ip(b, ip_string);
+        sbrec_mac_binding_set_mac(b, mac_string);
+    next:;
+    }
+
+    flush_put_arps();
+}
+
+static void
+wait_put_arps(struct controller_ctx *ctx)
+{
+    if (ctx->ovnsb_idl_txn && n_put_arps) {
+        poll_immediate_wake();
+    }
 }
diff --git a/ovn/controller/pinctrl.h b/ovn/controller/pinctrl.h
index 65d5dfe..7938a5f 100644
--- a/ovn/controller/pinctrl.h
+++ b/ovn/controller/pinctrl.h
@@ -21,14 +21,15 @@
 
 #include "meta-flow.h"
 
+struct lport_index;
 struct ovsrec_bridge;
 struct controller_ctx;
 
 /* Interface for OVN main loop. */
 void pinctrl_init(void);
-void pinctrl_run(struct controller_ctx *ctx,
+void pinctrl_run(struct controller_ctx *, const struct lport_index *,
                  const struct ovsrec_bridge *br_int);
-void pinctrl_wait(void);
+void pinctrl_wait(struct controller_ctx *);
 void pinctrl_destroy(void);
 
 #endif /* ovn/dhcp.h */
diff --git a/ovn/lib/actions.c b/ovn/lib/actions.c
index e7dea8d..8cf0032 100644
--- a/ovn/lib/actions.c
+++ b/ovn/lib/actions.c
@@ -215,6 +215,167 @@ parse_arp_action(struct action_context *ctx)
     add_prerequisite(ctx, "ip4");
 }
 
+static bool
+action_force_match(struct action_context *ctx, enum lex_type t)
+{
+    if (lexer_match(ctx->lexer, t)) {
+        return true;
+    } else {
+        struct lex_token token = { .type = t };
+        struct ds s = DS_EMPTY_INITIALIZER;
+        lex_token_format(&token, &s);
+
+        action_syntax_error(ctx, "expecting `%s'", ds_cstr(&s));
+
+        ds_destroy(&s);
+
+        return false;
+    }
+}
+
+static bool
+action_parse_field(struct action_context *ctx,
+                   int n_bits, struct mf_subfield *sf)
+{
+    struct expr *prereqs;
+    char *error;
+
+    error = expr_parse_field(ctx->lexer, n_bits, false, ctx->ap->symtab, sf,
+                             &prereqs);
+    if (error) {
+        action_error(ctx, "%s", error);
+        return false;
+    }
+
+    ctx->prereqs = expr_combine(EXPR_T_AND, ctx->prereqs, prereqs);
+    return true;
+}
+
+static void
+init_stack(struct ofpact_stack *stack, enum mf_field_id field)
+{
+    stack->subfield.field = mf_from_id(field);
+    stack->subfield.ofs = 0;
+    stack->subfield.n_bits = stack->subfield.field->n_bits;
+}
+
+struct arg {
+    const struct mf_subfield *src;
+    enum mf_field_id dst;
+};
+
+static void
+setup_args(struct action_context *ctx,
+           const struct arg args[], size_t n_args)
+{
+    /* 1. Save all of the destinations that will be modified. */
+    for (const struct arg *a = args; a < &args[n_args]; a++) {
+        ovs_assert(a->src->n_bits == mf_from_id(a->dst)->n_bits);
+        if (a->src->field->id != a->dst) {
+            init_stack(ofpact_put_STACK_PUSH(ctx->ofpacts), a->dst);
+        }
+    }
+
+    /* 2. Push the sources, in reverse order. */
+    for (size_t i = n_args - 1; i < n_args; i--) {
+        const struct arg *a = &args[i];
+        if (a->src->field->id != a->dst) {
+            ofpact_put_STACK_PUSH(ctx->ofpacts)->subfield = *a->src;
+        }
+    }
+
+    /* 3. Pop the sources into the destinations. */
+    for (const struct arg *a = args; a < &args[n_args]; a++) {
+        if (a->src->field->id != a->dst) {
+            init_stack(ofpact_put_STACK_POP(ctx->ofpacts), a->dst);
+        }
+    }
+}
+
+static void
+restore_args(struct action_context *ctx,
+             const struct arg args[], size_t n_args)
+{
+    for (size_t i = n_args - 1; i < n_args; i--) {
+        const struct arg *a = &args[i];
+        if (a->src->field->id != a->dst) {
+            init_stack(ofpact_put_STACK_POP(ctx->ofpacts), a->dst);
+        }
+    }
+}
+
+static void
+put_load(uint64_t value, enum mf_field_id dst, int ofs, int n_bits,
+         struct ofpbuf *ofpacts)
+{
+    struct ofpact_set_field *sf = ofpact_put_SET_FIELD(ofpacts);
+    sf->field = mf_from_id(dst);
+    sf->flow_has_vlan = false;
+
+    ovs_be64 n_value = htonll(value);
+    bitwise_copy(&n_value, 8, 0, &sf->value, sf->field->n_bytes, ofs, n_bits);
+    bitwise_one(&sf->mask, sf->field->n_bytes, ofs, n_bits);
+}
+
+static void
+parse_get_arp_action(struct action_context *ctx)
+{
+    struct mf_subfield port, ip;
+
+    if (!action_force_match(ctx, LEX_T_LPAREN)
+        || !action_parse_field(ctx, 0, &port)
+        || !action_force_match(ctx, LEX_T_COMMA)
+        || !action_parse_field(ctx, 32, &ip)
+        || !action_force_match(ctx, LEX_T_RPAREN)) {
+        return;
+    }
+
+    const struct arg args[] = {
+        { &port, MFF_LOG_OUTPORT },
+        { &ip, MFF_REG0 },
+    };
+    setup_args(ctx, args, ARRAY_SIZE(args));
+
+    put_load(0, MFF_ETH_DST, 0, 48, ctx->ofpacts);
+    emit_resubmit(ctx, ctx->ap->arp_ptable);
+
+    restore_args(ctx, args, ARRAY_SIZE(args));
+}
+
+static void
+parse_put_arp_action(struct action_context *ctx)
+{
+    struct mf_subfield port, ip, mac;
+
+    if (!action_force_match(ctx, LEX_T_LPAREN)
+        || !action_parse_field(ctx, 0, &port)
+        || !action_force_match(ctx, LEX_T_COMMA)
+        || !action_parse_field(ctx, 32, &ip)
+        || !action_force_match(ctx, LEX_T_COMMA)
+        || !action_parse_field(ctx, 48, &mac)
+        || !action_force_match(ctx, LEX_T_RPAREN)) {
+        return;
+    }
+
+    const struct arg args[] = {
+        { &port, MFF_LOG_INPORT },
+        { &ip, MFF_REG1 },
+        { &mac, MFF_ETH_SRC }
+    };
+    setup_args(ctx, args, ARRAY_SIZE(args));
+
+    init_stack(ofpact_put_STACK_PUSH(ctx->ofpacts), MFF_REG0);
+    put_load(ACTION_OPCODE_PUT_ARP, MFF_REG0, 0, 32, ctx->ofpacts);
+
+    struct ofpact_controller *oc = ofpact_put_CONTROLLER(ctx->ofpacts);
+    oc->max_len = UINT16_MAX;
+    oc->controller_id = 0;
+    oc->reason = OFPR_PACKET_OUT;
+
+    init_stack(ofpact_put_STACK_POP(ctx->ofpacts), MFF_REG0);
+    restore_args(ctx, args, ARRAY_SIZE(args));
+}
+
 static void
 emit_ct(struct action_context *ctx, bool recirc_next, bool commit)
 {
@@ -273,6 +434,10 @@ parse_action(struct action_context *ctx)
         emit_ct(ctx, false, true);
     } else if (lexer_match_id(ctx->lexer, "arp")) {
         parse_arp_action(ctx);
+    } else if (lexer_match_id(ctx->lexer, "get_arp")) {
+        parse_get_arp_action(ctx);
+    } else if (lexer_match_id(ctx->lexer, "put_arp")) {
+        parse_put_arp_action(ctx);
     } else {
         action_syntax_error(ctx, "expecting action");
     }
diff --git a/ovn/lib/actions.h b/ovn/lib/actions.h
index 6ca15c4..0fa59b2 100644
--- a/ovn/lib/actions.h
+++ b/ovn/lib/actions.h
@@ -27,6 +27,16 @@ struct ofpbuf;
 struct shash;
 struct simap;
 
+/* put_arp(port, ip, mac) is implemented by sending a packet to the controller.
+ * Arguments are passed through the packet metadata and data, as follows:
+ *
+ *     MFF_REG0 = ACTION_OPCODE_PUT_ARP, to identify the operation.
+ *     MFF_REG1 = ip
+ *     MFF_LOG_INPORT = port
+ *     MFF_ETH_SRC = mac
+ */
+#define ACTION_OPCODE_PUT_ARP 0xbd9c9810
+
 struct action_params {
     /* A table of "struct expr_symbol"s to support (as one would provide to
      * expr_parse()). */
@@ -62,6 +72,7 @@ struct action_params {
     uint8_t first_ptable;       /* First OpenFlow table. */
     uint8_t cur_ltable;         /* 0 <= cur_ltable < n_tables. */
     uint8_t output_ptable;      /* OpenFlow table for 'output' to resubmit. */
+    uint8_t arp_ptable;         /* OpenFlow table for 'get_arp' to resubmit. */
 };
 
 char *actions_parse(struct lexer *, const struct action_params *,
diff --git a/ovn/lib/expr.c b/ovn/lib/expr.c
index 54e3085..007708c 100644
--- a/ovn/lib/expr.c
+++ b/ovn/lib/expr.c
@@ -2870,3 +2870,56 @@ expr_parse_assignment(struct lexer *lexer, const struct shash *symtab,
     *prereqsp = prereqs;
     return ctx.error;
 }
+
+char *
+expr_parse_field(struct lexer *lexer, int n_bits, bool rw,
+                 const struct shash *symtab,
+                 struct mf_subfield *sf, struct expr **prereqsp)
+{
+    struct expr *prereqs = NULL;
+    struct expr_context ctx;
+    ctx.lexer = lexer;
+    ctx.symtab = symtab;
+    ctx.error = NULL;
+    ctx.not = false;
+
+    struct expr_field field;
+    if (!parse_field(&ctx, &field)) {
+        goto exit;
+    }
+
+    const struct expr_field orig_field = field;
+    if (!expand_symbol(&ctx, rw, &field, &prereqs)) {
+        goto exit;
+    }
+    ovs_assert(field.n_bits == orig_field.n_bits);
+
+    if (n_bits != field.n_bits) {
+        if (n_bits && field.n_bits) {
+            expr_error(&ctx, "Cannot use %d-bit field %s[%d..%d] "
+                       "where %d-bit field is required.",
+                       orig_field.n_bits, orig_field.symbol->name,
+                       orig_field.ofs, orig_field.ofs + orig_field.n_bits - 1,
+                       n_bits);
+        } else if (n_bits) {
+            expr_error(&ctx, "Cannot use string field %s where numeric "
+                       "field is required.",
+                       orig_field.symbol->name);
+        } else {
+            expr_error(&ctx, "Cannot use numeric field %s where string "
+                       "field is required.",
+                       orig_field.symbol->name);
+        }
+    }
+
+exit:
+    if (!ctx.error) {
+        mf_subfield_from_expr_field(&field, sf);
+        *prereqsp = prereqs;
+    } else {
+        memset(sf, 0, sizeof *sf);
+        expr_destroy(prereqs);
+        *prereqsp = NULL;
+    }
+    return ctx.error;
+}
diff --git a/ovn/lib/expr.h b/ovn/lib/expr.h
index 7d17489..4fe402f 100644
--- a/ovn/lib/expr.h
+++ b/ovn/lib/expr.h
@@ -386,5 +386,8 @@ char *expr_parse_assignment(struct lexer *lexer, const struct shash *symtab,
                                                 unsigned int *portp),
                             const void *aux, struct ofpbuf *ofpacts,
                             struct expr **prereqsp);
+char *expr_parse_field(struct lexer *, int n_bits, bool rw,
+                       const struct shash *symtab, struct mf_subfield *,
+                       struct expr **prereqsp);
 
 #endif /* ovn/expr.h */
diff --git a/ovn/northd/ovn-northd.8.xml b/ovn/northd/ovn-northd.8.xml
index fa7675b..0adb8b2 100644
--- a/ovn/northd/ovn-northd.8.xml
+++ b/ovn/northd/ovn-northd.8.xml
@@ -371,12 +371,12 @@ next;
 
       <li>
         <p>
-          ARP reply.  These flows reply to ARP requests for the router's own IP
-          address.  For each router port <var>P</var> that owns IP address
-          <var>A</var> and Ethernet address <var>E</var>, a priority-90 flow
-          matches <code>inport == <var>P</var> &amp;&amp; arp.tpa ==
-          <var>A</var> &amp;&amp; arp.op == 1</code> (ARP request) with the
-          following actions:
+          Reply to ARP requests.  These flows reply to ARP requests for the
+          router's own IP address.  For each router port <var>P</var> that owns
+          IP address <var>A</var> and Ethernet address <var>E</var>, a
+          priority-90 flow matches <code>inport == <var>P</var> &amp;&amp;
+          arp.op == 1 &amp;&amp; arp.tpa == <var>A</var></code> (ARP request)
+          with the following actions:
         </p>
 
         <pre>
@@ -394,6 +394,13 @@ output;
       </li>
 
       <li>
+        ARP reply handling.  These flows use ARP replies to populate the
+        logical router's ARP table.  A priority-90 flow with match <code>arp.op
+        == 2</code> has actions <code>put_arp(inport, arp.spa,
+        arp.sha);</code>.
+      </li>
+
+      <li>
         <p>
           UDP port unreachable.  Priority-80 flows generate ICMP port
           unreachable messages in reply to UDP datagrams directed to the
@@ -517,7 +524,10 @@ icmp4 {
       to the address in <code>ip4.dst</code>.  This table implements IP
       routing, setting <code>reg0</code> to the next-hop IP address (leaving
       <code>ip4.dst</code>, the packet's final destination, unchanged) and
-      advances to the next table for ARP resolution.
+      advances to the next table for ARP resolution.  It also sets
+      <code>reg1</code> to the IP address owned by the selected router port
+      (which is used later in table 4 as the IP source address for an ARP
+      request, if needed).
     </p>
 
     <p>
@@ -528,7 +538,9 @@ icmp4 {
       <li>
         <p>
           Routing table.  For each route to IPv4 network <var>N</var> with
-          netmask <var>M</var>, a logical flow with match <code>ip4.dst ==
+          netmask <var>M</var>, on router port <var>P</var> with IP address
+          <var>A</var> and Ethernet
+          address <var>E</var>, a logical flow with match <code>ip4.dst ==
           <var>N</var>/<var>M</var></code>, whose priority is the number of
           1-bits in <var>M</var>, has the following actions:
         </p>
@@ -536,6 +548,9 @@ icmp4 {
         <pre>
 ip.ttl--;
 reg0 = <var>G</var>;
+reg1 = <var>A</var>;
+eth.src = <var>E</var>;
+outport = <var>P</var>;
 next;
         </pre>
 
@@ -594,64 +609,73 @@ icmp4 {
     <ul>
       <li>
         <p>
-          Known MAC bindings.  For each IP address <var>A</var> whose host is
-          known to have Ethernet address <var>HE</var> and reside on router
-          port <var>P</var> with Ethernet address <var>PE</var>, a priority-200
-          flow with match <code>reg0 == <var>A</var></code> has the following
-          actions:
+          Static MAC bindings.  MAC bindings can be known statically based on
+          data in the <code>OVN_Northbound</code> database.  For router ports
+          connected to logical switches, MAC bindings can be known statically
+          from the <code>addresses</code> column in the
+          <code>Logical_Port</code> table.  For router ports connected to other
+          logical routers, MAC bindings can be known statically from the
+          <code>mac</code> and <code>network</code> columns in the
+          <code>Logical_Router_Port</code> table.
         </p>
 
-        <pre>
-eth.src = <var>PE</var>;
-eth.dst = <var>HE</var>;
-outport = <var>P</var>;
-output;
-        </pre>
+        <p>
+          For each IP address <var>A</var> whose host is known to have Ethernet
+          address <var>E</var> on router port <var>P</var>, a priority-100 flow
+          with match <code>outport == <var>P</var> &amp;&amp; reg0 ==
+          <var>A</var></code> has actions <code>eth.dst = <var>E</var>;
+          next;</code>.
+        </p>
+      </li>
 
+      <li>
         <p>
-          MAC bindings can be known statically based on data in the
-          <code>OVN_Northbound</code> database.  For router ports connected to
-          logical switches, MAC bindings can be known statically from the
-          <code>addresses</code> column in the <code>Logical_Port</code> table.
-          For router ports connected to other logical routers, MAC bindings can
-          be known statically from the <code>mac</code> and
-          <code>network</code> column in the <code>Logical_Router_Port</code>
-          table.
+          Dynamic MAC bindings.  This flow resolves IP-to-MAC bindings that
+          have become known dynamically through ARP.  (The next table will
+          issue an ARP request for cases where the binding is not yet known.)
+        </p>
+
+        <p>
+          A priority-0 logical flow with match <code>1</code> has actions
+          <code>get_arp(outport, reg0); next;</code>.
         </p>
       </li>
+    </ul>
 
+    <h3>Ingress Table 4: ARP Request</h3>
+
+    <p>
+      In the common case where the Ethernet destination has been resolved, this
+      table outputs the packet.  Otherwise, it composes and sends an ARP
+      request.  It holds the following flows:
+    </p>
+
+    <ul>
       <li>
         <p>
-          Unknown MAC bindings.  For each non-gateway route to IPv4 network
-          <var>N</var> with netmask <var>M</var> on router port <var>P</var>
-          that owns IP address <var>A</var> and Ethernet address <var>E</var>,
-          a logical flow with match <code>ip4.dst ==
-          <var>N</var>/<var>M</var></code>, whose priority is the number of
-          1-bits in <var>M</var>, has the following actions:
+          Unknown MAC address.  A priority-100 flow with match <code>eth.dst ==
+          00:00:00:00:00:00</code> has the following actions:
         </p>
 
         <pre>
 arp {
     eth.dst = ff:ff:ff:ff:ff:ff;
-    eth.src = <var>E</var>;
-    arp.sha = <var>E</var>;
-    arp.tha = 00:00:00:00:00:00;
-    arp.spa = <var>A</var>;
-    arp.tpa = ip4.dst;
+    arp.spa = reg1;
     arp.op = 1;  /* ARP request. */
-    outport = <var>P</var>;
     output;
 };
         </pre>
 
         <p>
-          TBD: How to install MAC bindings when an ARP response comes back.
-          (Implement a "learn" action?)
+          (Ingress table 2 initialized <code>reg1</code> with the IP address
+          owned by <code>outport</code>.)
         </p>
+      </li>
 
-        <p>
-          Not yet implemented.
-        </p>
+      <li>
+        Known MAC address.  A priority-0 flow with match <code>1</code> has
+        actions <code>output;</code>.
       </li>
     </ul>
 
diff --git a/ovn/northd/ovn-northd.c b/ovn/northd/ovn-northd.c
index 270b116..3e25682 100644
--- a/ovn/northd/ovn-northd.c
+++ b/ovn/northd/ovn-northd.c
@@ -99,7 +99,8 @@ enum ovn_stage {
     PIPELINE_STAGE(ROUTER, IN,  ADMISSION,   0, "lr_in_admission")    \
     PIPELINE_STAGE(ROUTER, IN,  IP_INPUT,    1, "lr_in_ip_input")     \
     PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,  2, "lr_in_ip_routing")   \
-    PIPELINE_STAGE(ROUTER, IN,  ARP,         3, "lr_in_arp")          \
+    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE, 3, "lr_in_arp_resolve")  \
+    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST, 4, "lr_in_arp_request")  \
                                                                       \
     /* Logical router egress stages. */                               \
     PIPELINE_STAGE(ROUTER, OUT, DELIVERY,    0, "lr_out_delivery")
@@ -240,6 +241,7 @@ struct ovn_datapath {
     struct ovs_list list;       /* In list of similar records. */
 
     /* Logical router data (digested from nbr). */
+    const struct ovn_port *gateway_port;
     ovs_be32 gateway;
 
     /* Logical switch data. */
@@ -389,17 +391,18 @@ join_datapaths(struct northd_context *ctx, struct hmap *datapaths,
 
         od->gateway = 0;
         if (nbr->default_gw) {
-            ovs_be32 ip, mask;
-            char *error = ip_parse_masked(nbr->default_gw, &ip, &mask);
-            if (error || !ip || mask != OVS_BE32_MAX) {
-                static struct vlog_rate_limit rl
-                    = VLOG_RATE_LIMIT_INIT(5, 1);
+            ovs_be32 ip;
+            if (!ip_parse(nbr->default_gw, &ip) || !ip) {
+                static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
                 VLOG_WARN_RL(&rl, "bad 'gateway' %s", nbr->default_gw);
-                free(error);
             } else {
                 od->gateway = ip;
             }
         }
+
+        /* Set the gateway port to NULL.  If there is a gateway, it will get
+         * filled in as we go through the ports later. */
+        od->gateway_port = NULL;
     }
 }
 
@@ -618,6 +621,18 @@ join_logical_ports(struct northd_context *ctx,
                 op->mac = mac;
 
                 op->od = od;
+
+                /* If 'od' has a gateway and 'op' routes to it... */
+                if (od->gateway && !((op->network ^ od->gateway) & op->mask)) {
+                    /* ...and if 'op' is a longer match than the current
+                     * choice... */
+                    const struct ovn_port *gw = od->gateway_port;
+                    int len = gw ? ip_count_cidr_bits(gw->mask) : 0;
+                    if (ip_count_cidr_bits(op->mask) > len) {
+                        /* ...then it's the default gateway port. */
+                        od->gateway_port = op;
+                    }
+                }
             }
         }
     }
@@ -1297,7 +1312,7 @@ lrport_is_enabled(const struct nbrec_logical_router_port *lrport)
 }
 
 static void
-add_route(struct hmap *lflows, struct ovn_datapath *od,
+add_route(struct hmap *lflows, const struct ovn_port *op,
           ovs_be32 network, ovs_be32 mask, ovs_be32 gateway)
 {
     char *match = xasprintf("ip4.dst == "IP_FMT"/"IP_FMT,
@@ -1310,11 +1325,17 @@ add_route(struct hmap *lflows, struct ovn_datapath *od,
     } else {
         ds_put_cstr(&actions, "ip4.dst");
     }
-    ds_put_cstr(&actions, "; next;");
+    ds_put_format(&actions,
+                  "; "
+                  "reg1 = "IP_FMT"; "
+                  "eth.src = "ETH_ADDR_FMT"; "
+                  "outport = %s; "
+                  "next;",
+                  IP_ARGS(op->ip), ETH_ADDR_ARGS(op->mac), op->json_key);
 
     /* The priority here is calculated to implement longest-prefix-match
      * routing. */
-    ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING,
+    ovn_lflow_add(lflows, op->od, S_ROUTER_IN_IP_ROUTING,
                   count_1bits(ntohl(mask)), match, ds_cstr(&actions));
     ds_destroy(&actions);
     free(match);
@@ -1379,6 +1400,11 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
                       "ip4.dst == 0.0.0.0/8",
                       "drop;");
 
+        /* ARP reply handling.  Use ARP replies to populate the logical
+         * router's ARP table. */
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_INPUT, 90, "arp.op == 2",
+                      "put_arp(inport, arp.spa, arp.sha);");
+
         /* Drop Ethernet local broadcast.  By definition this traffic should
          * not be forwarded.*/
         ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_INPUT, 50,
@@ -1468,23 +1494,24 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
     /* Logical router ingress table 2: IP Routing.
      *
      * A packet that arrives at this table is an IP packet that should be
-     * routed to the address in ip4.dst. This table sets reg0 to the next-hop
-     * IP address (leaving ip4.dst, the packet’s final destination, unchanged)
-     * and advances to the next table for ARP resolution. */
+     * routed to the address in ip4.dst. This table sets outport to the correct
+     * output port, eth.src to the output port's MAC address, and reg0 to the
+     * next-hop IP address (leaving ip4.dst, the packet’s final destination,
+     * unchanged), and advances to the next table for ARP resolution. */
     HMAP_FOR_EACH (op, key_node, ports) {
         if (!op->nbr) {
             continue;
         }
 
-        add_route(lflows, op->od, op->network, op->mask, 0);
+        add_route(lflows, op, op->network, op->mask, 0);
     }
     HMAP_FOR_EACH (od, key_node, datapaths) {
         if (!od->nbr) {
             continue;
         }
 
-        if (od->gateway) {
-            add_route(lflows, od, 0, 0, od->gateway);
+        if (od->gateway && od->gateway_port) {
+            add_route(lflows, od->gateway_port, 0, 0, od->gateway);
         }
     }
     /* XXX destination unreachable */
@@ -1527,16 +1554,15 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
                             continue;
                         }
 
-                        char *match = xasprintf("reg0 == "IP_FMT, IP_ARGS(ip));
-                        char *actions = xasprintf("eth.src = "ETH_ADDR_FMT"; "
-                                                  "eth.dst = "ETH_ADDR_FMT"; "
-                                                  "outport = %s; "
-                                                  "output;",
-                                                  ETH_ADDR_ARGS(peer->mac),
-                                                  ETH_ADDR_ARGS(ea),
-                                                  peer->json_key);
+                        char *match = xasprintf(
+                            "outport == %s && reg0 == "IP_FMT,
+                            peer->json_key, IP_ARGS(ip));
+                        char *actions = xasprintf("eth.dst = "ETH_ADDR_FMT"; "
+                                                  "next;",
+                                                  ETH_ADDR_ARGS(ea));
                         ovn_lflow_add(lflows, peer->od,
-                                      S_ROUTER_IN_ARP, 200, match, actions);
+                                      S_ROUTER_IN_ARP_RESOLVE,
+                                      100, match, actions);
                         free(actions);
                         free(match);
                         break;
@@ -1545,6 +1571,35 @@ build_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
             }
         }
     }
+    HMAP_FOR_EACH (od, key_node, datapaths) {
+        if (!od->nbr) {
+            continue;
+        }
+
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_ARP_RESOLVE, 0, "1",
+                      "get_arp(outport, reg0); next;");
+    }
+
+    /* Logical router ingress table 4: ARP request.
+     *
+     * In the common case where the Ethernet destination has been resolved,
+     * this table outputs the packet (priority 0).  Otherwise, it composes
+     * and sends an ARP request (priority 100). */
+    HMAP_FOR_EACH (od, key_node, datapaths) {
+        if (!od->nbr) {
+            continue;
+        }
+
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_ARP_REQUEST, 100,
+                      "eth.dst == 00:00:00:00:00:00",
+                      "arp { "
+                      "eth.dst = ff:ff:ff:ff:ff:ff; "
+                      "arp.spa = reg1; "
+                      "arp.op = 1; " /* ARP request */
+                      "output; "
+                      "};");
+        ovn_lflow_add(lflows, od, S_ROUTER_IN_ARP_REQUEST, 0, "1", "output;");
+    }
 
     /* Logical router egress table 0: Delivery (priority 100).
      *
diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml
index 318555b..d59bcc6 100644
--- a/ovn/ovn-architecture.7.xml
+++ b/ovn/ovn-architecture.7.xml
@@ -740,32 +740,62 @@
         <code>ovn-controller</code>'s job is to translate them into equivalent
         OpenFlow (in particular it translates the table numbers:
         <code>Logical_Flow</code> tables 0 through 15 become OpenFlow tables 16
-        through 31).  For a given packet, the logical ingress pipeline
-        eventually executes zero or more <code>output</code> actions:
+        through 31).
       </p>
 
-      <ul>
-        <li>
-          If the pipeline executes no <code>output</code> actions at all, the
-          packet is effectively dropped.
-        </li>
-
-        <li>
-          Most commonly, the pipeline executes one <code>output</code> action,
-          which <code>ovn-controller</code> implements by resubmitting the
-          packet to table 32.
-        </li>
-
-        <li>
-          If the pipeline can execute more than one <code>output</code> action,
-          then each one is separately resubmitted to table 32.  This can be
-          used to send multiple copies of the packet to multiple ports.  (If
-          the packet was not modified between the <code>output</code> actions,
-          and some of the copies are destined to the same hypervisor, then
-          using a logical multicast output port would save bandwidth between
-          hypervisors.)
-        </li>
-      </ul>
+      <p>
+        Most OVN actions have fairly obvious implementations in OpenFlow (with
+        OVS extensions), e.g. <code>next;</code> is implemented as
+        <code>resubmit</code>, <code><var>field</var> =
+        <var>constant</var>;</code> as <code>set_field</code>.  A few are worth
+        describing in more detail:
+      </p>
+
+      <dl>
+        <dt><code>output;</code></dt>
+        <dd>
+          Implemented by resubmitting the packet to table 32.  If the pipeline
+          executes more than one <code>output</code> action, then each one is
+          separately resubmitted to table 32.  This can be used to send
+          multiple copies of the packet to multiple ports.  (If the packet was
+          not modified between the <code>output</code> actions, and some of the
+          copies are destined to the same hypervisor, then using a logical
+          multicast output port would save bandwidth between hypervisors.)
+        </dd>
+
+        <dt><code>get_arp(<var>P</var>, <var>A</var>);</code></dt>
+        <dd>
+          <p>
+            Implemented by storing arguments into OpenFlow fields, then
+            resubmitting to table 65, which <code>ovn-controller</code>
+            populates with flows generated from the <code>MAC_Binding</code>
+            table in the OVN Southbound database.  If there is a match in table
+            65, then its actions store the bound MAC in the Ethernet
+            destination address field.
+          </p>
+
+          <p>
+            (The OpenFlow actions save and restore the OpenFlow fields used for
+            the arguments, so that the OVN actions do not have to be aware of
+            this temporary use.)
+          </p>
+        </dd>
+
+        <dt><code>put_arp(<var>P</var>, <var>A</var>, <var>E</var>);</code></dt>
+        <dd>
+          <p>
+            Implemented by storing the arguments into OpenFlow fields, then
+            outputting a packet to <code>ovn-controller</code>, which updates
+            the <code>MAC_Binding</code> table.
+          </p>
+
+          <p>
+            (The OpenFlow actions save and restore the OpenFlow fields used for
+            the arguments, so that the OVN actions do not have to be aware of
+            this temporary use.)
+          </p>
+        </dd>
+      </dl>
     </li>
 
     <li>
diff --git a/ovn/ovn-sb.ovsschema b/ovn/ovn-sb.ovsschema
index a9a91e5..ead733b 100644
--- a/ovn/ovn-sb.ovsschema
+++ b/ovn/ovn-sb.ovsschema
@@ -1,7 +1,7 @@
 {
     "name": "OVN_Southbound",
-    "version": "1.0.0",
-    "cksum": "1392129391 5060",
+    "version": "1.1.0",
+    "cksum": "1223981720 5320",
     "tables": {
         "Chassis": {
             "columns": {
@@ -99,6 +99,11 @@
                                  "min": 0,
                                  "max": "unlimited"}}},
             "indexes": [["datapath", "tunnel_key"], ["logical_port"]],
-            "isRoot": true}
-    }
-}
+            "isRoot": true},
+        "MAC_Binding": {
+            "columns": {
+                "logical_port": {"type": "string"},
+                "ip": {"type": "string"},
+                "mac": {"type": "string"}},
+            "indexes": [["logical_port", "ip"]],
+            "isRoot": true}}}
diff --git a/ovn/ovn-sb.xml b/ovn/ovn-sb.xml
index 8ce2e74..7222c37 100644
--- a/ovn/ovn-sb.xml
+++ b/ovn/ovn-sb.xml
@@ -17,7 +17,7 @@
   <h2>Database Structure</h2>
 
   <p>
-    The OVN Southbound database contains three classes of data with
+    The OVN Southbound database contains classes of data with
     different properties, as described in the sections below.
   </p>
 
@@ -77,17 +77,17 @@
     data.
   </p>
 
-  <h3>Bindings data</h3>
+  <h3>Logical-physical bindings</h3>
 
   <p>
-    Bindings data link logical and physical components.  They show the current
+    These tables link logical and physical components.  They show the current
     placement of logical components (such as VMs and VIFs) onto chassis, and
     map logical entities to the values that represent them in tunnel
     encapsulations.
   </p>
 
   <p>
-    Bindings change frequently, at least every time a VM powers up or down
+    These tables change frequently, at least every time a VM powers up or down
     or migrates, and especially quickly in a container environment.  The
     amount of data per VM (or VIF) is small.
   </p>
@@ -103,6 +103,17 @@
     contain binding data.
   </p>
 
+  <h3>MAC bindings</h3>
+
+  <p>
+    The <ref table="MAC_Binding"/> table tracks the bindings from IP addresses
+    to Ethernet addresses that are dynamically discovered using ARP (for IPv4)
+    and neighbor discovery (for IPv6).  Usually, IP-to-MAC bindings for virtual
+    machines are statically populated into the <ref table="Port_Binding"/>
+    table, so <ref table="MAC_Binding"/> is primarily used to discover bindings
+    on physical networks.
+  </p>
+
   <h2>Common Columns</h2>
 
   <p>
@@ -942,6 +953,43 @@
           <p><b>Prerequisite:</b> <code>ip4</code></p>
         </dd>
 
+        <dt><code>get_arp(<var>P</var>, <var>A</var>);</code></dt>
+
+        <dd>
+          <p>
+            <b>Parameters</b>: logical port string field <var>P</var>, 32-bit
+            IP address field <var>A</var>.
+          </p>
+
+          <p>
+            Looks up <var>A</var> in <var>P</var>'s ARP table.  If an entry is
+            found, stores its Ethernet address in <code>eth.dst</code>,
+            otherwise stores <code>00:00:00:00:00:00</code> in
+            <code>eth.dst</code>.
+          </p>
+
+          <p><b>Example:</b> <code>get_arp(outport, ip4.dst);</code></p>
+        </dd>
+
+        <dt>
+          <code>put_arp(<var>P</var>, <var>A</var>, <var>E</var>);</code>
+        </dt>
+
+        <dd>
+          <p>
+            <b>Parameters</b>: logical port string field <var>P</var>, 32-bit
+            IP address field <var>A</var>, 48-bit Ethernet address field
+            <var>E</var>.
+          </p>
+
+          <p>
+            Adds or updates the entry for IP address <var>A</var> in logical
+            port <var>P</var>'s ARP table, setting its Ethernet address to
+            <var>E</var>.
+          </p>
+
+          <p><b>Example:</b> <code>put_arp(inport, arp.spa, arp.sha);</code></p>
+        </dd>
       </dl>
 
       <p>
@@ -1334,4 +1382,85 @@ tcp.flags = RST;
       </column>
     </group>
   </table>
+
+  <table name="MAC_Binding" title="IP to MAC bindings">
+    <p>
+      Each row in this table specifies a binding from an IP address to an
+      Ethernet address that has been discovered through ARP (for IPv4) or
+      neighbor discovery (for IPv6).  This table is primarily used to discover
+      bindings on physical networks, because IP-to-MAC bindings for virtual
+      machines are usually populated statically into the <ref
+      table="Port_Binding"/> table.
+    </p>
+
+    <p>
+      This table expresses a functional relationship: <ref
+      table="MAC_Binding"/>(<ref column="logical_port"/>, <ref column="ip"/>) =
+      <ref column="mac"/>.
+    </p>
+
+    <p>
+      In outline, the lifetime of a logical router's MAC binding looks like
+      this:
+    </p>
+
+    <ol>
+      <li>
+        On hypervisor 1, a logical router determines that a packet should be
+        forwarded to IP address <var>A</var> on one of its router ports.  It
+        uses its logical flow table to determine that <var>A</var> lacks a
+        static IP-to-MAC binding and the <code>get_arp</code> action to
+        determine that it lacks a dynamic IP-to-MAC binding.
+      </li>
+
+      <li>
+        Using an OVN logical <code>arp</code> action, the logical router
+        generates and sends a broadcast ARP request to the router port.  It
+        drops the IP packet.
+      </li>
+
+      <li>
+        The logical switch attached to the router port delivers the ARP request
+        to all of its ports.  (It might make sense to deliver it only to ports
+        that have no static IP-to-MAC bindings, but this could also be
+        surprising behavior.)
+      </li>
+
+      <li>
+        A host or VM on hypervisor 2 (which might be the same as hypervisor 1)
+        attached to the logical switch owns the IP address in question.  It
+        composes an ARP reply and unicasts it to the logical router port's
+        Ethernet address.
+      </li>
+
+      <li>
+        The logical switch delivers the ARP reply to the logical router port.
+      </li>
+
+      <li>
+        The logical router flow table executes a <code>put_arp</code> action.
+        To record the IP-to-MAC binding, <code>ovn-controller</code> adds a row
+        to the <ref table="MAC_Binding"/> table.
+      </li>
+
+      <li>
+        On hypervisor 1, <code>ovn-controller</code> receives the updated <ref
+        table="MAC_Binding"/> table from the OVN southbound database.  The next
+        packet destined to <var>A</var> through the logical router is sent
+        directly to the bound Ethernet address.
+      </li>
+    </ol>
+
+    <column name="logical_port">
+      The logical port on which the binding was discovered.
+    </column>
+
+    <column name="ip">
+      The bound IP address.
+    </column>
+
+    <column name="mac">
+      The Ethernet address to which the IP is bound.
+    </column>
+  </table>
 </database>
diff --git a/ovn/utilities/ovn-sbctl.c b/ovn/utilities/ovn-sbctl.c
index cf3c559..3f381b1 100644
--- a/ovn/utilities/ovn-sbctl.c
+++ b/ovn/utilities/ovn-sbctl.c
@@ -772,6 +772,10 @@ static const struct ctl_table_class tables[] = {
      {{&sbrec_table_port_binding, &sbrec_port_binding_col_logical_port, NULL},
       {NULL, NULL, NULL}}},
 
+    {&sbrec_table_mac_binding,
+     {{&sbrec_table_mac_binding, &sbrec_mac_binding_col_logical_port, NULL},
+      {NULL, NULL, NULL}}},
+
     {NULL, {{NULL, NULL, NULL}, {NULL, NULL, NULL}}}
 };
 
diff --git a/tests/ovn.at b/tests/ovn.at
index de56c9f..26d26ac 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -510,6 +510,21 @@ ct_commit; => actions=ct(commit,zone=NXM_NX_REG5[0..15]), prereqs=ip
 # arp
 arp { eth.dst = ff:ff:ff:ff:ff:ff; output; }; => actions=arp(set_field:ff:ff:ff:ff:ff:ff->eth_dst,resubmit(,64)), prereqs=ip4
 
+# get_arp
+get_arp(outport, ip4.dst); => actions=push:NXM_NX_REG0[],push:NXM_OF_IP_DST[],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,65),pop:NXM_NX_REG0[], prereqs=eth.type == 0x800
+get_arp(inport, reg0); => actions=push:NXM_NX_REG7[],push:NXM_NX_REG0[],push:OXM_OF_PKT_REG0[32..63],push:NXM_NX_REG6[],pop:NXM_NX_REG7[],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,65),pop:NXM_NX_REG0[],pop:NXM_NX_REG7[], prereqs=1
+get_arp; => Syntax error at `;' expecting `('.
+get_arp(); => Syntax error at `)' expecting field name.
+get_arp(inport); => Syntax error at `)' expecting `,'.
+get_arp(inport ip4.dst); => Syntax error at `ip4.dst' expecting `,'.
+get_arp(inport, ip4.dst; => Syntax error at `;' expecting `)'.
+get_arp(inport, eth.dst); => Cannot use 48-bit field eth.dst[0..47] where 32-bit field is required.
+get_arp(inport, outport); => Cannot use string field outport where numeric field is required.
+get_arp(reg0, ip4.dst); => Cannot use numeric field reg0 where string field is required.
+
+# put_arp
+put_arp(inport, arp.spa, arp.sha); => actions=push:NXM_NX_REG1[],push:NXM_OF_ETH_SRC[],push:NXM_NX_ARP_SHA[],push:NXM_OF_ARP_SPA[],pop:NXM_NX_REG1[],pop:NXM_OF_ETH_SRC[],push:NXM_NX_REG0[],set_field:0xbd9c9810->reg0,controller(reason=packet_out),pop:NXM_NX_REG0[],pop:NXM_OF_ETH_SRC[],pop:NXM_NX_REG1[], prereqs=eth.type == 0x806 && eth.type == 0x806
+
 # Contradictionary prerequisites (allowed but not useful):
 ip4.src = ip6.src[0..31]; => actions=move:NXM_NX_IPV6_SRC[0..31]->NXM_OF_IP_SRC[], prereqs=eth.type == 0x800 && eth.type == 0x86dd
 ip4.src <-> ip6.src[0..31]; => actions=push:NXM_NX_IPV6_SRC[0..31],push:NXM_OF_IP_SRC[],pop:NXM_NX_IPV6_SRC[0..31],pop:NXM_OF_IP_SRC[], prereqs=eth.type == 0x800 && eth.type == 0x86dd
@@ -930,9 +945,13 @@ for i in 1 2 3; do
     ovn-nbctl lswitch-add ls$i
     for j in 1 2 3; do
         for k in 1 2 3; do
+	    # Add "unknown" to MAC addresses for lp?11, so packets for
+	    # MAC-IP bindings discovered via ARP later have somewhere to go.
+	    if test $j$k = 11; then unknown=unknown; else unknown=; fi
+
 	    ovn-nbctl \
 		-- lport-add ls$i lp$i$j$k \
-		-- lport-set-addresses lp$i$j$k "f0:00:00:00:0$i:$j$k 192.168.$i$j.$k"
+		-- lport-set-addresses lp$i$j$k "f0:00:00:00:0$i:$j$k 192.168.$i$j.$k" $unknown
         done
     done
 done
@@ -1022,7 +1041,7 @@ sleep 1
 # content has Ethernet destination DST and source SRC (each exactly 12 hex
 # digits) and Ethernet type ETHTYPE (4 hex digits).  The OUTPORTs (zero or
 # more) list the VIFs on which the packet should be received.  INPORT and the
-# OUTPORTs are specified as lport numbers, e.g. 11 for vif11.
+# OUTPORTs are specified as lport numbers, e.g. 123 for vif123.
 trim_zeros() {
     sed 's/\(00\)\{1,\}$//'
 }
@@ -1096,6 +1115,48 @@ for is in 1 2 3; do
     done
 done
 
+# 3. Send an IP packet from every logical port to every other subnet,
+#    to an IP address that does not have a static IP-MAC binding.
+#    This should generate a broadcast ARP request for the destination
+#    IP address in the destination subnet.
+for is in 1 2 3; do
+    for js in 1 2 3; do
+	for ks in 1 2 3; do
+	    s=$is$js$ks
+	    smac=f00000000$s
+	    sip=`ip_to_hex 192 168 $is$js $ks`
+	    for id in 1 2 3; do
+		for jd in 1 2 3; do
+		    if test $is$js = $id$jd; then
+		        continue
+                    fi
+
+		    # Send the packet.
+		    dmac=00000000ff$is$js
+		    # Calculate a 4th octet for the destination that is
+		    # unique per $s, avoids the .1 .2 .3 and .254 IP addresses
+		    # that have static MAC bindings, and fits in the range
+		    # 0-255.
+		    o4=`expr $is '*' 9 + $js '*' 3 + $ks + 10`
+		    dip=`ip_to_hex 192 168 $id$jd $o4`
+		    test_ip $s $smac $dmac $sip $dip
+
+		    # Every LP on the destination subnet's lswitch should
+		    # receive the ARP request.
+		    lrmac=00000000ff$id$jd
+		    lrip=`ip_to_hex 192 168 $id$jd 254`
+		    arp=ffffffffffff${lrmac}08060001080006040001${lrmac}${lrip}000000000000${dip}
+		    for jd2 in 1 2 3; do
+                        for kd in 1 2 3; do
+		            echo $arp | trim_zeros >> $id$jd2$kd.expected
+			done
+		    done
+		done
+            done
+        done
+    done
+done
+
 # test_arp INPORT SHA SPA TPA [REPLY_HA]
 #
 # Causes a packet to be received on INPORT.  The packet is an ARP
@@ -1119,7 +1180,7 @@ test_arp() {
     local j k
     for j in 1 2 3; do
         for k in 1 2 3; do
-            # 192.168.33.254 is configured to the lswtich patch port for lrp33,
+            # 192.168.33.254 is configured to the lswitch patch port for lrp33,
             # so no ARP flooding expected for it.
             if test $i$j$k != $inport && test $tpa != `ip_to_hex 192 168 33 254`; then
                 echo $request >> $i$j$k.expected
@@ -1137,14 +1198,14 @@ test_arp() {
 
 # Test router replies to ARP requests from all source ports:
 #
-# 3. Router replies to query for its MAC address from port's own IP address.
+# 4. Router replies to query for its MAC address from port's own IP address.
 #
-# 4. Router replies to query for its MAC address from any random IP address
+# 5. Router replies to query for its MAC address from any random IP address
 #    in its subnet.
 #
-# 5. Router replies to query for its MAC address from another subnet.
+# 6. Router replies to query for its MAC address from another subnet.
 #
-# 6. No reply to query for IP address other than router IP.
+# 7. No reply to query for IP address other than router IP.
 for i in 1 2 3; do
     for j in 1 2 3; do
         for k in 1 2 3; do
@@ -1153,13 +1214,97 @@ for i in 1 2 3; do
 	    rip=`ip_to_hex 192 168 $i$j 254`   # Router IP
 	    rmac=00000000ff$i$j                # Router MAC
 	    otherip=`ip_to_hex 192 168 $i$j 55` # Some other IP in subnet
-	    test_arp $i$j$k $smac $sip        $rip        $rmac #3
-	    test_arp $i$j$k $smac $otherip    $rip        $rmac #4
-	    test_arp $i$j$k $smac 0a123456    $rip        $rmac #5
-	    test_arp $i$j$k $smac $sip        $otherip          #6
+	    test_arp $i$j$k $smac $sip        $rip        $rmac #4
+	    test_arp $i$j$k $smac $otherip    $rip        $rmac #5
+	    test_arp $i$j$k $smac 0a123456    $rip        $rmac #6
+	    test_arp $i$j$k $smac $sip        $otherip          #7
+        done
+    done
+done
+
+# Allow some time for packet forwarding.
+# XXX This can be improved.
+sleep 1
+
+# Generate an ARP reply for each of the IP addresses ARPed for earlier as #3.
+: > mac_bindings.expected
+for is in 1 2 3; do
+    for js in 1 2 3; do
+	for ks in 1 2 3; do
+	    s=$is$js$ks
+	    for id in 1 2 3; do
+		for jd in 1 2 3; do
+		    if test $is$js = $id$jd; then
+		        continue
+                    fi
+
+		    kd=1
+		    d=$id$jd$kd
+
+		    o4=`expr $is '*' 9 + $js '*' 3 + $ks + 10`
+		    host_ip=`ip_to_hex 192 168 $id$jd $o4`
+		    host_mac=8000000000$o4
+
+		    lrmac=00000000ff$id$jd
+		    lrip=`ip_to_hex 192 168 $id$jd 254`
+
+		    arp=${lrmac}${host_mac}08060001080006040002${host_mac}${host_ip}${lrmac}${lrip}
+
+                    hv=hv`vif_to_hv $d`
+		    as $hv ovs-appctl netdev-dummy/receive vif$d $arp
+
+		    host_ip_pretty=192.168.$id$jd.$o4
+		    host_mac_pretty=80:00:00:00:00:$o4
+		    echo lrp$id$jd,$host_ip_pretty,$host_mac_pretty >> mac_bindings.expected
+		done
+            done
+        done
+    done
+done
+
+# Allow some time for packet forwarding.
+# XXX This can be improved.
+sleep 1
+
+# 8. Send an IP packet from every logical port to every other subnet.  These
+#    are the same packets already sent as #3, but now the destinations' IP-MAC
+#    bindings have been discovered via ARP, so instead of provoking an ARP
+#    request, these packets now get routed to their destinations (which don't
+#    have static MAC bindings, so they go to the port we've designated as
+#    accepting "unknown" MACs.)
+for is in 1 2 3; do
+    for js in 1 2 3; do
+	for ks in 1 2 3; do
+	    s=$is$js$ks
+	    smac=f00000000$s
+	    sip=`ip_to_hex 192 168 $is$js $ks`
+	    for id in 1 2 3; do
+		for jd in 1 2 3; do
+		    if test $is$js = $id$jd; then
+		        continue
+                    fi
+
+		    # Send the packet.
+		    dmac=00000000ff$is$js
+		    # Calculate a 4th octet for the destination that is
+		    # unique per $s, avoids the .1 .2 .3 and .254 IP addresses
+		    # that have static MAC bindings, and fits in the range
+		    # 0-255.
+		    o4=`expr $is '*' 9 + $js '*' 3 + $ks + 10`
+		    dip=`ip_to_hex 192 168 $id$jd $o4`
+		    test_ip $s $smac $dmac $sip $dip
+
+		    # Expect the packet egress.
+		    host_mac=8000000000$o4
+		    outport=${id}11
+                    out_lrp=$id$jd
+                    echo ${host_mac}00000000ff${out_lrp}08004500001c00000000"3f1101"00${sip}${dip}0035111100080000 | trim_zeros >> $outport.expected
+		done
+            done
         done
     done
 done
+
 # Allow some time for packet forwarding.
 # XXX This can be improved.
 sleep 1
@@ -1177,4 +1322,11 @@ for i in 1 2 3; do
         done
     done
 done
+
+# Check the MAC bindings against those expected.
+sort < mac_bindings.expected > expout
+AT_CHECK([ovn-sbctl -f csv -d bare --no-heading \
+	      -- --columns=logical_port,ip,mac list mac_binding | sort], [0],
+	 [expout])
+
 AT_CLEANUP
diff --git a/tests/test-ovn.c b/tests/test-ovn.c
index 3e43cd8..0e01571 100644
--- a/tests/test-ovn.c
+++ b/tests/test-ovn.c
@@ -1249,6 +1249,7 @@ test_parse_actions(struct ovs_cmdl_context *ctx OVS_UNUSED)
             .first_ptable = 16,
             .cur_ltable = 10,
             .output_ptable = 64,
+            .arp_ptable = 65,
         };
         error = actions_parse_string(ds_cstr(&input), &ap, &ofpacts, &prereqs);
         if (!error) {
-- 
2.1.3
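
Reviewer's note on the tests/ovn.at changes above: the ARP-reply loop packs two things into single lines that are easy to misread, the fourth-octet arithmetic and the raw ARP frame layout. The sketch below restates both as small shell functions. The names fourth_octet and arp_reply_hex are hypothetical and not part of the patch, and ip_to_hex is redefined here on the assumption that it prints four decimal octets as eight hex digits, as the test's usage suggests.

```shell
# Assumed equivalent of the test suite's ip_to_hex helper: four decimal
# octets in, eight hex digits out.
ip_to_hex() {
    printf "%02x%02x%02x%02x" "$@"
}

# Hypothetical helper: 4th octet for the host that source port
# $is$js$ks sends to.  With is, js, ks each in 1..3 the result is
# unique per source port and falls in 23..49, so it avoids .1, .2, .3
# and .254 and is always two decimal digits.
fourth_octet() {
    expr "$1" '*' 9 + "$2" '*' 3 + "$3" + 10
}

# Hypothetical helper: Ethernet + ARP reply frame as a hex string, in
# the same field order as the $arp variable in the test.
#   $1 = router port MAC (Ethernet dst, ARP target hardware address)
#   $2 = host MAC        (Ethernet src, ARP sender hardware address)
#   $3 = host IP in hex  (ARP sender protocol address)
#   $4 = router IP in hex (ARP target protocol address)
arp_reply_hex() {
    eth_hdr=${1}${2}0806         # dst MAC, src MAC, EtherType 0x0806 (ARP)
    arp_fixed=0001080006040002   # htype=1, ptype=0x0800, hlen=6, plen=4, op=2 (reply)
    echo ${eth_hdr}${arp_fixed}${2}${3}${1}${4}
}

# Example: reply from the host that port 111 addressed in subnet 12.
o4=`fourth_octet 1 1 1`
arp_reply_hex 00000000ff12 8000000000$o4 \
    `ip_to_hex 192 168 12 $o4` `ip_to_hex 192 168 12 254`
```

As in the test, the decimal value of the fourth octet is reused verbatim as the last two hex digits of the host MAC. That works only because the value always has exactly two digits.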



