[ovs-dev] [PATCH ovn v2 3/3] Add ECMP symmetric replies.

Numan Siddique numans at ovn.org
Fri Jul 24 12:40:40 UTC 2020


On Fri, Jul 24, 2020 at 1:07 AM Mark Michelson <mmichels at redhat.com> wrote:

> When traffic arrives over an ECMP route, there is no guarantee that the
> reply traffic will egress over the same route. Sometimes, the nature of
> the traffic (or the intervening equipment) means that it is important
> for reply traffic to go out the same route it came in.
>
> This commit introduces optional ECMP symmetric reply behavior. If
> configured, then traffic to or from the ECMP route will be sent to
> conntrack. New incoming traffic over the route will have the source MAC
> address and incoming port saved in the ct_label. Reply traffic then uses
> this saved information to send the packet back out the same way it came
> in.
>
> To facilitate this, a new table was added to the ingress logical router
> pipeline. The ECMP_STATEFUL table is responsible for committing to
> conntrack and setting the ct_label when it detects new incoming traffic
> from the route.
>
> Signed-off-by: Mark Michelson <mmichels at redhat.com>
> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1849683


Hi Mark,

There's a checkpatch warning which you want to address:

---
WARNING: Line is 80 characters long (recommended limit is 79)
#228 FILE: northd/ovn-northd.c:7783:
    ds_put_format(&ecmp_reply, "ct.rpl && ct_label.ecmp_reply_port == %"
PRId64,

---
Please note that I didn't do a thorough code review.
I tested this patch using ovn-fake-multi node setup and created the OVN
resources as it's done in system-ovn.at.
In my testing, alice1 is claimed by ovn-chassis-2 and bob1 is claimed by
ovn-chassis-1.
R2 and R3 are claimed on chassis - ovn-chassis-1.

I see few issues. It's not working as expected. I started nc server
listening on port 80 on alice1.

 1. when bob1 connects to alice1 IP, I see that the below logical flow gets
hit on ovn-chassis-1 where bob1 resides
    table=7 (lr_in_ecmp_stateful), priority=100  , match=(inport ==
"R1_join" && ip4.dst == 10.0.0.0/24 && (ct.new && !ct.est)),
action=(ct_commit { ct_label.ecmp_reply_eth = eth.src;
ct_label.ecmp_reply_port = 2;}; next;)

 2. But for the reply traffic from alice1 to bob1, the below expected
logical flow is NOT hit
    table=10(lr_in_ip_routing   ), priority=100  , match=(ct.rpl &&
ct_label.ecmp_reply_port == 2 && ip4.src == 10.0.0.0/24), action=(ip.ttl--;
flags.loopback = 1; eth.src = 00:00:04:01:02:03; reg1 = 20.0.0.1; outport =
"R1_join"; next;)

3. Instead the normal ecmp select flow is hit in ovn-chassis-2 where alice1
resides.

4. #conntrack -L | grep 172.16.0.1 # on ovn-chassis-1
    tcp      6 117 SYN_SENT src=172.16.0.1 dst=10.0.0.2 sport=43644
dport=80 [UNREPLIED] src=10.0.0.2 dst=172.16.0.1 sport=80 dport=43644
mark=0 secctx=system_u:object_r:unlabeled_t:s0 zone=6 use=1

    #ovs-appctl -t ovs-vswitchd dpctl/dump-conntrack | grep 172.16.0.1 # on
ovn-chassis-1
tcp,orig=(src=172.16.0.1,dst=10.0.0.2,sport=43660,dport=80),reply=(src=10.0.0.2,dst=172.16.0.1,sport=80,dport=43660),zone=6,labels=0x2000004010204,protoinfo=(state=SYN_SENT)

The conntrack state is in SYN_SENT state.

I think this needs to be handled properly.

Please see below for a few comments.



>
> ---
>  lib/logical-fields.c      |   4 ++
>  northd/ovn-northd.8.xml   |  49 ++++++++++---
>  northd/ovn-northd.c       | 119 +++++++++++++++++++++++++++----
>  ovn-architecture.7.xml    |   7 +-
>  ovn-nb.ovsschema          |   5 +-
>  ovn-nb.xml                |  14 ++++
>  tests/ovn.at              |  28 ++++----
>  tests/system-ovn.at       | 144 ++++++++++++++++++++++++++++++++++++++
>  utilities/ovn-nbctl.8.xml |  31 ++++++--
>  utilities/ovn-nbctl.c     |  18 ++++-
>  10 files changed, 367 insertions(+), 52 deletions(-)
>
> diff --git a/lib/logical-fields.c b/lib/logical-fields.c
> index 8639523ea..4c129814d 100644
> --- a/lib/logical-fields.c
> +++ b/lib/logical-fields.c
> @@ -127,6 +127,10 @@ ovn_init_symtab(struct shash *symtab)
>
>      expr_symtab_add_field(symtab, "ct_label", MFF_CT_LABEL, NULL, false);
>      expr_symtab_add_subfield(symtab, "ct_label.blocked", NULL,
> "ct_label[0]");
> +    expr_symtab_add_subfield(symtab, "ct_label.ecmp_reply_eth", NULL,
> +                             "ct_label[0..47]");
>

We already have ct_labal.blocked mapped to ct_label[0]. In the patch which I
proposed here - [2], it is using ct_label.natted to ct_label[1].

Even though ct_label.blocked is used in the logical switch pipeline, I
think for clarity
it's better to shift ct_label.ecmp_reply_eth to starting from 32 offset
maybe ?


+    expr_symtab_add_subfield(symtab, "ct_label.ecmp_reply_port", NULL,
> +                             "ct_label[48..63]");
>

Same here.

[2] -
https://patchwork.ozlabs.org/project/openvswitch/patch/20200722192154.9065-1-numans@ovn.org/


Thanks
Numan


>      expr_symtab_add_field(symtab, "ct_state", MFF_CT_STATE, NULL, false);
>
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index eb2514f15..cf251e02a 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -2120,15 +2120,31 @@ icmp6 {
>      <p>
>        This is to send packets to connection tracker for tracking and
>        defragmentation.  It contains a priority-0 flow that simply moves
> traffic
> -      to the next table.  If load balancing rules with virtual IP
> addresses
> -      (and ports) are configured in <code>OVN_Northbound</code> database
> for a
> -      Gateway router, a priority-100 flow is added for each configured
> virtual
> -      IP address <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches
> -      <code>ip && ip4.dst == <var>VIP</var></code>.  For IPv6
> -      <var>VIPs</var>, the flow matches <code>ip && ip6.dst ==
> -      <var>VIP</var></code>.  The flow uses the action
> <code>ct_next;</code>
> -      to send IP packets to the connection tracker for packet
> de-fragmentation
> -      and tracking before sending it to the next table.
> +      to the next table.
> +    </p>
> +
> +    <p>
> +      If load balancing rules with virtual IP addresses (and ports) are
> +      configured in <code>OVN_Northbound</code> database for a Gateway
> router,
> +      a priority-100 flow is added for each configured virtual IP address
> +      <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches <code>ip
> +      && ip4.dst == <var>VIP</var></code>.  For IPv6
> <var>VIPs</var>,
> +      the flow matches <code>ip && ip6.dst ==
> <var>VIP</var></code>.
> +      The flow uses the action <code>ct_next;</code> to send IP packets
> to the
> +      connection tracker for packet de-fragmentation and tracking before
> +      sending it to the next table.
> +    </p>
> +
> +    <p>
> +      If ECMP routes with symmetric reply are configured in the
> +      <code>OVN_Northbound</code> database for a gateway router, a
> priority-100
> +      flow is added for each router port on which symmetric replies are
> +      configured. The matching logic for these ports essentially reverses
> the
> +      configured logic of the ECMP route. So for instance, a route with a
> +      destination routing policy will instead match if the source IP
> address
> +      matches the static route's prefix. The flow uses the action
> +      <code>ct_next</code> to send IP packets to the connection tracker
> for
> +      packet de-fragmentation and tracking before sending it to the next
> table.
>      </p>
>
>      <h3>Ingress Table 5: UNSNAT</h3>
> @@ -2489,7 +2505,15 @@ output;
>        table.  This table, instead, is responsible for determine the ECMP
>        group id and select a member id within the group based on 5-tuple
>        hashing.  It stores group id in <code>reg8[0..15]</code> and member
> id in
> -      <code>reg8[16..31]</code>.
> +      <code>reg8[16..31]</code>. This step is skipped if the traffic going
> +      out the ECMP route is reply traffic, and the ECMP route was
> configured
> +      to use symmetric replies. Instead, the stored <code>ct_label</code>
> value
> +      is used to choose the destination. The least significant 48 bits of
> the
> +      <code>ct_label</code> tell the destination MAC address to which the
> +      packet should be sent. The next 16 bits tell the logical router
> port on
> +      which the packet should be sent. These values in the
> +      <code>ct_label</code> are set when the initial ingress traffic is
> +      received over the ECMP route.
>      </p>
>
>      <p>
> @@ -2639,6 +2663,11 @@ select(reg8[16..31], <var>MID1</var>,
> <var>MID2</var>, ...);
>        address and <code>reg1</code> as the source protocol address).
>      </p>
>
> +    <p>
> +      This processing is skipped for reply traffic being sent out of an
> ECMP
> +      route if the route was configured to use symmetric replies.
> +    </p>
> +
>      <p>
>        This table contains the following logical flows:
>      </p>
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index d10e5ee5d..06881edd3 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -172,16 +172,17 @@ enum ovn_stage {
>      PIPELINE_STAGE(ROUTER, IN,  DEFRAG,          4, "lr_in_defrag")
>  \
>      PIPELINE_STAGE(ROUTER, IN,  UNSNAT,          5, "lr_in_unsnat")
>  \
>      PIPELINE_STAGE(ROUTER, IN,  DNAT,            6, "lr_in_dnat")
>  \
> -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   7,
> "lr_in_nd_ra_options") \
> -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  8,
> "lr_in_nd_ra_response") \
> -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9, "lr_in_ip_routing")
>  \
> -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10,
> "lr_in_ip_routing_ecmp") \
> -    PIPELINE_STAGE(ROUTER, IN,  POLICY,          11, "lr_in_policy")
>  \
> -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12,
> "lr_in_arp_resolve")  \
> -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13,
> "lr_in_chk_pkt_len")   \
> -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     14,"lr_in_larger_pkts")
>  \
> -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15,
> "lr_in_gw_redirect")  \
> -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16,
> "lr_in_arp_request")  \
> +    PIPELINE_STAGE(ROUTER, IN,  ECMP_STATEFUL,   7,
> "lr_in_ecmp_stateful") \
> +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   8,
> "lr_in_nd_ra_options") \
> +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  9,
> "lr_in_nd_ra_response") \
> +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      10, "lr_in_ip_routing")
>  \
> +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 11,
> "lr_in_ip_routing_ecmp") \
> +    PIPELINE_STAGE(ROUTER, IN,  POLICY,          12, "lr_in_policy")
>  \
> +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13,
> "lr_in_arp_resolve")  \
> +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14,
> "lr_in_chk_pkt_len")   \
> +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,     15,"lr_in_larger_pkts")
>  \
> +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16,
> "lr_in_gw_redirect")  \
> +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17,
> "lr_in_arp_request")  \
>                                                                        \
>      /* Logical router egress stages. */                               \
>      PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")        \
> @@ -7312,6 +7313,7 @@ struct parsed_route {
>      bool is_src_route;
>      uint32_t hash;
>      const struct nbrec_logical_router_static_route *route;
> +    bool ecmp_symmetric_reply;
>  };
>
>  static uint32_t
> @@ -7373,6 +7375,8 @@ parsed_routes_add(struct ovs_list *routes,
>                                                   "src-ip"));
>      pr->hash = route_hash(pr);
>      pr->route = route;
> +    pr->ecmp_symmetric_reply = smap_get_bool(&route->options,
> +                                             "ecmp_symmetric_reply",
> false);
>      ovs_list_insert(routes, &pr->list_node);
>      return pr;
>  }
> @@ -7621,18 +7625,95 @@ find_static_route_outport(struct ovn_datapath *od,
> struct hmap *ports,
>      return true;
>  }
>
> +static void
> +add_ecmp_symmetric_reply_flows(struct hmap *lflows,
> +                               struct ovn_datapath *od,
> +                               const char *port_ip,
> +                               struct ovn_port *out_port,
> +                               const struct parsed_route *route,
> +                               struct ds *route_match)
> +{
> +    const struct nbrec_logical_router_static_route *st_route =
> route->route;
> +    struct ds match = DS_EMPTY_INITIALIZER;
> +    struct ds actions = DS_EMPTY_INITIALIZER;
> +    struct ds ecmp_reply = DS_EMPTY_INITIALIZER;
> +    char *cidr = normalize_v46_prefix(&route->prefix, route->plen);
> +
> +    /* If symmetric ECMP replies are enabled, then packets that arrive
> over
> +     * an ECMP route need to go through conntrack.
> +     */
> +    ds_put_format(&match, "inport == %s && ip%s.%s == %s",
> +                  out_port->json_key,
> +                  route->prefix.family == AF_INET ? "4" : "6",
> +                  route->is_src_route ? "dst" : "src",
> +                  cidr);
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
> +                            ds_cstr(&match), "ct_next;",
> +                            &st_route->header_);
> +
> +    /* And packets that go out over an ECMP route need conntrack */
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
> +                            ds_cstr(route_match), "ct_next;",
> +                            &st_route->header_);
> +
> +    /* Save src eth and inport in ct_label for packets that arrive over
> +     * an ECMP route.
> +     *
> +     * NOTE: we purposely are not clearing match before this
> +     * ds_put_cstr() call. The previous contents are needed.
> +     */
> +    ds_put_cstr(&match, " && (ct.new && !ct.est)");
> +
> +    ds_put_format(&actions, "ct_commit { ct_label.ecmp_reply_eth =
> eth.src;"
> +                  " ct_label.ecmp_reply_port = %" PRId64 ";}; next;",
> +                  out_port->sb->tunnel_key);
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 100,
> +                            ds_cstr(&match), ds_cstr(&actions),
> +                            &st_route->header_);
> +
> +    /* Bypass ECMP selection if we already have ct_label information
> +     * for where to route the packet.
> +     */
> +    ds_put_format(&ecmp_reply, "ct.rpl && ct_label.ecmp_reply_port == %"
> PRId64,
> +                  out_port->sb->tunnel_key);
> +    ds_clear(&match);
> +    ds_put_format(&match, "%s && %s", ds_cstr(&ecmp_reply),
> +                  ds_cstr(route_match));
> +    ds_clear(&actions);
> +    ds_put_format(&actions, "ip.ttl--; flags.loopback = 1; "
> +                  "eth.src = %s; %sreg1 = %s; outport = %s; next;",
> +                  out_port->lrp_networks.ea_s,
> +                  route->prefix.family == AF_INET ? "" : "xx",
> +                  port_ip, out_port->json_key);
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_IP_ROUTING, 100,
> +                           ds_cstr(&match), ds_cstr(&actions),
> +                           &st_route->header_);
> +
> +    /* Egress reply traffic for symmetric ECMP routes skips router
> policies. */
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_POLICY, 65535,
> +                            ds_cstr(&ecmp_reply), "next;",
> +                            &st_route->header_);
> +
> +    ds_clear(&actions);
> +    ds_put_cstr(&actions, "eth.dst = ct_label.ecmp_reply_eth; next;");
> +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ARP_RESOLVE,
> +                            200, ds_cstr(&ecmp_reply),
> +                            ds_cstr(&actions), &st_route->header_);
> +}
> +
>  static void
>  build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
>                        struct hmap *ports, struct ecmp_groups_node *eg)
>
>  {
>      bool is_ipv4 = (eg->prefix.family == AF_INET);
> -    struct ds match = DS_EMPTY_INITIALIZER;
>      uint16_t priority;
> +    struct ecmp_route_list_node *er;
> +    struct ds route_match = DS_EMPTY_INITIALIZER;
>
>      char *prefix_s = build_route_prefix_s(&eg->prefix, eg->plen);
>      build_route_match(NULL, prefix_s, eg->plen, eg->is_src_route, is_ipv4,
> -                      &match, &priority);
> +                      &route_match, &priority);
>      free(prefix_s);
>
>      struct ds actions = DS_EMPTY_INITIALIZER;
> @@ -7640,7 +7721,6 @@ build_ecmp_route_flow(struct hmap *lflows, struct
> ovn_datapath *od,
>                    "; %s = select(", REG_ECMP_GROUP_ID, eg->id,
>                    REG_ECMP_MEMBER_ID);
>
> -    struct ecmp_route_list_node *er;
>      bool is_first = true;
>      LIST_FOR_EACH (er, list_node, &eg->route_list) {
>          if (is_first) {
> @@ -7654,11 +7734,12 @@ build_ecmp_route_flow(struct hmap *lflows, struct
> ovn_datapath *od,
>      ds_put_cstr(&actions, ");");
>
>      ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
> -                  ds_cstr(&match), ds_cstr(&actions));
> +                  ds_cstr(&route_match), ds_cstr(&actions));
>
>      /* Add per member flow */
> +    struct ds match = DS_EMPTY_INITIALIZER;
> +    struct sset visited_ports = SSET_INITIALIZER(&visited_ports);
>      LIST_FOR_EACH (er, list_node, &eg->route_list) {
> -
>          const struct parsed_route *route_ = er->route;
>          const struct nbrec_logical_router_static_route *route =
> route_->route;
>          /* Find the outgoing port. */
> @@ -7668,6 +7749,11 @@ build_ecmp_route_flow(struct hmap *lflows, struct
> ovn_datapath *od,
>                                         &out_port)) {
>              continue;
>          }
> +        if (route_->ecmp_symmetric_reply && sset_add(&visited_ports,
> +                                                 out_port->key)) {
> +            add_ecmp_symmetric_reply_flows(lflows, od, lrp_addr_s,
> out_port,
> +                                           route_, &route_match);
> +        }
>          ds_clear(&match);
>          ds_put_format(&match, REG_ECMP_GROUP_ID" == %"PRIu16" && "
>                        REG_ECMP_MEMBER_ID" == %"PRIu16,
> @@ -7688,7 +7774,9 @@ build_ecmp_route_flow(struct hmap *lflows, struct
> ovn_datapath *od,
>                                  ds_cstr(&match), ds_cstr(&actions),
>                                  &route->header_);
>      }
> +    sset_destroy(&visited_ports);
>      ds_destroy(&match);
> +    ds_destroy(&route_match);
>      ds_destroy(&actions);
>  }
>
> @@ -8972,6 +9060,7 @@ build_lrouter_flows(struct hmap *datapaths, struct
> hmap *ports,
>          ovn_lflow_add(lflows, od, S_ROUTER_IN_DNAT, 0, "1", "next;");
>          ovn_lflow_add(lflows, od, S_ROUTER_OUT_UNDNAT, 0, "1", "next;");
>          ovn_lflow_add(lflows, od, S_ROUTER_OUT_EGR_LOOP, 0, "1", "next;");
> +        ovn_lflow_add(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 0, "1",
> "next;");
>
>          /* Send the IPv6 NS packets to next table. When ovn-controller
>           * generates IPv6 NS (for the action - nd_ns{}), the injected
> diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> index 246cebc19..b1a462933 100644
> --- a/ovn-architecture.7.xml
> +++ b/ovn-architecture.7.xml
> @@ -1210,11 +1210,12 @@
>      <dd>
>        Fields that denote the connection tracking zones for routers.  These
>        values only have local significance and are not meaningful between
> -      chassis.  OVN stores the zone information for DNATting in Open
> vSwitch
> +      chassis.  OVN stores the zone information for north to south traffic
> +      (for DNATting or ECMP symmetric replies) in Open vSwitch
>          <!-- Keep the following in sync with MFF_LOG_DNAT_ZONE and
>          MFF_LOG_SNAT_ZONE in ovn/lib/logical-fields.h. -->
> -      extension register number 11 and zone information for SNATing in
> -      Open vSwitch extension register number 12.
> +      extension register number 11 and zone information for south to north
> +      traffic (for SNATing) in Open vSwitch extension register number 12.
>      </dd>
>
>      <dt>logical flow flags</dt>
> diff --git a/ovn-nb.ovsschema b/ovn-nb.ovsschema
> index da9af7157..16f7794f2 100644
> --- a/ovn-nb.ovsschema
> +++ b/ovn-nb.ovsschema
> @@ -1,7 +1,7 @@
>  {
>      "name": "OVN_Northbound",
>      "version": "5.24.0",
> -    "cksum": "1092394564 25961",
> +    "cksum": "679745602 26116",
>      "tables": {
>          "NB_Global": {
>              "columns": {
> @@ -365,6 +365,9 @@
>                                      "min": 0, "max": 1}},
>                  "nexthop": {"type": "string"},
>                  "output_port": {"type": {"key": "string", "min": 0,
> "max": 1}},
> +                "options": {
> +                    "type": {"key": "string", "value": "string",
> +                             "min": 0, "max": "unlimited"}},
>                  "external_ids": {
>                      "type": {"key": "string", "value": "string",
>                               "min": 0, "max": "unlimited"}}},
> diff --git a/ovn-nb.xml b/ovn-nb.xml
> index db5908cd5..fa804d776 100644
> --- a/ovn-nb.xml
> +++ b/ovn-nb.xml
> @@ -2481,6 +2481,20 @@
>        </column>
>      </group>
>
> +    <group title="Common options">
> +      <column name="options">
> +        This column provides general key/value settings. The supported
> +        options are described individually below.
> +      </column>
> +
> +      <column name="options" key="ecmp_symmetric_reply">
> +        It true, then new traffic that arrives over this route will have
> +        its reply traffic bypass ECMP route selection and will be sent out
> +        this route instead. Note that this option overrides any rules set
> +        in the <ref table="Logical_Router_policy" /> table.
> +      </column>
> +    </group>
> +
>    </table>
>
>    <table name="Logical_Router_Policy" title="Logical router policies">
> diff --git a/tests/ovn.at b/tests/ovn.at
> index 3afdcca1e..1a5ad67cc 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -195,6 +195,8 @@ ct.snat = ct_state[6]
>  ct.trk = ct_state[5]
>  ct_label = NXM_NX_CT_LABEL
>  ct_label.blocked = ct_label[0]
> +ct_label.ecmp_reply_eth = ct_label[0..47]
> +ct_label.ecmp_reply_port = ct_label[48..63]
>  ct_mark = NXM_NX_CT_MARK
>  ct_state = NXM_NX_CT_STATE
>  ]])
> @@ -16056,7 +16058,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  # Since the sw0-vir is not claimed by any chassis, eth.dst should be set
> to
>  # zero if the ip4.dst is the virtual ip in the router pipeline.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
>  ])
>
>  ip_to_hex() {
> @@ -16107,7 +16109,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  # There should be an arp resolve flow to resolve the virtual_ip with the
>  # sw0-p1's MAC.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
>  ])
>
>  # Forcibly clear virtual_parent. ovn-controller should release the binding
> @@ -16148,7 +16150,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  # There should be an arp resolve flow to resolve the virtual_ip with the
>  # sw0-p2's MAC.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:05; next;)
>  ])
>
>  # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> @@ -16171,7 +16173,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  # There should be an arp resolve flow to resolve the virtual_ip with the
>  # sw0-p3's MAC.
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
>  ])
>
>  # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> @@ -16192,7 +16194,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  > lflows.txt
>
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:03; next;)
>  ])
>
>  # Delete hv1-vif1 port. hv1 should release sw0-vir
> @@ -16210,7 +16212,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  > lflows.txt
>
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 00:00:00:00:00:00; next;)
>  ])
>
>  # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> @@ -16231,7 +16233,7 @@ ovn-sbctl dump-flows lr0 | grep lr_in_arp_resolve
> | grep "reg0 == 10.0.0.10" \
>  > lflows.txt
>
>  AT_CHECK([cat lflows.txt], [0], [dnl
> -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
> +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst = 50:54:00:00:00:04; next;)
>  ])
>
>  # Delete sw0-p2 logical port
> @@ -20265,22 +20267,22 @@ ovn-nbctl set logical_router_policy $pol5
> options:pkt_mark=5
>  ovn-nbctl --wait=hv sync
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x64->NXM_NX_PKT_MARK" -c)
>  ])
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x3->NXM_NX_PKT_MARK" -c)
>  ])
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x4->NXM_NX_PKT_MARK" -c)
>  ])
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x5->NXM_NX_PKT_MARK" -c)
>  ])
>
> @@ -20371,12 +20373,12 @@ send_ipv4_pkt hv1 hv1-vif1 505400000003
> 00000000ff01 \
>      $(ip_to_hex 10 0 0 3) $(ip_to_hex 172 168 0 120)
>
>  OVS_WAIT_UNTIL([
> -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x2->NXM_NX_PKT_MARK" -c)
>  ])
>
>  AT_CHECK([
> -    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
>      grep "load:0x64->NXM_NX_PKT_MARK" -c)
>  ])
>
> diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> index eddc530f9..451955bae 100644
> --- a/tests/system-ovn.at
> +++ b/tests/system-ovn.at
> @@ -4483,3 +4483,147 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port
> patch-.*/d
>  /connection dropped.*/d"])
>
>  AT_CLEANUP
> +
> +AT_SETUP([ovn -- ECMP symmetric reply])
> +AT_KEYWORDS([ecmp])
> +
> +CHECK_CONNTRACK()
> +ovn_start
> +
> +OVS_TRAFFIC_VSWITCHD_START()
> +ADD_BR([br-int])
> +
> +# Set external-ids in br-int needed for ovn-controller
> +ovs-vsctl \
> +        -- set Open_vSwitch . external-ids:system-id=hv1 \
> +        -- set Open_vSwitch .
> external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
> +        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
> +        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
> +        -- set bridge br-int fail-mode=secure
> other-config:disable-in-band=true
> +
> +# Start ovn-controller
> +start_daemon ovn-controller
> +
> +# Logical network:
> +# Alice is connected to router R1. R1 is connected to two routers, R2 and
> R3 via
> +# a "join" switch.
> +# Bob is connected to both R2 and R3. R1 contains two ECMP routes, one
> through R2
> +# and one through R3, to Bob.
> +#
> +#     alice -- R1 --join---- R2
> +#                     |         \
> +#                     |           bob
> +#                     |         /
> +#                     +----- R3
> +#
> +# For this test, Bob sends request traffic through R2 to Alice. We want
> to ensure that
> +# all response traffic from Alice is routed through R2 as well.
> +
> +ovn-nbctl create Logical_Router name=R1
> +ovn-nbctl create Logical_Router name=R2 options:chassis=hv1
> +ovn-nbctl create Logical_Router name=R3 options:chassis=hv1
> +
> +ovn-nbctl ls-add alice
> +ovn-nbctl ls-add bob
> +ovn-nbctl ls-add join
> +
> +# connect alice to R1
> +ovn-nbctl lrp-add R1 alice 00:00:01:01:02:03 10.0.0.1/24
> +ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
> +    type=router options:router-port=alice addresses='"00:00:01:01:02:03"'
> +
> +# connect bob to R2
> +ovn-nbctl lrp-add R2 R2_bob 00:00:02:01:02:03 172.16.0.2/16
> +ovn-nbctl lsp-add bob rp2-bob -- set Logical_Switch_Port rp2-bob \
> +    type=router options:router-port=R2_bob addresses='"00:00:02:01:02:03"'
> +
> +# connect bob to R3
> +ovn-nbctl lrp-add R3 R3_bob 00:00:02:01:02:04 172.16.0.3/16
> +ovn-nbctl lsp-add bob rp3-bob -- set Logical_Switch_Port rp3-bob \
> +    type=router options:router-port=R3_bob addresses='"00:00:02:01:02:04"'
> +
> +# Connect R1 to join
> +ovn-nbctl lrp-add R1 R1_join 00:00:04:01:02:03 20.0.0.1/24
> +ovn-nbctl lsp-add join r1-join -- set Logical_Switch_Port r1-join \
> +    type=router options:router-port=R1_join
> addresses='"00:00:04:01:02:03"'
> +
> +# Connect R2 to join
> +ovn-nbctl lrp-add R2 R2_join 00:00:04:01:02:04 20.0.0.2/24
> +ovn-nbctl lsp-add join r2-join -- set Logical_Switch_Port r2-join \
> +    type=router options:router-port=R2_join
> addresses='"00:00:04:01:02:04"'
> +
> +# Connect R3 to join
> +ovn-nbctl lrp-add R3 R3_join 00:00:04:01:02:05 20.0.0.3/24
> +ovn-nbctl lsp-add join r3-join -- set Logical_Switch_Port r3-join \
> +    type=router options:router-port=R3_join
> addresses='"00:00:04:01:02:05"'
> +
> +# Install ECMP routes for alice.
> +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1
> 10.0.0.0/24 20.0.0.2
> +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1
> 10.0.0.0/24 20.0.0.3
> +
> +# Static Routes
> +ovn-nbctl lr-route-add R2 10.0.0.0/24 20.0.0.1
> +ovn-nbctl lr-route-add R3 10.0.0.0/24 20.0.0.1
> +
> +# Logical port 'alice1' in switch 'alice'.
> +ADD_NAMESPACES(alice1)
> +ADD_VETH(alice1, alice1, br-int, "10.0.0.2/24", "f0:00:00:01:02:04", \
> +         "10.0.0.1")
> +ovn-nbctl lsp-add alice alice1 \
> +-- lsp-set-addresses alice1 "f0:00:00:01:02:04 10.0.0.2"
> +
> +# Logical port 'bob1' in switch 'bob'.
> +ADD_NAMESPACES(bob1)
> +ADD_VETH(bob1, bob1, br-int, "172.16.0.1/16", "f0:00:00:01:02:06", \
> +         "172.16.0.2")
> +ovn-nbctl lsp-add bob bob1 \
> +-- lsp-set-addresses bob1 "f0:00:00:01:02:06 172.16.0.1"
> +
> +# Ensure ovn-controller is caught up
> +ovn-nbctl --wait=hv sync
> +
> +on_exit 'ovs-ofctl dump-flows br-int'
> +
> +# 'bob1' should be able to ping 'alice1' directly.
> +NS_CHECK_EXEC([bob1], [ping -q -c 20 -i 0.3 -w 15 10.0.0.2 |
> FORMAT_PING], \
> +[0], [dnl
> +20 packets transmitted, 20 received, 0% packet loss, time 0ms
> +])
> +
> +# Ensure conntrack entry is present. We should not try to predict
> +# the tunnel key for the output port, so we strip it from the labels
> +# and just ensure that the known ethernet address is present.
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.0.1) | \
> +sed -e 's/zone=[[0-9]]*/zone=<cleared>/' |
> +sed -e 's/labels=0x[[0-9a-f]]*000004010204/labels=0x000004010204/'], [0],
> [dnl
>
> +icmp,orig=(src=172.16.0.1,dst=10.0.0.2,id=<cleared>,type=8,code=0),reply=(src=10.0.0.2,dst=172.16.0.1,id=<cleared>,type=0,code=0),zone=<cleared>,labels=0x000004010204
> +])
> +
> +ovs-appctl dpctl/dump-flows
> +
> +# Ensure datapaths show conntrack states as expected
> +# Like with conntrack entries, we shouldn't try to predict
> +# port binding tunnel keys. So omit them from expected labels.
> +AT_CHECK([ovs-appctl dpctl/dump-flows | grep
> 'ct_state(+new-est-rpl+trk).*ct(.*label=0x.*000004010204/0xffffffffffffffff)'
> -c], [0], [dnl
> +1
> +])
> +AT_CHECK([ovs-appctl dpctl/dump-flows | grep
> 'ct_state(-new+est+rpl+trk).*ct_label(0x.*000004010204/0xffffffffffffffff)'
> -c], [0], [dnl
> +1
> +])
> +
> +OVS_APP_EXIT_AND_WAIT([ovn-controller])
> +
> +as ovn-sb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as ovn-nb
> +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> +
> +as northd
> +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> +
> +as
> +OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
> +/connection dropped.*/d"])
> +
> +AT_CLEANUP
> diff --git a/utilities/ovn-nbctl.8.xml b/utilities/ovn-nbctl.8.xml
> index de86b70e6..18bf90e08 100644
> --- a/utilities/ovn-nbctl.8.xml
> +++ b/utilities/ovn-nbctl.8.xml
> @@ -658,7 +658,8 @@
>
>      <dl>
>        <dt>[<code>--may-exist</code>]
> [<code>--policy</code>=<var>POLICY</var>]
> -        [<code>--ecmp</code>] <code>lr-route-add</code> <var>router</var>
> +        [<code>--ecmp</code>] [<code>--ecmp-symmetric-reply</code>]
> +        <code>lr-route-add</code> <var>router</var>
>          <var>prefix</var> <var>nexthop</var> [<var>port</var>]</dt>
>        <dd>
>          <p>
> @@ -680,15 +681,31 @@
>            specified, the default is "dst-ip".
>          </p>
>
> +        <p>
> +          The <code>--ecmp</code> option allows for multiple routes with
> the
> +          same <var>prefix</var> <var>POLICY</var> but different
> +          <var>nexthop</var> and <var>port</var> to be added.
> +        </p>
> +
> +        <p>
> +          The <code>--ecmp-symmetric-reply</code> option makes it so that
> +          traffic that arrives over an ECMP route will have its reply
> traffic
> +          sent out over that same route. Setting
> +          <code>--ecmp-symmetric-reply</code> implies <code>--ecmp</code>
> so
> +          it is not necessary to set both.
> +        </p>
> +
>          <p>
>            It is an error if a route with <var>prefix</var> and
> -          <var>POLICY</var> already exists, unless
> <code>--may-exist</code> or
> -          <code>--ecmp</code> is specified.  If <code>--may-exist</code>
> is
> -          specified but not <code>--ecmp</code>, the existed route will be
> -          updated with the new nexthop and port.  If <code>--ecmp</code>
> is
> +          <var>POLICY</var> already exists, unless
> <code>--may-exist</code>,
> +          <code>--ecmp</code>, or <code>--ecmp-symmetric-reply</code> is
> +          specified.  If <code>--may-exist</code> is specified but not
> +          <code>--ecmp</code> or <code>--ecmp-symmetric-reply</code>, the
> +          existed route will be updated with the new nexthop and port.  If
> +          <code>--ecmp</code> or <code>--ecmp-symmetric-reply</code> is
>            specified, a new route will be added, regardless of the existed
> -          route, which is useful when adding ECMP routes, i.e. routes
> with same
> -          <var>POLICY</var> and <var>prefix</var> but different
> +          route., which is useful when adding ECMP routes, i.e. routes
> with
> +          same <var>POLICY</var> and <var>prefix</var> but different
>            <var>nexthop</var> and <var>port</var>.
>          </p>
>        </dd>
> diff --git a/utilities/ovn-nbctl.c b/utilities/ovn-nbctl.c
> index 0079ad5a6..e6d8dbe63 100644
> --- a/utilities/ovn-nbctl.c
> +++ b/utilities/ovn-nbctl.c
> @@ -687,7 +687,8 @@ Logical router port commands:\n\
>                              ('overlay' or 'bridged')\n\
>  \n\
>  Route commands:\n\
> -  [--policy=POLICY] [--ecmp] lr-route-add ROUTER PREFIX NEXTHOP [PORT]\n\
> +  [--policy=POLICY] [--ecmp] [--ecmp-symmetric-reply] lr-route-add ROUTER
> \n\
> +                            PREFIX NEXTHOP [PORT]\n\
>                              add a route to ROUTER\n\
>    [--policy=POLICY] lr-route-del ROUTER [PREFIX [NEXTHOP [PORT]]]\n\
>                              remove routes from ROUTER\n\
> @@ -3855,7 +3856,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>      }
>
>      bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
> -    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL;
> +    bool ecmp_symmetric_reply = shash_find(&ctx->options,
> +                                           "--ecmp-symmetric-reply") !=
> NULL;
> +    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL ||
> +                ecmp_symmetric_reply;
>      if (!ecmp) {
>          for (int i = 0; i < lr->n_static_routes; i++) {
>              const struct nbrec_logical_router_static_route *route
> @@ -3920,6 +3924,13 @@ nbctl_lr_route_add(struct ctl_context *ctx)
>          nbrec_logical_router_static_route_set_policy(route, policy);
>      }
>
> +    if (ecmp_symmetric_reply) {
> +        const struct smap options = SMAP_CONST1(&options,
> +                                                "ecmp_symmetric_reply",
> +                                                "true");
> +        nbrec_logical_router_static_route_set_options(route, &options);
> +    }
> +
>      nbrec_logical_router_verify_static_routes(lr);
>      struct nbrec_logical_router_static_route **new_routes
>          = xmalloc(sizeof *new_routes * (lr->n_static_routes + 1));
> @@ -6361,7 +6372,8 @@ static const struct ctl_command_syntax
> nbctl_commands[] = {
>
>      /* logical router route commands. */
>      { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
> -      nbctl_lr_route_add, NULL, "--may-exist,--ecmp,--policy=", RW },
> +      nbctl_lr_route_add, NULL,
> "--may-exist,--ecmp,--ecmp-symmetric-reply,"
> +      "--policy=", RW },
>      { "lr-route-del", 1, 4, "ROUTER [PREFIX [NEXTHOP [PORT]]]", NULL,
>        nbctl_lr_route_del, NULL, "--if-exists,--policy=", RW },
>      { "lr-route-list", 1, 1, "ROUTER", NULL, nbctl_lr_route_list, NULL,
> --
> 2.25.4
>
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
>


More information about the dev mailing list