[ovs-dev] [PATCH ovn] lflow: Use learn() action to generate LB hairpin reply flows.

Dumitru Ceara dceara at redhat.com
Mon Feb 8 10:32:22 UTC 2021


On 2/8/21 11:09 AM, Numan Siddique wrote:
> On Sat, Feb 6, 2021 at 4:00 AM Dumitru Ceara <dceara at redhat.com> wrote:
>>
>> The main trait of load balancer hairpin traffic is that it never leaves
>> the local hypervisor.  Essentially this means that only hairpin
>> openflows installed for logical switches that have at least one logical
>> switch port bound locally can ever be hit.
>>
>> Until now, if a load balancer was applied on multiple logical switches
>> that are connected through a distributed router, ovn-controller would
>> install flows to detect hairpin replies for all logical switches. In
>> practice this leads to a very high number of openflows out of which
>> most will never be used.
>>
>> Instead we now use an additional action, learn(), on flows that match on
>> packets that create the hairpin session.  The learn() action will then
>> generate the necessary flows to handle hairpin replies, but only for
>> the local datapaths which actually generate hairpin traffic.
>>
>> For example, simulating how ovn-k8s uses load balancer for services,
>> in a "switch per node" scenario, the script below would generate
>> 10K (n_nodes * n_vips * n_backends) openflows on every node in table=69
>> (hairpin reply).  With this patch the maximum number of openflows that
>> can be created for hairpin replies is 200 (n_vips * n_backends).
>>
>> In general, for deployments that leverage switch-per-node topologies,
>> the number of openflows is reduced by a factor of N, where N is the
>> number of nodes.
>>
>>    $ cat lbs.sh
>>    NODES=50
>>    VIPS=20
>>    BACKENDS=10
>>    ovn-nbctl lr-add rtr
>>    for ((i = 1; i <= $NODES; i++)); do
>>        ovn-nbctl \
>>            -- ls-add ls$i \
>>            -- lsp-add ls$i vm$i \
>>            -- lsp-add ls$i ls$i-rtr \
>>            -- lsp-set-type ls$i-rtr router \
>>            -- lsp-set-options ls$i-rtr router-port=rtr-ls$i \
>>            -- lrp-add rtr rtr-ls$i 00:00:00:00:01:00 42.42.42.$i/24
>>    done
>>
>>    for ((i = 1; i <= $VIPS; i++)); do
>>        lb=lb$i
>>        vip=10.10.10.$i:1
>>        bip=20.20.20.1:2
>>        for ((j = 2; j <= $BACKENDS; j++)); do
>>            bip="$bip,20.20.20.$j:2"
>>        done
>>        ovn-nbctl lb-add $lb $vip $backends
>>    done
>>
>>    for ((i = 1; i <= $NODES; i++)); do
>>        for ((j = 1; j <= $VIPS; j++)); do
>>            ovn-nbctl ls-lb-add ls$i lb$j
>>        done
>>    done
>>
>>    ovs-vsctl add-port br-int vm1 \
>>        -- set interface vm1 type=internal \
>>        -- set interface vm1 external-ids:iface-id=vm1
>>
>> Suggested-by: Ilya Maximets <i.maximets at ovn.org>
>> Signed-off-by: Dumitru Ceara <dceara at redhat.com>
> 
> Hi Dumitru,

Hi Numan,

> 
> Thanks for this patch. The patch LGTM. I tested it out with  huge NB
> DB with lots
> of load balancer and backends.  I see a significant reduction in OF flows.

Thanks for the review!

> 
> Please see the few comments below.
> 
> Thanks
> Numan
> 
>> ---
>>   controller/lflow.c      | 204 ++++++++++++++++---
>>   tests/ofproto-macros.at |   5 +-
>>   tests/ovn.at            | 516 +++++++++++++++++++++++++-----------------------
>>   3 files changed, 455 insertions(+), 270 deletions(-)
>>
>> diff --git a/controller/lflow.c b/controller/lflow.c
>> index 946c1e0..2b7d356 100644
>> --- a/controller/lflow.c
>> +++ b/controller/lflow.c
>> @@ -1171,6 +1171,178 @@ add_neighbor_flows(struct ovsdb_idl_index *sbrec_port_binding_by_name,
>>       }
>>   }
>>
>> +/* Builds the "learn()" action to be triggered by packets initiating a
>> + * hairpin session.
>> + *
>> + * This will generate flows in table OFTABLE_CHK_LB_HAIRPIN_REPLY of the form:
>> + * - match:
>> + *     metadata=<orig-pkt-metadata>,ip/ipv6,ip.src=<backend>,ip.dst=<vip>
>> + *     nw_proto='lb_proto',tp_src_port=<backend-port>
>> + * - action:
>> + *     set MLF_LOOKUP_LB_HAIRPIN_BIT=1
>> + */
>> +static void
>> +add_lb_vip_hairpin_reply_action(struct in6_addr *vip6, ovs_be32 vip,
>> +                                uint8_t lb_proto, bool has_l4_port,
>> +                                uint64_t cookie, struct ofpbuf *ofpacts)
>> +{
>> +    struct match match = MATCH_CATCHALL_INITIALIZER;
>> +    struct ofpact_learn *ol = ofpact_put_LEARN(ofpacts);
>> +    struct ofpact_learn_spec *ol_spec;
>> +    unsigned int imm_bytes;
>> +    uint8_t *src_imm;
>> +
>> +    /* Once learned, hairpin reply flows are permanent until the VIP/backend
>> +     * is removed.
>> +     */
>> +    ol->flags = NX_LEARN_F_DELETE_LEARNED;
>> +    ol->idle_timeout = OFP_FLOW_PERMANENT;
>> +    ol->hard_timeout = OFP_FLOW_PERMANENT;
>> +    ol->priority = OFP_DEFAULT_PRIORITY;
>> +    ol->table_id = OFTABLE_CHK_LB_HAIRPIN_REPLY;
>> +    ol->cookie = htonll(cookie);
>> +
>> +    /* Match on metadata of the packet that created the hairpin session. */
>> +    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
>> +
>> +    ol_spec->dst.field = mf_from_id(MFF_METADATA);
>> +    ol_spec->dst.ofs = 0;
>> +    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
>> +    ol_spec->n_bits = ol_spec->dst.n_bits;
>> +    ol_spec->dst_type = NX_LEARN_DST_MATCH;
>> +    ol_spec->src_type = NX_LEARN_SRC_FIELD;
>> +    ol_spec->src.field = mf_from_id(MFF_METADATA);
>> +
>> +    /* Match on the same ETH type as the packet that created the hairpin
>> +     * session.
>> +     */
>> +    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
>> +    ol_spec->dst.field = mf_from_id(MFF_ETH_TYPE);
>> +    ol_spec->dst.ofs = 0;
>> +    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
>> +    ol_spec->n_bits = ol_spec->dst.n_bits;
>> +    ol_spec->dst_type = NX_LEARN_DST_MATCH;
>> +    ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE;
>> +    union mf_value imm_eth_type = {
>> +        .be16 = !vip6 ? htons(ETH_TYPE_IP) : htons(ETH_TYPE_IPV6)
>> +    };
>> +    mf_write_subfield_value(&ol_spec->dst, &imm_eth_type, &match);
>> +
>> +    /* Push value last, as this may reallocate 'ol_spec'. */
>> +    imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8);
>> +    src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes));
>> +    memcpy(src_imm, &imm_eth_type, imm_bytes);
>> +
>> +    /* Hairpin replies have ip.src == <backend-ip>. */
>> +    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
>> +    if (!vip6) {
>> +        ol_spec->dst.field = mf_from_id(MFF_IPV4_SRC);
>> +        ol_spec->src.field = mf_from_id(MFF_IPV4_SRC);
>> +    } else {
>> +        ol_spec->dst.field = mf_from_id(MFF_IPV6_SRC);
>> +        ol_spec->src.field = mf_from_id(MFF_IPV6_SRC);
>> +    }
>> +    ol_spec->dst.ofs = 0;
>> +    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
>> +    ol_spec->n_bits = ol_spec->dst.n_bits;
>> +    ol_spec->dst_type = NX_LEARN_DST_MATCH;
>> +    ol_spec->src_type = NX_LEARN_SRC_FIELD;
>> +
>> +    /* Hairpin replies have ip.dst == <vip>. */
>> +    union mf_value imm_ip;
>> +    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
>> +    if (!vip6) {
>> +        ol_spec->dst.field = mf_from_id(MFF_IPV4_DST);
>> +        imm_ip = (union mf_value) {
>> +            .be32 = vip
>> +        };
>> +    } else {
>> +        ol_spec->dst.field = mf_from_id(MFF_IPV6_DST);
>> +        imm_ip = (union mf_value) {
>> +            .ipv6 = *vip6
>> +        };
>> +    }
>> +    ol_spec->dst.ofs = 0;
>> +    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
>> +    ol_spec->n_bits = ol_spec->dst.n_bits;
>> +    ol_spec->dst_type = NX_LEARN_DST_MATCH;
>> +    ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE;
>> +    mf_write_subfield_value(&ol_spec->dst, &imm_ip, &match);
>> +
>> +    /* Push value last, as this may reallocate 'ol_spec' */
>> +    imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8);
>> +    src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes));
>> +    memcpy(src_imm, &imm_ip, imm_bytes);
>> +
>> +    /* Hairpin replies have the same nw_proto as packets that created the
>> +     * session.
>> +     */
>> +    union mf_value imm_proto = {
>> +        .u8 = lb_proto,
>> +    };
>> +    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
>> +    ol_spec->dst.field = mf_from_id(MFF_IP_PROTO);
>> +    ol_spec->src.field = mf_from_id(MFF_IP_PROTO);
>> +    ol_spec->dst.ofs = 0;
>> +    ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
>> +    ol_spec->n_bits = ol_spec->dst.n_bits;
>> +    ol_spec->dst_type = NX_LEARN_DST_MATCH;
>> +    ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE;
>> +    mf_write_subfield_value(&ol_spec->dst, &imm_proto, &match);
>> +
>> +    /* Push value last, as this may reallocate 'ol_spec' */
>> +    imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8);
>> +    src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes));
>> +    memcpy(src_imm, &imm_proto, imm_bytes);
>> +
>> +    /* Hairpin replies have source port == <backend-port>. */
>> +    if (has_l4_port) {
>> +        ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
>> +        switch (lb_proto) {
>> +        case IPPROTO_TCP:
>> +            ol_spec->dst.field = mf_from_id(MFF_TCP_SRC);
>> +            ol_spec->src.field = mf_from_id(MFF_TCP_DST);
>> +            break;
>> +        case IPPROTO_UDP:
>> +            ol_spec->dst.field = mf_from_id(MFF_UDP_SRC);
>> +            ol_spec->src.field = mf_from_id(MFF_UDP_DST);
>> +            break;
>> +        case IPPROTO_SCTP:
>> +            ol_spec->dst.field = mf_from_id(MFF_SCTP_SRC);
>> +            ol_spec->src.field = mf_from_id(MFF_SCTP_DST);
>> +            break;
>> +        default:
>> +            OVS_NOT_REACHED();
>> +            break;
>> +        }
>> +        ol_spec->dst.ofs = 0;
>> +        ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
>> +        ol_spec->n_bits = ol_spec->dst.n_bits;
>> +        ol_spec->dst_type = NX_LEARN_DST_MATCH;
>> +        ol_spec->src_type = NX_LEARN_SRC_FIELD;
>> +    }
>> +
>> +    /* Set MLF_LOOKUP_LB_HAIRPIN_BIT for hairpin replies. */
>> +    ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
>> +    ol_spec->dst.field = mf_from_id(MFF_LOG_FLAGS);
>> +    ol_spec->dst.ofs = MLF_LOOKUP_LB_HAIRPIN_BIT;
>> +    ol_spec->dst.n_bits = 1;
>> +    ol_spec->n_bits = ol_spec->dst.n_bits;
>> +    ol_spec->dst_type = NX_LEARN_DST_LOAD;
>> +    ol_spec->src_type = NX_LEARN_SRC_IMMEDIATE;
>> +    union mf_value imm_reg_value = {
>> +        .u8 = 1
>> +    };
>> +    mf_write_subfield_value(&ol_spec->dst, &imm_reg_value, &match);
>> +
>> +    /* Push value last, as this may reallocate 'ol_spec' */
>> +    imm_bytes = DIV_ROUND_UP(ol_spec->dst.n_bits, 8);
>> +    src_imm = ofpbuf_put_zeros(ofpacts, OFPACT_ALIGN(imm_bytes));
>> +    memcpy(src_imm, &imm_reg_value, imm_bytes);
>> +
>> +    ofpact_finish_LEARN(ofpacts, &ol);
> 
> The above code makes perfect sense. But I think we can reduce the amount of
> code by making use of the 'learn_parse()' function present in
> lib/learn.c of OVS.
> 
> This would also make the code much more simpler. Otherwise the reader
> has to understand all the intricacies of adding 'learn' ofpact.
> 
> I tried something like below to test it out and It worked fine to me
> 
> *****************
> struct ds learn = DS_EMPTY_INITIALIZER;
> ds_put_format(&learn, "table=%d,delete_learned,OXM_OF_METADATA[],"
> 
> "eth_type=0x800,nw_proto=6,NXM_OF_IP_SRC[],ip_dst="IP_FMT","
> 
> "NXM_OF_TCP_SRC[]=NXM_OF_TCP_DST[],load:0x1->NXM_NX_REG10[7]",
>                          OFTABLE_CHK_LB_HAIRPIN_REPLY, IP_ARGS(vip));
> char *error = learn_parse(ds_cstr(&learn), NULL, NULL, &ofpbuf_learn);
> if (!error) {
> ofpbuf_put(ofpacts, ofpbuf_learn.data, ofpbuf_learn.size);
> }
> ...
> ..
> 
> ******************
> 
> What do you think ?

To be honest, I'm not so sure about this.  It would create a dependency 
on how OVS formats openflows internally.

I think we can argue the same thing about intricacies of adding other 
OVS actions, e.g.:

https://github.com/ovn-org/ovn/blob/44ea2ec88136f83e7eab9790473025b6c95bdcc0/controller/lflow.c#L1267

https://github.com/ovn-org/ovn/blob/44ea2ec88136f83e7eab9790473025b6c95bdcc0/lib/actions.c#L422

We could change those too and use OVS parsing internal utilities to 
parse strings but we don't and, in my opinion, the proper way to go is 
to configure the actions programatically by using the data structures in 
include/openvswitch/ofp-actions.h

If it helps I can try to refactor a bit the code that creates the 
learn() action for hairpin although I'm not sure at the moment how to do 
that.

> 
> Thanks
> Numan

Regards,
Dumitru



More information about the dev mailing list