[ovs-dev] [PATCH RFC 2/2] openvswitch: Userspace tunneling.

Jarno Rajahalme jrajahalme at nicira.com
Thu Oct 9 21:53:30 UTC 2014


Pravin,

Please find my comments below. I did not snip any code to make it easier for you to keep into context while reading this review.

  Jarno

On Oct 3, 2014, at 8:24 PM, Pravin B Shelar <pshelar at nicira.com> wrote:

> Following patch adds support for userspace tunneling. Tunneling
> needs three more component first is routing table which is configured by
> user and second is ARP cache which build automatically by snooping arp.
> And third is tunnel protocol table which list all listening protocols
> which is populated by vswitchd as tunnel ports are added.
> 
> Tunneling works as follows:
> On packet receive vswitchd check if this packet is targeted to tunnel
> port. If it is then vswitchd inserts tunnel pop action which pops
> header and sends packet to tunnel port.
> On packet xmit rather than generating Set tunnel action it generate
> tunnel push action which has tunnel header data. datapath can use
> tunnel-push action data to generate header for each packet and
> forward this packet to output port. Since tunnel-push action
> contains most of packet header it need to lookup routing table and
> arp table to build this action.
> 
> Signed-off-by: Pravin B Shelar <pshelar at nicira.com>
> ---
> Makefile.am                                       |   1 +
> README-Userspace-Tunneling                        |  81 ++++++++
> datapath/linux/compat/include/linux/openvswitch.h |  10 +
> debian/openvswitch-common.docs                    |   1 +
> lib/automake.mk                                   |   9 +-
> lib/dpif-netdev.c                                 | 125 +++++++++++-
> lib/dpif.c                                        |  25 +++
> lib/dpif.h                                        |   4 +
> lib/netdev-bsd.c                                  |   3 +
> lib/netdev-dpdk.c                                 |   3 +
> lib/netdev-dummy.c                                |   3 +
> lib/netdev-linux.c                                |   3 +
> lib/netdev-provider.h                             |   9 +
> lib/netdev-vport.c                                | 186 +++++++++++++++--
> lib/netdev.c                                      |  34 ++++
> lib/netdev.h                                      |   7 +
> lib/odp-execute.c                                 |   2 +
> lib/odp-util.c                                    | 118 +++++++++++
> lib/odp-util.h                                    |   4 +
> lib/ofpbuf.h                                      |   8 +
> lib/packets.c                                     |  35 ++++
> lib/packets.h                                     |   9 +-
> lib/tnl-arp-cache.c                               | 214 +++++++++++++++++++
> lib/tnl-arp-cache.h                               |  47 +++++
> lib/tnl-ports.c                                   | 203 ++++++++++++++++++
> lib/tnl-ports.h                                   |  49 +++++
> lib/tnl-router.c                                  | 237 ++++++++++++++++++++++
> lib/tnl-router.h                                  |  42 ++++
> lib/vlog.c                                        |   1 +
> lib/vxlan.h                                       |  46 +++++
> ofproto/ofproto-dpif-xlate.c                      | 176 +++++++++++++++-
> ofproto/ofproto-dpif-xlate.h                      |   3 +-
> ofproto/ofproto-dpif.c                            |  38 +++-
> ofproto/tunnel.c                                  |  79 +++++++-
> ofproto/tunnel.h                                  |  10 +-
> rhel/openvswitch.spec.in                          |   2 +-
> vswitchd/ovs-vswitchd.8.in                        |   4 +
> vswitchd/ovs-vswitchd.c                           |   6 +
> 38 files changed, 1785 insertions(+), 52 deletions(-)
> create mode 100644 README-Userspace-Tunneling
> create mode 100644 lib/tnl-arp-cache.c
> create mode 100644 lib/tnl-arp-cache.h
> create mode 100644 lib/tnl-ports.c
> create mode 100644 lib/tnl-ports.h
> create mode 100644 lib/tnl-router.c
> create mode 100644 lib/tnl-router.h
> create mode 100644 lib/vxlan.h
> 
> diff --git a/Makefile.am b/Makefile.am
> index 77ceec6..3941e0f 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -84,6 +84,7 @@ EXTRA_DIST = \
>  	PORTING \
>  	README.md \
>  	README-lisp \
> +	README-Userspace-Tunneling \
>  	REPORTING-BUGS \
>  	TODO \
>  	.travis.yml \
> diff --git a/README-Userspace-Tunneling b/README-Userspace-Tunneling
> new file mode 100644
> index 0000000..1aebbca
> --- /dev/null
> +++ b/README-Userspace-Tunneling
> @@ -0,0 +1,81 @@
> +
> +Open vSwitch support tunneling in userspace. Tunneling is implemented in
> +platform independent way. But it does expect following from platform:
> +1. dpif-netdev bridge is backed by netdev should be able to
> +   reply to ARP requests.

What does it mean for a netdev to be able to reply to ARP requests? Or do you mean that the bridge should be able to reply to ARP requests?

> +2. OVS tunneling routing table is configured.
> +
> +Setup:
> +======
> +First start vswitch with --userspace-tunneling option to enable tunneling.

I’ll come back to this later, but I think we should treat the new tunneling feature like any other optional datapath feature: Probe for it and (only) use it if the probe was successful. That way it will be easier for other datapaths to implement this and make use of it. I don’t know any details of, e.g., Windows datapath tunneling plans, but it should be feasible to use the same methods also with kernel datapaths.

> +Setup physical bridges for app physical interfaces. create integration bridge.

Would it be possible to use the new tunneling feature with a single bridge? I see it may require recirculation, but is that really different from the case of different physical and integration bridges? I’m also trying to figure out if this model could fit the case of switching ASIC supporting tunneling, where a single switch view might be preferable.

> +Add vxlan port to int-bridge. Assign IP address to physical bridge where
> +vxlan traffic is expected. Add tunnel routes for the hysical bridge.

“physical”

> +
> +Example:
> +========
> +If you want to connect to vxlan tunnel endpoint  (logical ip: 192.168.1.2
> +and remote physical ip: 172.168.1.2) that you want to connect to.
> +
> +In this case you need to configure OVS bridges as follows.
> +
> +1. Let assume 172.168.1.2/24 network is reachable via eth1 create physical bridge br-eth1
> +   assign ip address (172.168.1.1/24) to br-eth1, Add eth1 to br-eth1
> +2. Add route to ovs
> +   ovs-appctl tnl/route/add 172.168.1.1 255.255.255.0 br-eth1

Could OVS do this step automatically based on the assigned IP address, as all the parameters are already there?

I see that if the route involves a gateway router it needs to be explicitly added, though.

> +3. Add integration brdge int-br and add tunnel port using standard syntax.
> +   ovs-vsctl add-port int-br vxlan0 -- set interface vxlan0 type=vxlan  options:remote_ip=172.168.1.2
> +4. Assign IP address to int-br, So final topology looks like:
> +
> +
> +       192.168.1.1/24
> +       +--------------+
> +       |    int-br    |                                      192.168.1.2/24
> +       +--------------+                                      +--------------+
> +       |    vxlan0    |                                      |    vxlan0    |
> +       +--------------+                                      +--------------+
> +             |                                                     |
> +             |                                                     |
> +             |                                                     |
> +        172.168.1.1/24                                             |
> +       +--------------+                                            |
> +       |    br-eth1   |                                      172.168.1.2/24
> +       +--------------+                                      +---------------+
> +       |    eth1      |--------------------------------------|    eth1       |
> +       +--------------+                                      +----------------
> +
> +       Host A with OVS.                                      Remote host.
> +
> +With this setup, ping to vxlan target device (192.168.1.2) should wotk

“wotk” -> “work"

> +There are following commands that shows internal tables:
> +
> +Tunneling related commands:
> +===========================
> +Tunnel routing table:
> +    To Add route:
> +       ovs-appctl tnl/route/add <IP address> <mask> <output-bridge-name>

Does this syntax support routing via a gateway router, when the peer is not in the same subnet?

> +    To see all routes configured:
> +       ovs-appctl tnl/route/show
> +    To del route
> +       ovs-appctl tnl/route/del <IP address> <mask>
> +
> +ARP:
> +    To see arp cache content:
> +       ovs-appctl tnl/arp/show
> +    To flush arp cache
> +        ovs-appctl tnl/arp/flush
> +
> +To check tunnel ports listening in vswitchd:
> +ovs-appctl tnl/ports/show
> +
> +Limits:
> +=======
> + - Does not handle IP fragments
> + - Only VxLan support
> + - Does not works correctly with vlans.
> +
> +I hope we will fix it soon.
> +
> +Contact
> +=======
> +bugs at openvswitch.org
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
> index b2257e6..22647e9 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -580,6 +580,12 @@ struct ovs_action_hash {
> 	uint32_t  hash_basis;
> };
> 
> +enum ovs_tunnel_action_attr {
> +	OVS_ACTION_TUNNEL_HEADER_PORT,
> +	OVS_ACTION_TUNNEL_HEADER_PHY_PORT,
> +	OVS_ACTION_TUNNEL_HEADER_DATA,
> +};
> +
> /**
>  * enum ovs_action_attr - Action types.
>  *
> @@ -633,6 +639,10 @@ enum ovs_action_attr {
> 				       * data immediately followed by a mask.
> 				       * The data must be zero for the unmasked
> 				       * bits. */
> +#ifndef __KERNEL__
> +	OVS_ACTION_ATTR_TUNNEL_PUSH = 64,
> +	OVS_ACTION_ATTR_TUNNEL_POP,
> +#endif

Managing the number allocation will become tricky if we’ll get e.g. Windows kernel DP supporting these.

> 	__OVS_ACTION_ATTR_MAX
> };
> 
> diff --git a/debian/openvswitch-common.docs b/debian/openvswitch-common.docs
> index 3bd2ca3..9d053e2 100644
> --- a/debian/openvswitch-common.docs
> +++ b/debian/openvswitch-common.docs
> @@ -1,2 +1,3 @@
> FAQ
> INSTALL.DPDK
> +README-Userspace-Tunneling
> diff --git a/lib/automake.mk b/lib/automake.mk
> index c44a77f..da54204 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -231,6 +231,12 @@ lib_libopenvswitch_la_SOURCES = \
> 	lib/timer.h \
> 	lib/timeval.c \
> 	lib/timeval.h \
> +	lib/tnl-arp-cache.c \
> +	lib/tnl-arp-cache.h \
> +	lib/tnl-ports.c \
> +	lib/tnl-ports.h \
> +	lib/tnl-router.c \
> +	lib/tnl-router.h \
> 	lib/token-bucket.c \
> 	lib/token-bucket.h \
> 	lib/type-props.h \
> @@ -257,7 +263,8 @@ lib_libopenvswitch_la_SOURCES = \
> 	lib/vswitch-idl.c \
> 	lib/vswitch-idl.h \
> 	lib/vtep-idl.c \
> -	lib/vtep-idl.h
> +	lib/vtep-idl.h \
> +	lib/vxlan.h
> 
> if WIN32
> lib_libopenvswitch_la_SOURCES += \
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index a7c9c92..81600f0 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -2791,15 +2791,83 @@ dpif_netdev_register_upcall_cb(struct dpif *dpif, upcall_callback *cb,
> static void
> dp_netdev_drop_packets(struct dpif_packet ** packets, int cnt, bool may_steal)
> {
> -    int i;
> -
>     if (may_steal) {
> +        int i;
> +
>         for (i = 0; i < cnt; i++) {
>             dpif_packet_delete(packets[i]);
>         }
>     }
> }
> 
> +static int
> +odp_push_tunnel_action(const struct dp_netdev *dp,
> +                       const struct nlattr *attr,
> +                       struct dpif_packet **packets, int cnt)
> +{
> +    struct dp_netdev_port *tun_port;
> +    unsigned int left;
> +    const struct nlattr *a;
> +    const void *header = NULL;
> +    int header_len = 0;
> +    odp_port_t tunnel_oprt = 0;
> +    odp_port_t phy_oprt = 0;
> +    int i;
> +
> +    NL_NESTED_FOR_EACH(a, left, attr) {
> +        uint16_t type = nl_attr_type(a);
> +        size_t len = nl_attr_get_size(a);
> +
> +        switch (type) {
> +        case OVS_ACTION_TUNNEL_HEADER_PORT:
> +            tunnel_oprt = u32_to_odp(nl_attr_get_u32(a));
> +        break;
> +
> +        case OVS_ACTION_TUNNEL_HEADER_PHY_PORT:
> +            phy_oprt = u32_to_odp(nl_attr_get_u32(a));
> +        break;
> +
> +        case OVS_ACTION_TUNNEL_HEADER_DATA:
> +            header = nl_attr_get(a);
> +            header_len = len;
> +        break;
> +
> +        default:
> +            VLOG_ERR("Unknown tunnel push param.");
> +            return -EINVAL;
> +        }
> +    }
> +    if (!header_len || !tunnel_oprt || !phy_oprt) {
> +        VLOG_ERR("Insufficient tunnel push param.");
> +        return -EINVAL;
> +    }
> +
> +    tun_port = dp_netdev_lookup_port(dp, tunnel_oprt);
> +    if (!tun_port) {
> +        return -EINVAL;
> +    }
> +    netdev_push_header(tun_port->netdev, packets, cnt, header, header_len);
> +
> +    for (i = 0; i < cnt; i++) {
> +        packets[i]->md = PKT_METADATA_INITIALIZER(phy_oprt);

I see the need to clear out the metadata, as the inner packet is now hidden by encapsulation. However, I’m not sure if it is a good idea to overload the ‘in_port’ member for the output tunnel port, though.

Also, maybe this step should be performed by netdev_push_header(), as it is transforming the packets anyway.

> +    }
> +    return 0;
> +}
> +
> +static void
> +dp_netdev_clone_pkt_batch(struct dpif_packet **tnl_pkt,
> +                          struct dpif_packet **packets, int cnt)
> +{
> +    int i;
> +
> +    for (i = 0; i < cnt; i++) {
> +        struct dpif_packet *inner_pkt;
> +
> +         inner_pkt = dpif_packet_clone(packets[i]);
> +         tnl_pkt[i] = inner_pkt;
> +    }
> +}
> +
> static void
> dp_execute_cb(void *aux_, struct dpif_packet **packets, int cnt,
>               const struct nlattr *a, bool may_steal)
> @@ -2822,6 +2890,59 @@ dp_execute_cb(void *aux_, struct dpif_packet **packets, int cnt,
>         }
>         break;
> 
> +    case OVS_ACTION_ATTR_TUNNEL_PUSH:
> +        if (*depth < MAX_RECIRC_DEPTH) {
> +            struct dpif_packet *tnl_pkt[NETDEV_MAX_RX_BATCH];
> +            int err;
> +
> +            if (may_steal) {
> +                dp_netdev_clone_pkt_batch(tnl_pkt, packets, cnt);
> +                packets = tnl_pkt;
> +            }

Should this be the reverse? Clone if can NOT take the packets?

> +
> +            err = odp_push_tunnel_action(dp, a, packets, cnt);
> +            if (err) {
> +                dp_netdev_drop_packets(tnl_pkt, cnt, may_steal);
> +                break;
> +            }
> +
> +            (*depth)++;
> +            dp_netdev_input(pmd, packets, cnt);
> +            (*depth)--;
> +            return;
> +        }

Should “break” here.

> +
> +    case OVS_ACTION_ATTR_TUNNEL_POP:
> +        if (*depth >= MAX_RECIRC_DEPTH) {
> +            break;
> +        }
> +
> +        p = dp_netdev_lookup_port(dp, u32_to_odp(nl_attr_get_u32(a)));
> +        if (p) {
> +            struct dpif_packet *tnl_pkt[NETDEV_MAX_RX_BATCH];
> +            int err;
> +
> +            if (may_steal) {
> +                dp_netdev_clone_pkt_batch(tnl_pkt, packets, cnt);
> +                packets = tnl_pkt;
> +            }

Ditto.

> +
> +            err = netdev_pop_header(p->netdev, packets, cnt);
> +            if (!err) {
> +
> +                for (i = 0; i < cnt; i++) {
> +                    packets[i]->md.in_port.odp_port = u32_to_odp(nl_attr_get_u32(a));

Maybe use the same port already taken from nl_attr above?

> +                }
> +
> +                (*depth)++;
> +                dp_netdev_input(pmd, packets, cnt);
> +                (*depth)--;
> +                return;
> +            }
> +            dp_netdev_drop_packets(tnl_pkt, cnt, may_steal);
> +        }
> +        break;
> +
>     case OVS_ACTION_ATTR_USERSPACE:
>         if (!fat_rwlock_tryrdlock(&dp->upcall_rwlock)) {
>             const struct nlattr *userdata;
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 34d8c13..669ca3f 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -41,6 +41,8 @@
> #include "shash.h"
> #include "sset.h"
> #include "timeval.h"
> +#include "tnl-arp-cache.h"
> +#include "tnl-ports.h"
> #include "util.h"
> #include "valgrind.h"
> #include "vlog.h"
> @@ -72,6 +74,9 @@ struct registered_dpif_class {
> static struct shash dpif_classes = SHASH_INITIALIZER(&dpif_classes);
> static struct sset dpif_blacklist = SSET_INITIALIZER(&dpif_blacklist);
> 
> +/* User configured value, by default off. */
> +bool userspace_tunneling = false;
> +

I’d make this a datapath feature we probe with a flow set-up. Then use the feature whenever it is available.

> /* Protects 'dpif_classes', including the refcount, and 'dpif_blacklist'. */
> static struct ovs_mutex dpif_mutex = OVS_MUTEX_INITIALIZER;
> 
> @@ -114,6 +119,8 @@ dp_initialize(void)
>         }
>         dpctl_unixctl_register();
>         ovsthread_once_done(&once);
> +        tnl_port_init();
> +        tnl_arp_cache_init();
>     }
> }
> 
> @@ -408,6 +415,7 @@ dpif_run(struct dpif *dpif)
>     if (dpif->dpif_class->run) {
>         dpif->dpif_class->run(dpif);
>     }
> +    tnl_arp_cache_run();
> }
> 
> /* Arranges for poll_block() to wake up when dp_run() needs to be called for
> @@ -1002,6 +1010,8 @@ dpif_execute_helper_cb(void *aux_, struct dpif_packet **packets, int cnt,
> 
>     switch ((enum ovs_action_attr)type) {
>     case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_TUNNEL_PUSH:
> +    case OVS_ACTION_ATTR_TUNNEL_POP:
>     case OVS_ACTION_ATTR_USERSPACE:
>     case OVS_ACTION_ATTR_RECIRC: {
>         struct dpif_execute execute;
> @@ -1591,3 +1601,18 @@ log_flow_get_message(const struct dpif *dpif, const struct dpif_flow_get *get,
>                          get->flow->actions, get->flow->actions_len);
>     }
> }
> +
> +void dpif_enable_userspace_tunneling(void)
> +{
> +    userspace_tunneling = true;
> +}
> +
> +bool dpif_is_userspace_tunneling_on(void)
> +{
> +    return userspace_tunneling;
> +}
> +
> +bool dpif_support_userspace_tunneling(const struct dpif *dpif)
> +{
> +   return !strcmp(dpif->dpif_class->type, "netdev");
> +}

All these would not be necessary if ofproto-dpif probes for the new actions, and each datapath then either supports them or not.

> diff --git a/lib/dpif.h b/lib/dpif.h
> index f88fa78..7402a85 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -790,6 +790,10 @@ void dpif_get_netflow_ids(const struct dpif *,
> int dpif_queue_to_priority(const struct dpif *, uint32_t queue_id,
>                            uint32_t *priority);
> 
> +void dpif_enable_userspace_tunneling(void);
> +bool dpif_is_userspace_tunneling_on(void);
> +bool dpif_support_userspace_tunneling(const struct dpif *);
> +
> #ifdef  __cplusplus
> }
> #endif
> diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c
> index 1bd9108..473c0f6 100644
> --- a/lib/netdev-bsd.c
> +++ b/lib/netdev-bsd.c
> @@ -1562,6 +1562,9 @@ netdev_bsd_update_flags(struct netdev *netdev_, enum netdev_flags off,
>     NULL, /* get_config */                           \
>     NULL, /* set_config */                           \
>     NULL, /* get_tunnel_config */                    \
> +    NULL, /* build header */                         \
> +    NULL, /* push header */                          \
> +    NULL, /* pop header */                           \
>     NULL, /* get_numa_id */                          \
>     NULL, /* set_multiq */                           \
>                                                      \
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 9c93768..c8610d7 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -1392,6 +1392,9 @@ unlock_dpdk:
>     netdev_dpdk_get_config,                                   \
>     NULL,                       /* netdev_dpdk_set_config */  \
>     NULL,                       /* get_tunnel_config */       \
> +    NULL, /* build header */                                  \
> +    NULL, /* push header */                                   \
> +    NULL, /* pop header */                                    \
>     netdev_dpdk_get_numa_id,    /* get_numa_id */             \
>     MULTIQ,                     /* set_multiq */              \
>                                                               \
> diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
> index a2b1f2c..22d980a 100644
> --- a/lib/netdev-dummy.c
> +++ b/lib/netdev-dummy.c
> @@ -1032,6 +1032,9 @@ static const struct netdev_class dummy_class = {
>     netdev_dummy_get_config,
>     netdev_dummy_set_config,
>     NULL,                       /* get_tunnel_config */
> +    NULL,                       /* build header */
> +    NULL,                       /* push header */
> +    NULL,                       /* pop header */
>     NULL,                       /* get_numa_id */
>     NULL,                       /* set_multiq */
> 
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index ab5ef7d..e89063a 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -2715,6 +2715,9 @@ netdev_linux_update_flags(struct netdev *netdev_, enum netdev_flags off,
>     NULL,                       /* get_config */                \
>     NULL,                       /* set_config */                \
>     NULL,                       /* get_tunnel_config */         \
> +    NULL,                       /* build header */              \
> +    NULL,                       /* push header */               \
> +    NULL,                       /* pop header */                \
>     NULL,                       /* get_numa_id */               \
>     NULL,                       /* set_multiq */                \
>                                                                 \
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index 0e4cda5..50b63a6 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -254,6 +254,15 @@ struct netdev_class {
>     const struct netdev_tunnel_config *
>         (*get_tunnel_config)(const struct netdev *netdev);
> 
> +    void (*build_header)(const struct netdev *, struct ofpbuf *);
> +
> +    int (*push_header)(const struct netdev *netdev,
> +                       struct dpif_packet **buffers, int cnt,
> +                       const void *header, int size);
> +
> +    int  (*pop_header)(struct netdev *netdev,
> +                       struct dpif_packet **buffers, int cnt);
> +
>     /* Returns the id of the numa node the 'netdev' is on.  If there is no
>      * such info, returns NETDEV_NUMA_UNSPEC. */
>     int (*get_numa_id)(const struct netdev *netdev);
> diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
> index 83f1296..33ff49b 100644
> --- a/lib/netdev-vport.c
> +++ b/lib/netdev-vport.c
> @@ -25,6 +25,7 @@
> #include <sys/ioctl.h>
> 
> #include "byte-order.h"
> +#include "csum.h"
> #include "daemon.h"
> #include "dirs.h"
> #include "dpif.h"
> @@ -32,13 +33,17 @@
> #include "hmap.h"
> #include "list.h"
> #include "netdev-provider.h"
> +#include "vxlan.h"
> #include "ofpbuf.h"
> #include "packets.h"
> +#include "packet-dpif.h"
> #include "poll-loop.h"
> #include "route-table.h"
> #include "shash.h"
> #include "socket-util.h"
> #include "vlog.h"
> +#include "unaligned.h"
> +#include "util.h"
> 
> VLOG_DEFINE_THIS_MODULE(netdev_vport);
> 
> @@ -221,10 +226,24 @@ netdev_vport_alloc(void)
> static int
> netdev_vport_construct(struct netdev *netdev_)
> {
> -    struct netdev_vport *netdev = netdev_vport_cast(netdev_);
> +    struct netdev_vport *dev = netdev_vport_cast(netdev_);
> +    const char *type = netdev_get_type(netdev_);
> +
> +    ovs_mutex_init(&dev->mutex);
> +    eth_addr_random(dev->etheraddr);
> +
> +    /* Add a default destination port for tunnel ports if none specified. */
> +    if (!strcmp(type, "geneve")) {
> +        dev->tnl_cfg.dst_port = htons(GENEVE_DST_PORT);
> +    }
> +
> +    if (!strcmp(type, "vxlan")) {
> +        dev->tnl_cfg.dst_port = htons(VXLAN_DST_PORT);
> +    }
> 
> -    ovs_mutex_init(&netdev->mutex);
> -    eth_addr_random(netdev->etheraddr);
> +    if (!strcmp(type, "lisp")) {
> +        dev->tnl_cfg.dst_port = htons(LISP_DST_PORT);
> +    }
> 
>     route_table_register();
> 
> @@ -761,9 +780,134 @@ get_stats(const struct netdev *netdev, struct netdev_stats *stats)
> 
>     return 0;
> }
> +
> +static bool
> +vxlan_extract_md(struct dpif_packet *dpif_pkt)
> +{
> +    struct ofpbuf *packet = &dpif_pkt->ofpbuf;
> +    struct pkt_metadata *md = &dpif_pkt->md;
> +    struct flow_tnl *tnl = &md->tunnel;
> +    struct ip_header *nh;
> +    struct udp_header *udp;
> +    struct vxlanhdr *vxh;
> +
> +    nh = ofpbuf_l3(packet);
> +    udp = ofpbuf_l4(packet);
> +
> +    memset(md, 0, sizeof(*md));
> +
> +    if (!nh || !udp) {
> +        return false;
> +    }
> +    if (((const char *) ofpbuf_tail(packet) - (const char *) udp) <
> +        (UDP_HEADER_LEN + sizeof (*vxh))) {
> +        return false;
> +    }
> +    vxh = (struct vxlanhdr *) ((const char *)udp + sizeof(*udp));
> +
> +    if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
> +       (vxh->vx_vni & htonl(0xff))) {
> +        VLOG_ERR("invalid vxlan flags=%#x vni=%#x\n",
> +                 ntohl(vxh->vx_flags), ntohl(vxh->vx_vni));
> +        return false;
> +    }
> +
> +    tnl->tun_id = vxh->vx_vni;
> +    tnl->ip_src = get_16aligned_be32(&nh->ip_src);
> +    tnl->ip_dst = get_16aligned_be32(&nh->ip_dst);
> +    tnl->ip_tos = nh->ip_tos;
> +    tnl->ip_ttl = nh->ip_ttl;
> +
> +    ofpbuf_reset_packet(packet, ((const char *) (vxh + 1) - ((const char *) packet->frame)));
> +    return true;
> +}
> +
> +static int
> +netdev_vxlan_pop_header(struct netdev *netdev_ OVS_UNUSED,
> +                        struct dpif_packet **pkt, int cnt)
> +{
> +    int i;
> +
> +    for (i = 0; i < cnt; i++) {
> +        vxlan_extract_md(pkt[i]);
> +    }
> +    return 0;
> +}
> +
> +static void
> +netdev_vxlan_build_header(const struct netdev *netdev, struct ofpbuf *header)
> +{
> +    struct netdev_vport *dev = netdev_vport_cast(netdev);
> +    struct netdev_tunnel_config *tnl_cfg;
> +    const struct eth_header *eth;
> +    struct ip_header *ip;
> +    struct udp_header *udp;
> +    struct vxlanhdr *vxh;
> +
> +    ovs_mutex_lock(&dev->mutex);
> +    tnl_cfg = &dev->tnl_cfg;

We ought to make tnl_cfg RCU to get rid of the mutex here (and likely elsewhere as well). Just a note, I think it would be better to do this in a separate patch anyway.

> +
> +    eth = (const struct eth_header *) ofpbuf_data(header);
> +
> +    ip = (struct ip_header *) ((const char *)eth + sizeof *eth);
> +    ip->ip_proto = IPPROTO_UDP;
> +
> +    udp = ofpbuf_put_zeros(header, sizeof *udp);
> +    /* Set src port in push tunnel. */
> +    udp->udp_src = 44444;

If push sets udp_src, why set it here at all?

> +    udp->udp_dst = tnl_cfg->dst_port;
> +    /* udp_csum is zero */

Would it not be the push that should be responsible for setting the checksum, if we used one?

> +
> +    vxh = ofpbuf_put_zeros(header, sizeof *vxh);

Could use ofpbuf_put_uninit() here instead...

> +    vxh->vx_flags = htonl(VXLAN_FLAGS);
> +    vxh->vx_vni = tnl_cfg->out_key & htonl(0xff);
> +

… as we are setting all the header values here.

> +    ovs_mutex_unlock(&dev->mutex);
> +}
> +
> +static void
> +netdev_vxlan_push_header__(struct ofpbuf *packet,
> +                           const void *header, int size)
> +{
> +    struct eth_header *eth;
> +    struct ip_header *ip;
> +    struct udp_header *udp;
> +    struct vxlanhdr *vxh;
> +    int hdr_size;

I’d call this “l2_size”

> +    int tot_size;

and this “ip_tot_size”, would make code easier to follow.

> +
> +    eth = ofpbuf_push_uninit(packet, size);
> +    hdr_size = sizeof *ip + sizeof *udp + sizeof *vxh;

l2_size = size - (sizeof *ip + sizeof *udp + sizeof *vxh);

> +    tot_size  = ofpbuf_size(packet) - size + hdr_size;

ip_tot_size = ofpbuf_size(packet) - l2_size;

> +
> +    memcpy(eth, header, size);
> +    ip = (struct ip_header *) ((const char *)eth + sizeof *eth);

Use this instead, to be better compatible with VLANs:

ip = (struct ip_header *) ((const char *)eth + l2_size);

> +    ip->ip_tot_len = htons(tot_size);
> +    ip->ip_csum = csum(ip, sizeof *ip);

Could we precompute the ip_csum for zero ip_tot_len and then update the checksum here based on the real value?

> +
> +    udp = (struct udp_header *) (ip + 1);
> +    /* set udp src port*/
> +    udp->udp_len = htons(ntohs(ip->ip_tot_len) - sizeof *ip);

IMO we should also set the UDP checksum (to zero) here.

> +}
> +
> +static int
> +netdev_vxlan_push_header(const struct netdev *netdev OVS_UNUSED,
> +                         struct dpif_packet **buffers, int cnt,

‘packets' would be a better name for the ‘buffers’ argument.

> +                         const void *header, int size)
> +{
> +    int i;
> +
> +    for (i = 0; i < cnt; i++) {
> +        netdev_vxlan_push_header__(&buffers[i]->ofpbuf, header, size);
> +    }
> +    return 0;
> +}
> +
> 
> #define VPORT_FUNCTIONS(GET_CONFIG, SET_CONFIG,             \
> -                        GET_TUNNEL_CONFIG, GET_STATUS)      \
> +                        GET_TUNNEL_CONFIG, GET_STATUS,      \
> +                        BUILD_HEADER,                       \
> +                        PUSH_HEADER, POP_HEADER)            \
>     NULL,                                                   \
>     netdev_vport_run,                                       \
>     netdev_vport_wait,                                      \
> @@ -775,6 +919,9 @@ get_stats(const struct netdev *netdev, struct netdev_stats *stats)
>     GET_CONFIG,                                             \
>     SET_CONFIG,                                             \
>     GET_TUNNEL_CONFIG,                                      \
> +    BUILD_HEADER,                                           \
> +    PUSH_HEADER,                                            \
> +    POP_HEADER,                                             \
>     NULL,                       /* get_numa_id */           \
>     NULL,                       /* set_multiq */            \
>                                                             \
> @@ -826,12 +973,14 @@ get_stats(const struct netdev *netdev, struct netdev_stats *stats)
>     NULL,                   /* rx_wait */                   \
>     NULL,                   /* rx_drain */
> 
> -#define TUNNEL_CLASS(NAME, DPIF_PORT)                       \
> -    { DPIF_PORT,                                            \
> -        { NAME, VPORT_FUNCTIONS(get_tunnel_config,          \
> -                                set_tunnel_config,          \
> -                                get_netdev_tunnel_config,   \
> -                                tunnel_get_status) }}
> +
> +#define TUNNEL_CLASS(NAME, DPIF_PORT, BUILD_HEADER, PUSH_HEADER, POP_HEADER)   \
> +    { DPIF_PORT,                                                               \
> +        { NAME, VPORT_FUNCTIONS(get_tunnel_config,                             \
> +                                set_tunnel_config,                             \
> +                                get_netdev_tunnel_config,                      \
> +                                tunnel_get_status,                             \
> +                                BUILD_HEADER, PUSH_HEADER, POP_HEADER) }}
> 
> void
> netdev_vport_tunnel_register(void)
> @@ -839,13 +988,14 @@ netdev_vport_tunnel_register(void)
>     /* The name of the dpif_port should be short enough to accomodate adding
>      * a port number to the end if one is necessary. */
>     static const struct vport_class vport_classes[] = {
> -        TUNNEL_CLASS("geneve", "genev_sys"),
> -        TUNNEL_CLASS("gre", "gre_sys"),
> -        TUNNEL_CLASS("ipsec_gre", "gre_sys"),
> -        TUNNEL_CLASS("gre64", "gre64_sys"),
> -        TUNNEL_CLASS("ipsec_gre64", "gre64_sys"),
> -        TUNNEL_CLASS("vxlan", "vxlan_sys"),
> -        TUNNEL_CLASS("lisp", "lisp_sys”)

Add line for geneve here

> +        TUNNEL_CLASS("gre", "gre_sys", NULL, NULL, NULL),
> +        TUNNEL_CLASS("ipsec_gre", "gre_sys", NULL, NULL, NULL),
> +        TUNNEL_CLASS("gre64", "gre64_sys", NULL,  NULL, NULL),
> +        TUNNEL_CLASS("ipsec_gre64", "gre64_sys", NULL, NULL, NULL),
> +        TUNNEL_CLASS("vxlan", "vxlan_sys", netdev_vxlan_build_header,
> +                                           netdev_vxlan_push_header,
> +                                           netdev_vxlan_pop_header),
> +        TUNNEL_CLASS("lisp", "lisp_sys", NULL, NULL, NULL)
>     };
>     static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> 
> @@ -867,6 +1017,6 @@ netdev_vport_patch_register(void)
>             { "patch", VPORT_FUNCTIONS(get_patch_config,
>                                        set_patch_config,
>                                        NULL,
> -                                       NULL) }};
> +                                       NULL, NULL, NULL, NULL) }};
>     netdev_register_provider(&patch_class.netdev_class);
> }
> diff --git a/lib/netdev.c b/lib/netdev.c
> index a9ad7d1..970fafc 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -724,6 +724,40 @@ netdev_send(struct netdev *netdev, int qid, struct dpif_packet **buffers,
>     return error;
> }
> 
> +int
> +netdev_pop_header(struct netdev *netdev, struct dpif_packet **buffers, int cnt)
> +{
> +    int error;
> +
> +    error = (netdev->netdev_class->pop_header
> +             ? netdev->netdev_class->pop_header(netdev, buffers, cnt)
> +             : EOPNOTSUPP);
> +    if (!error) {
> +        COVERAGE_INC(netdev_sent);

Is this coverage intentional?

> +    }
> +    return error;
> +}
> +
> +void
> +netdev_build_header(const struct netdev *netdev, struct ofpbuf *header)
> +{
> +    if (netdev->netdev_class->build_header) {
> +        netdev->netdev_class->build_header(netdev, header);
> +    }
> +}
> +
> +int
> +netdev_push_header(const struct netdev *netdev,
> +                   struct dpif_packet **buffers, int cnt,
> +                   const void *header, int size)
> +{
> +    if (netdev->netdev_class->push_header) {
> +        return netdev->netdev_class->push_header(netdev, buffers, cnt, header, size);
> +    } else {
> +        return -EINVAL;
> +    }
> +}
> +
> /* Registers with the poll loop to wake up from the next call to poll_block()
>  * when the packet transmission queue has sufficient room to transmit a packet
>  * with netdev_send().
> diff --git a/lib/netdev.h b/lib/netdev.h
> index fc4180a..3af0665 100644
> --- a/lib/netdev.h
> +++ b/lib/netdev.h
> @@ -180,6 +180,13 @@ int netdev_send(struct netdev *, int qid, struct dpif_packet **, int cnt,
>                 bool may_steal);
> void netdev_send_wait(struct netdev *, int qid);
> 
> +void netdev_build_header(const struct netdev *, struct ofpbuf *);
> +int netdev_push_header(const struct netdev *netdev,
> +                       struct dpif_packet **buffers, int cnt,
> +                       const void *header, int size);
> +int netdev_pop_header(struct netdev *netdev, struct dpif_packet **buffers,
> +                      int cnt);
> +
> /* Hardware address. */
> int netdev_set_etheraddr(struct netdev *, const uint8_t mac[6]);
> int netdev_get_etheraddr(const struct netdev *, uint8_t mac[6]);
> diff --git a/lib/odp-execute.c b/lib/odp-execute.c
> index b179cf2..861257c 100644
> --- a/lib/odp-execute.c
> +++ b/lib/odp-execute.c
> @@ -412,6 +412,8 @@ odp_execute_actions(void *dp, struct dpif_packet **packets, int cnt, bool steal,
>         switch ((enum ovs_action_attr) type) {
>             /* These only make sense in the context of a datapath. */
>         case OVS_ACTION_ATTR_OUTPUT:
> +        case OVS_ACTION_ATTR_TUNNEL_PUSH:
> +        case OVS_ACTION_ATTR_TUNNEL_POP:
>         case OVS_ACTION_ATTR_USERSPACE:
>         case OVS_ACTION_ATTR_RECIRC:
>             if (dp_execute_action) {
> diff --git a/lib/odp-util.c b/lib/odp-util.c
> index b9e1e21..20d6146 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -30,10 +30,12 @@
> #include "dynamic-string.h"
> #include "flow.h"
> #include "netlink.h"
> +#include "vxlan.h"
> #include "ofpbuf.h"
> #include "packets.h"
> #include "simap.h"
> #include "timeval.h"
> +#include "unaligned.h"
> #include "util.h"
> #include "vlog.h"
> 
> @@ -74,6 +76,8 @@ odp_action_len(uint16_t type)
> 
>     switch ((enum ovs_action_attr) type) {
>     case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
> +    case OVS_ACTION_ATTR_TUNNEL_PUSH: return -2;
> +    case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
>     case OVS_ACTION_ATTR_USERSPACE: return -2;
>     case OVS_ACTION_ATTR_PUSH_VLAN: return sizeof(struct ovs_action_push_vlan);
>     case OVS_ACTION_ATTR_POP_VLAN: return 0;
> @@ -507,6 +511,86 @@ format_odp_hash_action(struct ds *ds, const struct ovs_action_hash *hash_act)
> }
> 
> static void
> +format_odp_tunnel_vxlan_header(struct ds *ds, const struct nlattr *a)
> +{
> +    const struct eth_header *eth;
> +    const struct ip_header *ip;
> +    const struct udp_header *udp;
> +    const struct vxlanhdr *vxh;
> +
> +    /* TODO: handle other tunnel data. */
> +    eth = nl_attr_get(a);
> +    ip = (const struct ip_header *)   ((const char *)eth + sizeof(*eth));

Need to parse VLANs here?

> +    udp = (const struct udp_header *) ((const char *)ip  + sizeof(*ip));
> +    vxh = (const struct vxlanhdr *)   ((const char *)udp + sizeof(*udp));
> +
> +    /* Ehernet */

“Ethernet”

> +    ds_put_format(ds, "header(eth(src=");
> +    ds_put_format(ds, ETH_ADDR_FMT, ETH_ADDR_ARGS(eth->eth_src));
> +    ds_put_format(ds, ",dst=");
> +    ds_put_format(ds, ETH_ADDR_FMT, ETH_ADDR_ARGS(eth->eth_dst));
> +    ds_put_format(ds, ",dl_type=0x%04"PRIx16"),", ntohs(eth->eth_type));
> +
> +    /* IPv4 */
> +    ds_put_format(ds, "ipv4(src="IP_FMT",dst="IP_FMT",proto=%"PRIu8
> +                  ",tos=%#"PRIx8",ttl=%"PRIu8",frag=%"PRIx16"),",
> +                  IP_ARGS(get_16aligned_be32(&ip->ip_src)),
> +                  IP_ARGS(get_16aligned_be32(&ip->ip_dst)),
> +                  ip->ip_proto, ip->ip_tos,
> +                  ip->ip_ttl,
> +                  ip->ip_frag_off);
> +
> +    /* UDP */
> +    ds_put_format(ds, "udp(src=%"PRIu16",dst=%"PRIu16"),",
> +                  ntohs(udp->udp_src), ntohs(udp->udp_dst));
> +
> +    /* VxLan */
> +    ds_put_format(ds, "vxlan(flags=%"PRIx32",vni=%"PRIx32")",
> +                  vxh->vx_flags, vxh->vx_vni);
> +
> +    ds_put_format(ds, ")");
> +}
> +
> +static void
> +format_odp_tunnel_push_action(struct ds *ds, const struct nlattr *attrs,
> +                              size_t len)
> +{
> +    enum ovs_tunnel_action_attr type;
> +    const struct nlattr *a;
> +    unsigned int left;
> +
> +    ds_put_format(ds, "tunnel_push(");
> +    NL_ATTR_FOR_EACH (a, left, attrs, len) {
> +        if (a != attrs) {
> +            ds_put_cstr(ds, ",");
> +        }
> +
> +        type = nl_attr_type(a);
> +        switch(type) {
> +            case OVS_ACTION_TUNNEL_HEADER_PORT:
> +                ds_put_format(ds, "tnl_port(%"PRIu32, nl_attr_get_u32(a));
> +                ds_put_cstr(ds, ")");
> +            break;
> +
> +            case OVS_ACTION_TUNNEL_HEADER_PHY_PORT:
> +                ds_put_format(ds, "phy_port(%"PRIu32, nl_attr_get_u32(a));
> +                ds_put_cstr(ds, ")");
> +            break;
> +
> +            case OVS_ACTION_TUNNEL_HEADER_DATA:
> +                format_odp_tunnel_vxlan_header(ds, a);
> +            break;
> +
> +            default:
> +                ds_put_cstr(ds, "<error>");
> +            break;
> +        }
> +    }
> +    ds_put_format(ds, ")");
> +
> +}
> +
> +static void
> format_odp_action(struct ds *ds, const struct nlattr *a)
> {
>     int expected_len;
> @@ -526,6 +610,13 @@ format_odp_action(struct ds *ds, const struct nlattr *a)
>     case OVS_ACTION_ATTR_OUTPUT:
>         ds_put_format(ds, "%"PRIu32, nl_attr_get_u32(a));
>         break;
> +    case OVS_ACTION_ATTR_TUNNEL_POP:
> +        ds_put_format(ds, "tnl_pop(%"PRIu32, nl_attr_get_u32(a));
> +        ds_put_cstr(ds, ")");
> +        break;
> +    case OVS_ACTION_ATTR_TUNNEL_PUSH:
> +        format_odp_tunnel_push_action(ds, nl_attr_get(a), nl_attr_get_size(a));
> +        break;
>     case OVS_ACTION_ATTR_USERSPACE:
>         format_odp_userspace_action(ds, a);
>         break;
> @@ -748,6 +839,16 @@ parse_odp_action(const char *s, const struct simap *port_names,
>         uint32_t port;
>         int n;
> 
> +        if (ovs_scan(s, "tnl_queue(%"SCNi32")%n", &port, &n)) {
> +            nl_msg_put_u32(actions, OVS_ACTION_ATTR_OUTPUT, port);
> +            return n;
> +        }

What is this?

> +    }
> +
> +    {
> +        uint32_t port;
> +        int n;
> +
>         if (ovs_scan(s, "%"SCNi32"%n", &port, &n)) {
>             nl_msg_put_u32(actions, OVS_ACTION_ATTR_OUTPUT, port);
>             return n;
> @@ -3543,6 +3644,23 @@ odp_put_tunnel_action(const struct flow_tnl *tunnel,
>     tun_key_to_attr(odp_actions, tunnel);
>     nl_msg_end_nested(odp_actions, offset);
> }
> +
> +void
> +odp_put_build_tunnel_header_action(struct ofpbuf *odp_actions, struct ofpbuf *header,
> +                                   odp_port_t tunnel_port, odp_port_t phy_port)

Should be called “odp_put_tunnel_push_action” instead?

> +{
> +    size_t offset = nl_msg_start_nested(odp_actions, OVS_ACTION_ATTR_TUNNEL_PUSH);
> +
> +    nl_msg_put_odp_port(odp_actions, OVS_ACTION_TUNNEL_HEADER_PORT,
> +                        tunnel_port);
> +    nl_msg_put_unspec(odp_actions, OVS_ACTION_TUNNEL_HEADER_DATA,
> +                      ofpbuf_data(header), ofpbuf_size(header));
> +    nl_msg_put_odp_port(odp_actions, OVS_ACTION_TUNNEL_HEADER_PHY_PORT,
> +                        phy_port);
> +
> +    nl_msg_end_nested(odp_actions, offset);
> +}
> +
> 
> /* The commit_odp_actions() function and its helpers. */
> 
> diff --git a/lib/odp-util.h b/lib/odp-util.h
> index 11b54dd..64e7dc3 100644
> --- a/lib/odp-util.h
> +++ b/lib/odp-util.h
> @@ -253,5 +253,9 @@ size_t odp_put_userspace_action(uint32_t pid,
>                                 struct ofpbuf *odp_actions);
> void odp_put_tunnel_action(const struct flow_tnl *tunnel,
>                            struct ofpbuf *odp_actions);
> +void odp_put_build_tunnel_header_action(struct ofpbuf *odp_actions,
> +                                        struct ofpbuf *header,
> +                                        odp_port_t tunnel_odp_port,
> +                                        odp_port_t phy_port);
> 
> #endif /* odp-util.h */
> diff --git a/lib/ofpbuf.h b/lib/ofpbuf.h
> index adaf526..72c254f 100644
> --- a/lib/ofpbuf.h
> +++ b/lib/ofpbuf.h
> @@ -422,6 +422,14 @@ static inline void ofpbuf_set_size(struct ofpbuf *b, uint32_t v)
> }
> #endif
> 
> +static inline void ofpbuf_reset_packet(struct ofpbuf *b, int off)
> +{
> +    ofpbuf_set_size(b, ofpbuf_size(b) - off);
> +    ofpbuf_set_data(b, (void *) ((unsigned char *) b->frame + off));
> +    b->frame = NULL;
> +    b->l2_5_ofs = b->l3_ofs = b->l4_ofs = UINT16_MAX;
> +}
> +
> #ifdef  __cplusplus
> }
> #endif
> diff --git a/lib/packets.c b/lib/packets.c
> index 65d8109..a0cfadc 100644
> --- a/lib/packets.c
> +++ b/lib/packets.c
> @@ -956,3 +956,38 @@ packet_format_tcp_flags(struct ds *s, uint16_t tcp_flags)
>         ds_put_cstr(s, "[800]");
>     }
> }
> +
> +#define ARP_PACKET_SIZE  (2 + ETH_HEADER_LEN + VLAN_HEADER_LEN + \
> +                          ARP_ETH_HEADER_LEN)
> +
> +void
> +compose_arp(struct ofpbuf *b, const uint8_t eth_src[ETH_ADDR_LEN],
> +            ovs_be32 ip_src, ovs_be32 ip_dst)
> +{
> +    struct eth_header *eth;
> +    struct arp_eth_header *arp;
> +
> +    ofpbuf_clear(b);
> +    ofpbuf_prealloc_tailroom(b, ARP_PACKET_SIZE);
> +    ofpbuf_reserve(b, 2 + VLAN_HEADER_LEN);
> +
> +    eth = ofpbuf_put_uninit(b, sizeof *eth);
> +    memcpy(eth->eth_dst, eth_addr_broadcast, ETH_ADDR_LEN);
> +    memcpy(eth->eth_src, eth_src, ETH_ADDR_LEN);
> +    eth->eth_type = htons(ETH_TYPE_ARP);
> +
> +    arp = ofpbuf_put_uninit(b, sizeof *arp);
> +    arp->ar_hrd = htons(ARP_HRD_ETHERNET);
> +    arp->ar_pro = htons(ARP_PRO_IP);
> +    arp->ar_hln = sizeof arp->ar_sha;
> +    arp->ar_pln = sizeof arp->ar_spa;
> +    arp->ar_op = htons(ARP_OP_REQUEST);
> +    memcpy(arp->ar_sha, eth_src, ETH_ADDR_LEN);
> +    memcpy(arp->ar_tha, eth_src, ETH_ADDR_LEN);

I think the tha should be set to all-zeroes instead. RFC 826 says:

“It does not set ar$tha to anything in particular, because it is this value that it is trying to determine.”

> +
> +    put_16aligned_be32(&arp->ar_spa, ip_src);
> +    put_16aligned_be32(&arp->ar_tpa, ip_dst);
> +
> +    ofpbuf_set_frame(b, eth);
> +    ofpbuf_set_l3(b, arp);
> +}
> diff --git a/lib/packets.h b/lib/packets.h
> index 26c6ff1..b24015e 100644
> --- a/lib/packets.h
> +++ b/lib/packets.h
> @@ -486,6 +486,7 @@ void ip_format_masked(ovs_be32 ip, ovs_be32 mask, struct ds *);
>         ((ip_frag_off) & htons(IP_MORE_FRAGMENTS | IP_FRAG_OFF_MASK))
> 
> #define IP_HEADER_LEN 20
> +OVS_PACKED(
> struct ip_header {
>     uint8_t ip_ihl_ver;
>     uint8_t ip_tos;
> @@ -497,7 +498,8 @@ struct ip_header {
>     ovs_be16 ip_csum;
>     ovs_16aligned_be32 ip_src;
>     ovs_16aligned_be32 ip_dst;
> -};
> +});
> +

I’d rather not pack this, if possible.

> BUILD_ASSERT_DECL(IP_HEADER_LEN == sizeof(struct ip_header));
> 
> #define ICMP_HEADER_LEN 8
> @@ -545,12 +547,13 @@ struct sctp_header {
> BUILD_ASSERT_DECL(SCTP_HEADER_LEN == sizeof(struct sctp_header));
> 
> #define UDP_HEADER_LEN 8
> +OVS_PACKED(
> struct udp_header {
>     ovs_be16 udp_src;
>     ovs_be16 udp_dst;
>     ovs_be16 udp_len;
>     ovs_be16 udp_csum;
> -};
> +});

Same here.

> BUILD_ASSERT_DECL(UDP_HEADER_LEN == sizeof(struct udp_header));
> 
> #define TCP_FIN 0x001
> @@ -735,5 +738,7 @@ void packet_set_sctp_port(struct ofpbuf *, ovs_be16 src, ovs_be16 dst);
> 
> void packet_format_tcp_flags(struct ds *, uint16_t);
> const char *packet_tcp_flag_to_string(uint32_t flag);
> +void compose_arp(struct ofpbuf *b, const uint8_t eth_src[ETH_ADDR_LEN],
> +                 ovs_be32 ip_src, ovs_be32 ip_dst);
> 
> #endif /* packets.h */
> diff --git a/lib/tnl-arp-cache.c b/lib/tnl-arp-cache.c
> new file mode 100644
> index 0000000..2327452
> --- /dev/null
> +++ b/lib/tnl-arp-cache.c
> @@ -0,0 +1,214 @@
> +/*
> + * Copyright (c) 2014 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +#include <inttypes.h>
> +#include <stdlib.h>
> +
> +#include "bitmap.h"
> +#include "coverage.h"
> +#include "dpif-netdev.h"
> +#include "dynamic-string.h"
> +#include "errno.h"
> +#include "flow.h"
> +#include "hash.h"
> +#include "hmap.h"
> +#include "list.h"
> +#include "netdev.h"
> +#include "ovs-thread.h"
> +#include "packets.h"
> +#include "packet-dpif.h"
> +#include "poll-loop.h"
> +#include "timeval.h"
> +#include "tnl-arp-cache.h"
> +#include "unaligned.h"
> +#include "unixctl.h"
> +#include "util.h"
> +#include "vlog.h"
> +
> +VLOG_DEFINE_THIS_MODULE(arp);
> +
> +#define ARP_ENTRY_DEFAULT_IDLE_TIME  30000
> +
> +struct tnl_arp_entry {
> +    struct hmap_node hmap_node;

IMO it would be better to use cmap, that way the lookups could be lockless.

> +    ovs_be32 ip;
> +    uint8_t mac[ETH_ADDR_LEN];
> +    time_t expires;             /* Expiration time. */
> +    char br_name[IFNAMSIZ];
> +};
> +
> +static struct hmap table = HMAP_INITIALIZER(&table);
> +static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER;

Would still need a mutex for excluding concurrent modifications.

> +static uint32_t secret;
> +
> +static uint32_t
> +tnl_arp_table_hash(ovs_be32 ip4)
> +{
> +    return hash_2words(ntohl(ip4), secret);
> +}
> +

cmap is already internally rehashing the given hash, so it would be possible to just give the dst address as the ‘hash’ to cmap functions.

> +static struct tnl_arp_entry *
> +__tnl_arp_lookup(const char br_name[IFNAMSIZ], ovs_be32 dst)
> +{
> +    struct tnl_arp_entry *arp;
> +    uint32_t hash;
> +
> +    hash = tnl_arp_table_hash(dst);
> +    HMAP_FOR_EACH_WITH_HASH (arp, hmap_node, hash, &table) {
> +        if (arp->ip == dst && !strncmp(arp->br_name, br_name, IFNAMSIZ)) {
> +            return arp;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +int
> +tnl_arp_lookup(const char br_name[IFNAMSIZ], ovs_be32 dst, uint8_t mac[ETH_ADDR_LEN])
> +{
> +    struct tnl_arp_entry *arp;
> +    int res = -ENOENT;
> +
> +    ovs_rwlock_rdlock(&rwlock);
> +    VLOG_ERR("pbs: %s port %s\n",__func__,br_name);

Logging should be conditional or removed.

> +    arp = __tnl_arp_lookup(br_name, dst);
> +    if (arp) {
> +            memcpy(mac, arp->mac, ETH_ADDR_LEN);
> +            res = 0;
> +    }
> +    ovs_rwlock_unlock(&rwlock);
> +
> +    return res;
> +}
> +
> +static void
> +tnl_arp_delete(struct tnl_arp_entry *arp)
> +{
> +    hmap_remove(&table, &arp->hmap_node);
> +    free(arp);
> +}
> +
> +int
> +tnl_arp_snoop(const struct flow *flow, struct flow_wildcards *wc,
> +          const char name[IFNAMSIZ])
> +{
> +    struct tnl_arp_entry *arp;
> +    uint32_t hash;
> +
> +    if (flow->dl_type != htons(ETH_TYPE_ARP)) {
> +        return -EINVAL;
> +    }
> +
> +    memset(&wc->masks.dl_type, 0xff, sizeof wc->masks.dl_type);
> +    memset(&wc->masks.nw_proto, 0xff, sizeof wc->masks.nw_proto);
> +    memset(&wc->masks.nw_src, 0xff, sizeof wc->masks.nw_src);
> +    memset(&wc->masks.nw_dst, 0xff, sizeof wc->masks.nw_dst);
> +    memset(&wc->masks.arp_sha, 0xff, sizeof wc->masks.arp_sha);
> +    memset(&wc->masks.arp_tha, 0xff, sizeof wc->masks.arp_tha);
> +
> +    ovs_rwlock_wrlock(&rwlock);
> +    arp = __tnl_arp_lookup(name, flow->nw_src);
> +    if (arp) {
> +        if (!memcmp(arp->mac, flow->arp_sha, ETH_ADDR_LEN)) {
> +            arp->expires = time_now() + ARP_ENTRY_DEFAULT_IDLE_TIME;
> +            ovs_rwlock_unlock(&rwlock);
> +            return 0;
> +        }
> +        tnl_arp_delete(arp);
> +    }
> +
> +    arp = xmalloc(sizeof *arp);
> +
> +    arp->ip = flow->nw_src;
> +    memcpy(arp->mac, flow->arp_sha, ETH_ADDR_LEN);
> +    arp->expires = time_now() + ARP_ENTRY_DEFAULT_IDLE_TIME;
> +    strncpy(arp->br_name, name, IFNAMSIZ);
> +    hash = tnl_arp_table_hash(arp->ip);
> +    hmap_insert(&table, &arp->hmap_node, hash);
> +    ovs_rwlock_unlock(&rwlock);
> +    return 0;
> +}
> +
> +void
> +tnl_arp_cache_run(void)
> +{
> +    struct tnl_arp_entry *arp;
> +
> +    ovs_rwlock_wrlock(&rwlock);
> +    HMAP_FOR_EACH(arp, hmap_node, &table) {
> +        if (arp->expires <= time_now()) {
> +             tnl_arp_delete(arp);
> +        }
> +    }
> +    ovs_rwlock_unlock(&rwlock);
> +}
> +
> +static void
> +tnl_arp_cache_flush(struct unixctl_conn *conn OVS_UNUSED, int argc OVS_UNUSED,
> +                const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED)
> +{
> +    struct tnl_arp_entry *arp, *temp;
> +
> +    ovs_rwlock_wrlock(&rwlock);
> +    HMAP_FOR_EACH_SAFE(arp, temp, hmap_node, &table) {
> +          tnl_arp_delete(arp);
> +    }
> +    ovs_rwlock_unlock(&rwlock);
> +    unixctl_command_reply(conn, "OK");
> +}
> +
> +#define MAX_IP_ADDR_LEN 16
> +
> +static void
> +tnl_arp_cache_show(struct unixctl_conn *conn, int argc OVS_UNUSED,
> +               const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED)
> +{
> +    struct ds ds = DS_EMPTY_INITIALIZER;
> +    struct tnl_arp_entry *arp;
> +
> +    ds_put_cstr(&ds, "IP               MAC                 Bridge\n");
> +    ds_put_cstr(&ds, "=============================================\n");
> +    ovs_rwlock_rdlock(&rwlock);
> +    HMAP_FOR_EACH(arp, hmap_node, &table) {
> +        int start_len, need_ws;
> +
> +        start_len = ds.length;
> +        ds_put_format(&ds, IP_FMT, IP_ARGS(arp->ip));
> +
> +        need_ws = MAX_IP_ADDR_LEN - (ds.length - start_len);
> +
> +        while (need_ws >= 0) {
> +            ds_put_format(&ds, " ");
> +            need_ws--;
> +        }
> +
> +        ds_put_format(&ds, ETH_ADDR_FMT"   %s\n",
> +                      ETH_ADDR_ARGS(arp->mac), arp->br_name);
> +
> +    }
> +    ovs_rwlock_unlock(&rwlock);
> +    unixctl_command_reply(conn, ds_cstr(&ds));
> +    ds_destroy(&ds);
> +}
> +
> +void
> +tnl_arp_cache_init(void)
> +{
> +    /* XXX: Add documentation for these commands. */
> +    unixctl_command_register("tnl/arp/show", "", 0, 0, tnl_arp_cache_show, NULL);
> +    unixctl_command_register("tnl/arp/flush", "", 0, 0, tnl_arp_cache_flush, NULL);
> +    secret = random_uint32();
> +}
> diff --git a/lib/tnl-arp-cache.h b/lib/tnl-arp-cache.h
> new file mode 100644
> index 0000000..f68c3b4
> --- /dev/null
> +++ b/lib/tnl-arp-cache.h
> @@ -0,0 +1,47 @@
> +/*
> + * Copyright (c) 2014 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef OVS_TNL_ARP_CACHE_H
> +#define OVS_TNL_ARP_CACHE_H 1
> +
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <net/if.h>
> +#include <sys/socket.h>
> +
> +#include "flow.h"
> +#include "netdev.h"
> +#include "packets.h"
> +#include "util.h"
> +
> +#ifdef  __cplusplus
> +extern "C" {
> +#endif
> +
> +int tnl_arp_snoop(const struct flow *flow, struct flow_wildcards *wc,
> +                  const char dev_name[]);
> +int tnl_arp_lookup(const char dev_name[], ovs_be32 dst, uint8_t mac[ETH_ADDR_LEN]);
> +void tnl_arp_cache_init(void);
> +void tnl_arp_cache_run(void);
> +
> +#ifdef  __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/tnl-ports.c b/lib/tnl-ports.c
> new file mode 100644
> index 0000000..4dbda7d
> --- /dev/null
> +++ b/lib/tnl-ports.c
> @@ -0,0 +1,203 @@
> +/*
> + * Copyright (c) 2014 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +#include "bitmap.h"
> +#include "classifier.h"
> +#include "coverage.h"
> +#include "dpif-netdev.h"
> +#include "fat-rwlock.h"
> +#include "dynamic-string.h"
> +#include "errno.h"
> +#include "flow.h"
> +#include "hash.h"
> +#include "list.h"
> +#include "netdev.h"
> +#include "ovs-thread.h"
> +#include "odp-util.h"
> +#include "packet-dpif.h"
> +#include "packets.h"
> +#include "poll-loop.h"
> +#include "timeval.h"
> +#include "tnl-arp-cache.h"
> +#include "tnl-ports.h"
> +#include "tnl-ports.h"
> +#include "unaligned.h"
> +#include "unixctl.h"
> +#include "util.h"
> +
> +static struct classifier cls;
> +static struct ovs_mutex tunnel_mutex = OVS_MUTEX_INITIALIZER;
> +
> +struct tunnel_ports {
> +    struct cls_rule cr;
> +    odp_port_t portno;
> +    struct ovs_refcount ref_cnt;
> +    char dev_name[IFNAMSIZ];
> +};

Why plural name for the struct?

> +
> +static struct tunnel_ports *
> +tnl_port_cast(const struct cls_rule *cr)
> +{

if (offsetof(struct tunnel_ports, cr) == 0) {
  return (struct tunnel_ports *)cr;
}

> +    return cr ? CONTAINER_OF(cr, struct tunnel_ports, cr) : NULL;
> +}
> +
> +void
> +tnl_port_insert_udp(odp_port_t port, ovs_be32 ip_dst, ovs_be16 udp_port,
> +                    const char dev_name[])
> +{
> +    const struct cls_rule *cr;
> +    struct tunnel_ports *p;
> +    struct flow_wildcards wc;
> +    struct match match;
> +    struct flow flow;
> +
> +    memset(&flow, 0, sizeof flow);
> +
> +    flow.dl_type = htons(ETH_TYPE_IP);
> +    flow.nw_proto = IPPROTO_UDP;
> +    flow.tp_dst = udp_port;
> +    flow.nw_src = ip_dst;
> +

Please document the weirdness of setting flow.nw_src to ip_dst.

> +    cr = classifier_lookup(&cls, &flow, NULL);
> +    if (cr) {
> +        struct tunnel_ports *p;
> +
> +        p = tnl_port_cast(cr);
> +
> +        ovs_refcount_ref(&p->ref_cnt);
> +        return;
> +    }


You should loop with ovs_refcount_try_ref_rcu() instead, if it is possible that a parallel thread is removing the same entry at the same time, like this:

    do {
        cr = classifier_lookup(&cls, &flow, NULL);

        p = tnl_port_cast(cr);

        /* Try again if the rule was released before we get the reference. */
    } while (p && !ovs_refcount_try_ref_rcu(&p->ref_cnt));

    if (p) {
       return;
    }

Otherwise you could end up calling ovs_refcount_ref() when the count has already reached zero, which will assert fail.

> +
> +    p = xzalloc(sizeof *p);
> +    p->portno = port;
> +
> +    memset(&wc, 0, sizeof(wc));
> +    wc.masks.dl_type = 0xffff;

need htons() here.

> +    wc.masks.nw_proto = 0xff;
> +    wc.masks.nw_frag = 0xff;
> +    wc.masks.tp_dst = 0xffff;

and here.

> +    wc.masks.nw_src = 0xffffffff;

and htonl() here.

> +
> +    /* Need to check IP address. */
> +    match_init(&match, &flow, &wc);

It would be a bit more efficient to use 'match.flow' and 'match.wc' directly instead of ‘flow' and ‘wc’, and then calling match_init().

> +    cls_rule_init(&p->cr, &match, 1);
> +    ovs_refcount_init(&p->ref_cnt);
> +    strncpy(p->dev_name, dev_name, IFNAMSIZ);
> +
> +    ovs_mutex_lock(&tunnel_mutex);
> +    classifier_insert(&cls, &p->cr);
> +    ovs_mutex_unlock(&tunnel_mutex);

Classifier has it’s own internal mutex that serializes mutations, so the locking here is not necessary.

> +}
> +
> +static void
> +tnl_port_unref(struct cls_rule *cr)
> +{
> +    if (cr) {
> +        struct tunnel_ports *p;
> +
> +        p = tnl_port_cast(cr);
> +        if (ovs_refcount_unref_relaxed(&p->ref_cnt) == 1) {
> +           classifier_remove(&cls, cr);
> +        }
> +    }
> +}
> +
> +void
> +tnl_port_delete(ovs_be32 ip_dst, ovs_be16 udp_port)
> +{
> +    struct cls_rule *cr;
> +    struct flow flow;
> +
> +    memset(&flow, 0, sizeof flow);
> +    flow.dl_type = htons(ETH_TYPE_IP);
> +    flow.nw_proto = IPPROTO_UDP;
> +    flow.tp_dst = udp_port;
> +    flow.nw_src = ip_dst;
> +
> +    ovs_mutex_lock(&tunnel_mutex);
> +    cr = classifier_lookup(&cls, &flow, NULL);
> +    tnl_port_unref(cr);
> +    ovs_mutex_unlock(&tunnel_mutex);

The refcount already protects against multiple deletes, so the mutex protection here is not necessary,

> +}
> +
> +odp_port_t
> +tnl_port_lookup(const struct flow *flow, struct flow_wildcards *wc)
> +{
> +    const struct cls_rule *cr;
> +    const struct tunnel_ports *p;
> +
> +    /* Un wildcard protocol and stuff */
> +    cr = classifier_lookup(&cls, flow, wc);
> +    if (!cr) {
> +        return ODPP_NONE;
> +    }
> +    p = tnl_port_cast(cr);
> +    return p->portno;
> +}
> +
> +static void
> +tnl_port_show(struct unixctl_conn *conn, int argc OVS_UNUSED,
> +              const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED)
> +{
> +    struct ds ds = DS_EMPTY_INITIALIZER;
> +    const struct tunnel_ports *p;
> +
> +    ovs_mutex_lock(&tunnel_mutex);

Classifier internally prevents mutations during iteration, so the mutex here is not necessary.

> +    CLS_FOR_EACH(p, cr, &cls) {
> +        struct odputil_keybuf keybuf;
> +        struct odputil_keybuf maskbuf;
> +        struct flow flow;
> +        const struct nlattr *key, *mask;
> +        size_t key_len, mask_len;
> +        struct flow_wildcards wc;
> +        struct ofpbuf buf;
> +
> +        ds_put_format(&ds, "%s: ", p->dev_name);
> +        minimask_expand(&p->cr.match.mask, &wc);
> +        miniflow_expand(&p->cr.match.flow, &flow);
> +
> +        /* Key. */
> +        ofpbuf_use_stack(&buf, &keybuf, sizeof keybuf);
> +        odp_flow_key_from_flow(&buf, &flow, &wc.masks,
> +                               flow.in_port.odp_port, true);
> +        key = ofpbuf_data(&buf);
> +        key_len = ofpbuf_size(&buf);
> +        /* mask*/
> +        ofpbuf_use_stack(&buf, &maskbuf, sizeof maskbuf);
> +        odp_flow_key_from_mask(&buf, &wc.masks, &flow,
> +                               odp_to_u32(wc.masks.in_port.odp_port),
> +                               SIZE_MAX, false);
> +        mask = ofpbuf_data(&buf);
> +        mask_len = ofpbuf_size(&buf);
> +
> +        /* build string. */
> +        odp_flow_format(key, key_len, mask, mask_len, NULL, &ds, false);
> +    }
> +    ovs_mutex_unlock(&tunnel_mutex);
> +    unixctl_command_reply(conn, ds_cstr(&ds));
> +    ds_destroy(&ds);
> +}
> +
> +void
> +tnl_port_init(void)
> +{
> +    classifier_init(&cls, flow_segment_u32s);
> +    unixctl_command_register("tnl/ports/show", "", 0, 0, tnl_port_show, NULL);
> +}
> diff --git a/lib/tnl-ports.h b/lib/tnl-ports.h
> new file mode 100644
> index 0000000..73355e3
> --- /dev/null
> +++ b/lib/tnl-ports.h
> @@ -0,0 +1,49 @@
> +/*
> + * Copyright (c) 2014 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef OVS_TNL_PORT_H
> +#define OVS_TNL_PORT_H 1
> +
> +#include <errno.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <inttypes.h>
> +#include <net/if.h>
> +#include <sys/socket.h>
> +
> +#include "flow.h"
> +#include "packets.h"
> +#include "util.h"
> +
> +#ifdef  __cplusplus
> +extern "C" {
> +#endif
> +
> +odp_port_t tnl_port_lookup(const struct flow *flow,
> +                           struct flow_wildcards *wc);
> +
> +void tnl_port_insert_udp(odp_port_t port, ovs_be32 ip_dst, ovs_be16 udp_port,
> +                         const char dev_name[]);
> +
> +void tnl_port_delete(ovs_be32 ip_dst OVS_UNUSED, ovs_be16 udp_port OVS_UNUSED);
> +
> +void tnl_port_init(void);
> +
> +#ifdef  __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/tnl-router.c b/lib/tnl-router.c
> new file mode 100644
> index 0000000..4397509
> --- /dev/null
> +++ b/lib/tnl-router.c
> @@ -0,0 +1,237 @@
> +/*
> + * Copyright (c) 2014 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +#include <arpa/inet.h>
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <sys/socket.h>
> +#include <net/if.h>
> +#include <netinet/in.h>
> +#include <stdarg.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +
> +#include "classifier.h"
> +#include "command-line.h"
> +#include "compiler.h"
> +#include "dirs.h"
> +#include "dpif.h"
> +#include "dynamic-string.h"
> +#include "fat-rwlock.h"
> +#include "flow.h"
> +#include "match.h"
> +#include "netdev.h"
> +#include "netlink.h"
> +#include "odp-util.h"
> +#include "ofp-parse.h”

Is this (and may other) includes necessary? Seems like a long list of includes for a simple source file!

> +#include "ofpbuf.h"
> +#include "ovs-thread.h"
> +#include "packets.h"
> +#include "shash.h"
> +#include "simap.h"
> +#include "smap.h"
> +#include "sset.h"
> +#include "timeval.h"
> +#include "tnl-router.h"
> +#include "tnl-ports.h"
> +#include "unixctl.h"
> +#include "util.h"
> +#include "list.h"
> +#include "util.h"
> +
> +static struct classifier cls;
> +static struct ovs_mutex mutex = OVS_MUTEX_INITIALIZER;
> +static int version, c_ver;
> +
> +struct tnl_route_entry {
> +    struct cls_rule cr;
> +    char output_bridge[IFNAMSIZ];
> +    ovs_be32 gw;
> +    ovs_be32 nw_addr;
> +    ovs_be32 mask;
> +};
> +
> +bool
> +tnl_route_reconfigured(void)
> +{
> +    if (version == c_ver) {
> +        return false;
> +    }
> +    /* version is not commited, send refconfigure event. */
> +    c_ver++;
> +    return true;
> +}
> +
> +static struct tnl_route_entry *
> +tnl_route_entry_cast(const struct cls_rule *cr)
> +{
> +    return cr ? CONTAINER_OF(cr, struct tnl_route_entry, cr) : NULL;
> +}
> +
> +int
> +tnl_route_lookup(ovs_be32 ip_dst, char output_bridge[], ovs_be32 *gw)
> +{
> +    const struct cls_rule *cr;
> +    struct flow flow;
> +
> +    memset(&flow, 0, sizeof(flow));
> +    flow.nw_dst = ip_dst;
> +
> +    cr = classifier_lookup(&cls, &flow, NULL);
> +    if (cr) {
> +        struct tnl_route_entry *p;
> +
> +        p = tnl_route_entry_cast(cr);
> +        strncpy(output_bridge, p->output_bridge, IFNAMSIZ);
> +        *gw = p->gw;
> +        return 0;
> +    }
> +    return -ENOENT;
> +}
> +
> +bool
> +tnl_route_check_dev(const char output_bridge[])
> +{
> +    struct tnl_route_entry *rt;
> +    bool res = false;
> +
> +    ovs_mutex_lock(&mutex);
> +

Locking here is not necessary.

> +    CLS_FOR_EACH(rt, cr, &cls) {
> +        if (!strncmp(output_bridge, rt->output_bridge, IFNAMSIZ)) {
> +            res = true;
> +        }
> +    }
> +    ovs_mutex_unlock(&mutex);
> +    return res;
> +}
> +
> +static void
> +rt_entry_insert(ovs_be32 ip_dst, ovs_be32 mask,
> +                const char output_bridge[], ovs_be32 gw)
> +{
> +    const struct cls_rule *cr;
> +    struct tnl_route_entry *p;
> +    struct flow_wildcards wc;
> +    struct flow flow;
> +    struct match match;
> +
> +    memset(&flow, 0, sizeof(flow));
> +    flow.nw_dst = ip_dst;
> +
> +    cr = classifier_lookup(&cls, &flow, NULL);
> +    if (cr) {
> +        return;
> +    }
> +
> +    p = xzalloc(sizeof *p);
> +    strncpy(p->output_bridge, output_bridge, IFNAMSIZ);
> +    p->gw = gw;
> +    p->nw_addr = ip_dst;
> +    p->mask = mask;
> +
> +    memset(&wc, 0, sizeof(wc));
> +    wc.masks.nw_dst = mask;
> +    /* Need to check IP address. */
> +    match_init(&match, &flow, &wc);
> +    cls_rule_init(&p->cr, &match, 1);
> +
> +    ovs_mutex_lock(&mutex);
> +    classifier_insert(&cls, &p->cr);
> +    version++;
> +    ovs_mutex_unlock(&mutex);

if version was an atomic variable, then the locking would not be necessary.

> +}
> +
> +static void
> +tnl_route_add(struct unixctl_conn *conn, int argc,
> +              const char *argv[], void *aux OVS_UNUSED)
> +{
> +    ovs_be32 ip, mask, gw;
> +
> +    inet_aton(argv[1], (struct in_addr *)&ip);
> +    inet_aton(argv[2], (struct in_addr *)&mask);
> +
> +    if (argc == 5) {
> +        inet_aton(argv[4], (struct in_addr *)&gw);
> +    } else {
> +        gw = 0;
> +    }
> +
> +    rt_entry_insert(ip, mask, argv[3], gw);
> +    unixctl_command_reply(conn, "OK");
> +}
> +
> +static void
> +tnl_route_del(struct unixctl_conn *conn, int argc OVS_UNUSED,
> +              const char *argv[], void *aux OVS_UNUSED)
> +{
> +    struct cls_rule *cr;
> +    struct flow flow;
> +    ovs_be32 ip, mask;
> +
> +    inet_aton(argv[1], (struct in_addr *)&ip);
> +    inet_aton(argv[2], (struct in_addr *)&mask);
> +
> +    ovs_mutex_lock(&mutex);
> +    memset(&flow, 0, sizeof(flow));
> +    flow.nw_dst = ip & mask;
> +
> +    cr = classifier_lookup(&cls, &flow, NULL);
> +    if (cr) {
> +        classifier_remove(&cls, cr);
> +        version++;
> +        unixctl_command_reply(conn, "OK");
> +    } else {
> +        unixctl_command_reply(conn, "Not found");
> +    }
> +
> +    ovs_mutex_unlock(&mutex);

This is a bit of a bummer, but locking here is necessary to make sure that the ‘cr’ still exists in the classifier when classifier_remove is called. We should add a new function to take care of this (like classifier_remove_flow(&cls, &flow, priority) finds the exact matching rule and removes it if found.)


> +}
> +
> +static void
> +tnl_route_show(struct unixctl_conn *conn, int argc OVS_UNUSED,
> +               const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED)
> +{
> +    struct tnl_route_entry *rt;
> +    struct ds ds = DS_EMPTY_INITIALIZER;
> +
> +    ovs_mutex_lock(&mutex);
> +

Locking is implicitly done for the duration of the iteration.

> +    CLS_FOR_EACH(rt, cr, &cls) {
> +        ds_put_format(&ds, "IP "IP_FMT" Mask "IP_FMT" dev %s",
> +                      IP_ARGS(rt->nw_addr), IP_ARGS(rt->mask), rt->output_bridge);
> +        if (rt->gw) {
> +	    ds_put_format(&ds, " GW "IP_FMT, IP_ARGS(rt->gw));
> +        }
> +	ds_put_format(&ds, "\n");
> +    }
> +    unixctl_command_reply(conn, ds_cstr(&ds));
> +    ds_destroy(&ds);
> +
> +    ovs_mutex_unlock(&mutex);
> +}
> +
> +void
> +tnl_route_unixctl_register(void)
> +{
> +    /* XXX: Add documentation for these commands. */
> +    unixctl_command_register("tnl/route/add", "ip mask dev gw", 3, 4, tnl_route_add, NULL);
> +    unixctl_command_register("tnl/route/show", "", 0, 0, tnl_route_show, NULL);
> +    unixctl_command_register("tnl/route/del", "ip mask", 2, 2, tnl_route_del, NULL);
> +    classifier_init(&cls, NULL);

Just to be on the safe side, I’d init classifier first, then register the commands.

> +}
> diff --git a/lib/tnl-router.h b/lib/tnl-router.h
> new file mode 100644
> index 0000000..f780121
> --- /dev/null
> +++ b/lib/tnl-router.h
> @@ -0,0 +1,42 @@
> +/*
> + * Copyright (c) 2014 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef OVS_TNL_ROUTER_H
> +#define OVS_TNL_ROUTER_H 1
> +
> +#include <stddef.h>
> +#include <stdint.h>
> +#include <net/if.h>
> +
> +#include "packets.h"
> +#include "timeval.h"
> +#include "unixctl.h"
> +#include "util.h"
> +
> +#ifdef  __cplusplus
> +extern "C" {
> +#endif
> +
> +int tnl_route_lookup(ovs_be32 ip_dst, char out_dev[], ovs_be32 *gw);
> +void tnl_route_unixctl_register(void);
> +bool tnl_route_reconfigured(void);
> +bool tnl_route_check_dev(const char out_dev[]);
> +
> +#ifdef  __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/vlog.c b/lib/vlog.c
> index 42e4869..69b40b1 100644
> --- a/lib/vlog.c
> +++ b/lib/vlog.c
> @@ -935,6 +935,7 @@ vlog(const struct vlog_module *module, enum vlog_level level,
>     va_start(args, message);
>     vlog_valist(module, level, message, args);
>     va_end(args);
> +    fflush(NULL);

Have you considered the cost of this? 

> }
> 
> /* Logs 'message' to 'module' at maximum verbosity, then exits with a failure
> diff --git a/lib/vxlan.h b/lib/vxlan.h
> new file mode 100644
> index 0000000..0ae5a84
> --- /dev/null
> +++ b/lib/vxlan.h
> @@ -0,0 +1,46 @@
> +/*
> + * Copyright (c) 2014 Nicira, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + *     http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef NETDEV_VXLAN_H
> +#define NETDEV_VXLAN_H 1
> +
> +#include <stddef.h>
> +#include <stdint.h>
> +#include "list.h"
> +#include "packets.h"
> +#include "util.h"
> +#include "netdev-dpdk.h"
> +
> +#ifdef  __cplusplus
> +extern "C" {
> +#endif
> +
> +#define VXLAN_HLEN (sizeof(struct udphdr) + sizeof(struct vxlanhdr))
> +
> +#define VXLAN_FLAGS 0x08000000  /* struct vxlanhdr.vx_flags required value. */
> +
> +/* VXLAN protocol header */
> +OVS_PACKED(
> +struct vxlanhdr {
> +        ovs_be32 vx_flags;
> +        ovs_be32 vx_vni;
> +});

Does this really need to be packed?

> +
> +#ifdef  __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index 880824d..9d41c30 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -17,7 +17,12 @@
> #include "ofproto/ofproto-dpif-xlate.h"
> 
> #include <errno.h>
> +#include <arpa/inet.h>
> +#include <net/if.h>
> +#include <sys/socket.h>
> +#include <netinet/in.h>
> 
> +#include "tnl-arp-cache.h"
> #include "bfd.h"
> #include "bitmap.h"
> #include "bond.h"
> @@ -48,7 +53,9 @@
> #include "ofproto/ofproto-dpif.h"
> #include "ofproto/ofproto-provider.h"
> #include "packet-dpif.h"
> +#include "tnl-router.h"
> #include "tunnel.h"
> +#include "tnl-ports.h"
> #include "vlog.h"
> 
> COVERAGE_DEFINE(xlate_actions);
> @@ -112,6 +119,11 @@ struct xbridge {
>     /* True if the datapath supports masked data in OVS_ACTION_ATTR_SET
>      * actions. */
>     bool masked_set_action;
> +
> +    /* True if tunnel route configure on this bridge. */\

“is configured”

> +    bool has_tunnel_port;
> +
> +    bool enable_userspace_tunneling;
> };
> 
> struct xbundle {
> @@ -374,7 +386,8 @@ static void xlate_xbridge_set(struct xbridge *, struct dpif *,
>                               bool enable_recirc,
>                               bool variable_length_userdata,
>                               size_t max_mpls_depth,
> -                              bool masked_set_action);
> +                              bool masked_set_action,
> +                              bool enable_userspace_tunneling);
> static void xlate_xbundle_set(struct xbundle *xbundle,
>                               enum port_vlan_mode vlan_mode, int vlan,
>                               unsigned long *trunks, bool use_priority_tags,
> @@ -440,7 +453,8 @@ xlate_xbridge_set(struct xbridge *xbridge,
>                   bool enable_recirc,
>                   bool variable_length_userdata,
>                   size_t max_mpls_depth,
> -                  bool masked_set_action)
> +                  bool masked_set_action,
> +                  bool enable_userspace_tunneling)
> {
>     if (xbridge->ml != ml) {
>         mac_learning_unref(xbridge->ml);
> @@ -492,6 +506,9 @@ xlate_xbridge_set(struct xbridge *xbridge,
>     xbridge->variable_length_userdata = variable_length_userdata;
>     xbridge->max_mpls_depth = max_mpls_depth;
>     xbridge->masked_set_action = masked_set_action;
> +    xbridge->has_tunnel_port = enable_userspace_tunneling &&
> +                               tnl_route_check_dev(xbridge->name);
> +    xbridge->enable_userspace_tunneling = enable_userspace_tunneling;
> }
> 
> static void
> @@ -574,7 +591,8 @@ xlate_xbridge_copy(struct xbridge *xbridge)
>                       xbridge->frag, xbridge->forward_bpdu,
>                       xbridge->has_in_band, xbridge->enable_recirc,
>                       xbridge->variable_length_userdata,
> -                      xbridge->max_mpls_depth, xbridge->masked_set_action);
> +                      xbridge->max_mpls_depth, xbridge->masked_set_action,
> +                      xbridge->enable_userspace_tunneling);
>     LIST_FOR_EACH (xbundle, list_node, &xbridge->xbundles) {
>         xlate_xbundle_copy(new_xbridge, xbundle);
>     }
> @@ -728,7 +746,7 @@ xlate_ofproto_set(struct ofproto_dpif *ofproto, const char *name,
>                   const struct netflow *netflow, enum ofp_config_flags frag,
>                   bool forward_bpdu, bool has_in_band, bool enable_recirc,
>                   bool variable_length_userdata, size_t max_mpls_depth,
> -                  bool masked_set_action)
> +                  bool masked_set_action, bool enable_userspace_tunneling)
> {
>     struct xbridge *xbridge;
> 
> @@ -749,7 +767,7 @@ xlate_ofproto_set(struct ofproto_dpif *ofproto, const char *name,
>                       rstp, ms, mbridge, sflow, ipfix, netflow, frag,
>                       forward_bpdu, has_in_band, enable_recirc,
>                       variable_length_userdata, max_mpls_depth,
> -                      masked_set_action);
> +                      masked_set_action, enable_userspace_tunneling);
> }
> 
> static void
> @@ -2133,6 +2151,10 @@ xlate_normal(struct xlate_ctx *ctx)
>         return;
>     }
> 
> +    if (ctx->xbridge->has_tunnel_port) {
> +        tnl_arp_snoop(flow, wc, ctx->xbridge->name);
> +    }
> +
>     /* Learn source MAC. */
>     if (ctx->xin->may_learn) {
>         update_learning_table(ctx->xbridge, flow, wc, vlan, in_xbundle);
> @@ -2469,6 +2491,120 @@ process_special(struct xlate_ctx *ctx, const struct flow *flow,
>     }
> }
> 
> +static int
> +tnl_route_lookup_flow(const struct flow *oflow,
> +                      ovs_be32 *ip, struct xport **out_port)
> +{
> +    char out_dev[IFNAMSIZ];
> +    struct xbridge *xbridge;
> +    struct xlate_cfg *xcfg;
> +    ovs_be32 gw;
> +    int res;
> +
> +    res = tnl_route_lookup(oflow->tunnel.ip_dst, out_dev, &gw);
> +    if (res) {
> +        return res;
> +    }
> +
> +    if (gw) {
> +        *ip = gw;
> +    } else {
> +        *ip = oflow->tunnel.ip_dst;
> +    }
> +
> +    xcfg = ovsrcu_get(struct xlate_cfg *, &xcfgp);
> +    ovs_assert(xcfg);
> +
> +    HMAP_FOR_EACH (xbridge, hmap_node, &xcfg->xbridges) {
> +        if (!strncmp(xbridge->name, out_dev, IFNAMSIZ)) {
> +            struct xport *port;
> +
> +            HMAP_FOR_EACH (port, ofp_node, &xbridge->xports) {
> +                if (!strncmp(netdev_get_name(port->netdev), out_dev, IFNAMSIZ)) {
> +                    *out_port = port;
> +                    return 0;
> +                }
> +            }
> +        }
> +    }

Could the tnl_route_lookup return the ‘port’ directly? Would be much cheaper.

> +    return -ENOENT;
> +}
> +
> +static int
> +xlate_recv_packet(struct xbridge *xbridge, struct ofpbuf *packet)

Should this be called slate_flood_packet() instead? Should we make xlate_send_packet() support flooding instead? E.g., NULL port -> flood?

> +{
> +    struct ofpact_output output;
> +    struct flow flow;
> +
> +    ofpact_init(&output.ofpact, OFPACT_OUTPUT, sizeof output);
> +    /* Use OFPP_NONE as the in_port to avoid special packet processing. */
> +    flow_extract(packet, NULL, &flow);
> +    flow.in_port.ofp_port = OFPP_NONE;
> +    output.port = OFPP_NORMAL;

Could this be OFPP_FLOOD instead?

> +    output.max_len = 0;
> +
> +    return ofproto_dpif_execute_actions(xbridge->ofproto, &flow, NULL,
> +                                        &output.ofpact, sizeof output,
> +                                        packet);
> +}
> +
> +
> +static void
> +tnl_arp_request(const struct xport *out_dev, const uint8_t eth_src[ETH_ADDR_LEN],
> +            ovs_be32 ip_src, ovs_be32 ip_dst)
> +{
> +    struct xbridge *xbridge = out_dev->xbridge;
> +    struct ofpbuf packet;
> +
> +    ofpbuf_init(&packet, 0);
> +    compose_arp(&packet, eth_src, ip_src, ip_dst);
> +
> +    xlate_recv_packet(xbridge, &packet);
> +    ofpbuf_uninit(&packet);
> +}
> +
> +static int
> +build_tunnel_send(const struct xlate_ctx *ctx, const struct xport *xport,
> +                  const struct flow *flow, odp_port_t tunnel_odp_port)
> +{
> +    struct xport *out_dev = NULL;
> +    struct ofpbuf tnl_header;
> +    char header_stub[256];
> +    ovs_be32 s_ip, d_ip = 0;
> +    uint8_t smac[ETH_ADDR_LEN];
> +    bool send_arp;
> +    int res;
> +
> +    res = tnl_route_lookup_flow(flow, &d_ip, &out_dev);
> +    if (res) {
> +        return res;
> +    }
> +
> +    /* Use mac addr of bridge port of the peer. */
> +    res = netdev_get_etheraddr(out_dev->netdev, smac);
> +    if (res) {
> +        return res;
> +    }
> +    res = netdev_get_in4(out_dev->netdev, (struct in_addr *) &s_ip, NULL);
> +    if (res) {
> +        return res;
> +    }
> +
> +    ofpbuf_use_stub(&tnl_header, header_stub, sizeof header_stub);
> +    res = tnl_port_build_header(xport->ofport, d_ip, flow,
> +                                out_dev->xbridge->name, smac, s_ip,
> +                                &tnl_header, &send_arp);
> +    if (res) {
> +        if (send_arp) {
> +            tnl_arp_request(out_dev, smac, s_ip, d_ip);
> +        }
> +        return res;
> +    }
> +    odp_put_build_tunnel_header_action(ctx->xout->odp_actions, &tnl_header,
> +                                       tunnel_odp_port, out_dev->odp_port);
> +    return 0;
> +}
> +
> static void
> compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
>                         bool check_stp)
> @@ -2613,9 +2749,15 @@ compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
>             entry->u.dev.tx = netdev_ref(xport->netdev);
>         }
>         out_port = odp_port;
> -        commit_odp_tunnel_action(flow, &ctx->base_flow,
> -                                 ctx->xout->odp_actions);
> -        flow->tunnel = flow_tnl; /* Restore tunnel metadata */
> +        if (ctx->xbridge->enable_userspace_tunneling) {
> +            build_tunnel_send(ctx, xport, flow, odp_port);
> +            goto out;
> +        } else {
> +            commit_odp_tunnel_action(flow, &ctx->base_flow,
> +                                     ctx->xout->odp_actions);
> +            flow->tunnel = flow_tnl; /* Restore tunnel metadata */
> +        }
> +
>     } else {
>         odp_port = xport->odp_port;
>         out_port = odp_port;
> @@ -2635,7 +2777,7 @@ compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
>     if (out_port != ODPP_NONE) {
>         ctx->xout->slow |= commit_odp_actions(flow, &ctx->base_flow,
>                                               ctx->xout->odp_actions,
> -                                              &ctx->xout->wc,
> +                                              wc,
>                                               ctx->xbridge->masked_set_action);
> 
>         if (ctx->use_recirc) {
> @@ -2653,6 +2795,18 @@ compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
>             nl_msg_put_u32(ctx->xout->odp_actions, OVS_ACTION_ATTR_RECIRC,
>                            xr->recirc_id);
>         } else {
> +            /* XXX: Write better Filter for tunnel port. We can use inport
> +             * int tunnel-port flow to avoid these checks completely. */
> +            if (ofp_port == OFPP_LOCAL && ctx->xbridge->has_tunnel_port) {
> +                odp_port_t odp_tnl_port;
> +
> +                odp_tnl_port = tnl_port_lookup(flow, wc);
> +                if (odp_tnl_port != ODPP_NONE) {
> +                    nl_msg_put_odp_port(ctx->xout->odp_actions, OVS_ACTION_ATTR_TUNNEL_POP,
> +                                        odp_tnl_port);
> +                    goto out;
> +                }
> +            }
>             add_ipfix_output_action(ctx, out_port);
>             nl_msg_put_odp_port(ctx->xout->odp_actions, OVS_ACTION_ATTR_OUTPUT,
>                                 out_port);
> @@ -3013,9 +3167,9 @@ execute_controller_action(struct xlate_ctx *ctx, int len,
>                           enum ofp_packet_in_reason reason,
>                           uint16_t controller_id)
> {
> +    struct pkt_metadata md = PKT_METADATA_INITIALIZER(0);
>     struct ofproto_packet_in *pin;
>     struct dpif_packet *packet;
> -    struct pkt_metadata md = PKT_METADATA_INITIALIZER(0);
> 
>     ctx->xout->slow |= SLOW_CONTROLLER;
>     if (!ctx->xin->packet) {
> @@ -3023,13 +3177,13 @@ execute_controller_action(struct xlate_ctx *ctx, int len,
>     }
> 
>     packet = dpif_packet_clone_from_ofpbuf(ctx->xin->packet);
> +    packet->md = md;

dpif_packet_clone_from_ofpbuf() already zero-initializes the md so it need not be set again.

> 
>     ctx->xout->slow |= commit_odp_actions(&ctx->xin->flow, &ctx->base_flow,
>                                           ctx->xout->odp_actions,
>                                           &ctx->xout->wc,
>                                           ctx->xbridge->masked_set_action);
> 
> -    packet->md = md;
>     odp_execute_actions(NULL, &packet, 1, false,
>                         ofpbuf_data(ctx->xout->odp_actions),
>                         ofpbuf_size(ctx->xout->odp_actions), NULL);
> diff --git a/ofproto/ofproto-dpif-xlate.h b/ofproto/ofproto-dpif-xlate.h
> index 5ef20b1..6cf80ad 100644
> --- a/ofproto/ofproto-dpif-xlate.h
> +++ b/ofproto/ofproto-dpif-xlate.h
> @@ -154,7 +154,8 @@ void xlate_ofproto_set(struct ofproto_dpif *, const char *name,
>                        bool has_in_band, bool enable_recirc,
>                        bool variable_length_userdata,
>                        size_t mpls_label_stack_length,
> -                       bool masked_set_action);
> +                       bool masked_set_action,
> +                       bool enable_userspace_tunneling);
> void xlate_remove_ofproto(struct ofproto_dpif *);
> 
> void xlate_bundle_set(struct ofproto_dpif *, struct ofbundle *,
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index d965d38..f22ac9c 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -59,6 +59,7 @@
> #include "ofproto-dpif-upcall.h"
> #include "ofproto-dpif-xlate.h"
> #include "poll-loop.h"
> +#include "tnl-router.h"
> #include "seq.h"
> #include "simap.h"
> #include "smap.h"
> @@ -280,6 +281,9 @@ struct dpif_backer {
>     /* Maximum number of MPLS label stack entries that the datapath supports
>      * in a match */
>     size_t max_mpls_depth;
> +
> +    /* True if the datapath supports tnl_push and pop actions. */
> +    bool enable_userspace_tunneling;
> };
> 
> /* All existing ofproto_backer instances, indexed by ofproto->up.type. */
> @@ -581,7 +585,7 @@ type_run(const char *type)
> 
>                 iter->odp_port = node ? u32_to_odp(node->data) : ODPP_NONE;
>                 if (tnl_port_reconfigure(iter, iter->up.netdev,
> -                                         iter->odp_port)) {
> +                                         iter->odp_port, dp_port)) {
>                     backer->need_revalidate = REV_RECONFIGURE;
>                 }
>             }
> @@ -624,7 +628,8 @@ type_run(const char *type)
>                               ofproto->backer->enable_recirc,
>                               ofproto->backer->variable_length_userdata,
>                               ofproto->backer->max_mpls_depth,
> -                              ofproto->backer->masked_set_action);
> +                              ofproto->backer->masked_set_action,
> +                              ofproto->backer->enable_userspace_tunneling);
> 
>             HMAP_FOR_EACH (bundle, hmap_node, &ofproto->bundles) {
>                 xlate_bundle_set(ofproto, bundle, bundle->name,
> @@ -949,6 +954,9 @@ open_dpif_backer(const char *type, struct dpif_backer **backerp)
>     backer->masked_set_action = check_masked_set_action(backer);
>     backer->rid_pool = recirc_id_pool_create();
> 
> +    backer->enable_userspace_tunneling = dpif_is_userspace_tunneling_on() &&
> +                                         dpif_support_userspace_tunneling(backer->dpif);
> +
>     error = dpif_recv_set(backer->dpif, backer->recv_set_enable);
>     if (error) {
>         VLOG_ERR("failed to listen on datapath of type %s: %s",
> @@ -1225,6 +1233,7 @@ construct(struct ofproto *ofproto_)
>     guarded_list_init(&ofproto->pins);
> 
>     ofproto_dpif_unixctl_init();
> +    tnl_route_unixctl_register();
> 
>     hmap_init(&ofproto->vlandev_map);
>     hmap_init(&ofproto->realdev_vid_map);
> @@ -1517,6 +1526,9 @@ run(struct ofproto *ofproto_)
>         }
>     }
> 
> +    if (tnl_route_reconfigured()) {
> +        ofproto->backer->need_revalidate = REV_MCAST_SNOOPING;
> +    }
>     return 0;
> }
> 
> @@ -1668,7 +1680,7 @@ port_construct(struct ofport *port_)
>     port->odp_port = dpif_port.port_no;
> 
>     if (netdev_get_tunnel_config(netdev)) {
> -        tnl_port_add(port, port->up.netdev, port->odp_port);
> +        tnl_port_add(port, port->up.netdev, port->odp_port, namebuf);
>         port->is_tunnel = true;
>         if (ofproto->ipfix) {
>            dpif_ipfix_add_tunnel_port(ofproto->ipfix, port_, port->odp_port);
> @@ -1759,24 +1771,28 @@ static void
> port_modified(struct ofport *port_)
> {
>     struct ofport_dpif *port = ofport_dpif_cast(port_);
> +    char namebuf[NETDEV_VPORT_NAME_BUFSIZE];
> +    struct netdev *netdev = port->up.netdev;
> 
>     if (port->bundle && port->bundle->bond) {
> -        bond_slave_set_netdev(port->bundle->bond, port, port->up.netdev);
> +        bond_slave_set_netdev(port->bundle->bond, port, netdev);
>     }
> 
>     if (port->cfm) {
> -        cfm_set_netdev(port->cfm, port->up.netdev);
> +        cfm_set_netdev(port->cfm, netdev);
>     }
> 
>     if (port->bfd) {
> -        bfd_set_netdev(port->bfd, port->up.netdev);
> +        bfd_set_netdev(port->bfd, netdev);
>     }
> 
>     ofproto_dpif_monitor_port_update(port, port->bfd, port->cfm,
>                                      port->up.pp.hw_addr);
> 
> -    if (port->is_tunnel && tnl_port_reconfigure(port, port->up.netdev,
> -                                                port->odp_port)) {
> +    netdev_vport_get_dpif_port(netdev, namebuf, sizeof namebuf);
> +
> +    if (port->is_tunnel && tnl_port_reconfigure(port, netdev,
> +                                                port->odp_port, namebuf)) {
>         ofproto_dpif_cast(port->up.ofproto)->backer->need_revalidate =
>             REV_RECONFIGURE;
>     }
> @@ -3462,8 +3478,10 @@ ofproto_dpif_execute_actions(struct ofproto_dpif *ofproto,
>     struct xlate_in xin;
>     ofp_port_t in_port;
>     struct dpif_execute execute;
> +    struct ds ds;
>     int error;
> 
> +    ds_init(&ds);
>     ovs_assert((rule != NULL) != (ofpacts != NULL));
> 
>     dpif_flow_stats_extract(flow, packet, time_msec(), &stats);
> @@ -3481,6 +3499,10 @@ ofproto_dpif_execute_actions(struct ofproto_dpif *ofproto,
> 
>     execute.actions = ofpbuf_data(xout.odp_actions);
>     execute.actions_len = ofpbuf_size(xout.odp_actions);
> +
> +    format_odp_actions(&ds, ofpbuf_data(xout.odp_actions),
> +                           ofpbuf_size(xout.odp_actions));
> +

These additions look like dead code now.

>     execute.packet = packet;
>     execute.md = pkt_metadata_from_flow(flow);
>     execute.needs_help = (xout.slow & SLOW_ACTION) != 0;
> diff --git a/ofproto/tunnel.c b/ofproto/tunnel.c
> index 46b0719..120af46 100644
> --- a/ofproto/tunnel.c
> +++ b/ofproto/tunnel.c
> @@ -17,19 +17,25 @@
> 
> #include <errno.h>
> 
> +#include "tnl-arp-cache.h"
> #include "byte-order.h"
> #include "connectivity.h"
> +#include "dpif.h"
> #include "dynamic-string.h"
> #include "hash.h"
> #include "hmap.h"
> #include "netdev.h"
> #include "odp-util.h"
> +#include "ofpbuf.h"
> #include "packets.h"
> #include "seq.h"
> #include "smap.h"
> #include "socket-util.h"
> #include "tunnel.h"
> +#include "tnl-ports.h"
> #include "vlog.h"
> +#include "unaligned.h"
> +#include "ofproto-dpif.h"
> 
> VLOG_DEFINE_THIS_MODULE(tunnel);
> 
> @@ -124,7 +130,7 @@ static void tnl_port_del__(const struct ofport_dpif *) OVS_REQ_WRLOCK(rwlock);
> 
> static bool
> tnl_port_add__(const struct ofport_dpif *ofport, const struct netdev *netdev,
> -               odp_port_t odp_port, bool warn)
> +               odp_port_t odp_port, bool warn, const char name[])
>     OVS_REQ_WRLOCK(rwlock)
> {
>     const struct netdev_tunnel_config *cfg;
> @@ -173,6 +179,12 @@ tnl_port_add__(const struct ofport_dpif *ofport, const struct netdev *netdev,
>     }
>     hmap_insert(*map, &tnl_port->match_node, tnl_hash(&tnl_port->match));
>     tnl_port_mod_log(tnl_port, "adding");
> +
> +    if (dpif_is_userspace_tunneling_on()) {
> +        /* XXX: Add generic API to handle other tunneling protocols. */
> +        tnl_port_insert_udp(odp_port, tnl_port->match.ip_dst,
> +                               cfg->dst_port, name);
> +    }
>     return true;
> }
> 
> @@ -181,10 +193,10 @@ tnl_port_add__(const struct ofport_dpif *ofport, const struct netdev *netdev,
>  * tunnel. */
> void
> tnl_port_add(const struct ofport_dpif *ofport, const struct netdev *netdev,
> -             odp_port_t odp_port) OVS_EXCLUDED(rwlock)
> +             odp_port_t odp_port, const char name[]) OVS_EXCLUDED(rwlock)
> {
>     ovs_rwlock_wrlock(&rwlock);
> -    tnl_port_add__(ofport, netdev, odp_port, true);
> +    tnl_port_add__(ofport, netdev, odp_port, true, name);
>     ovs_rwlock_unlock(&rwlock);
> }
> 
> @@ -194,7 +206,8 @@ tnl_port_add(const struct ofport_dpif *ofport, const struct netdev *netdev,
>  * tnl_port_add(). */
> bool
> tnl_port_reconfigure(const struct ofport_dpif *ofport,
> -                     const struct netdev *netdev, odp_port_t odp_port)
> +                     const struct netdev *netdev, odp_port_t odp_port,
> +                     const char name[])
>     OVS_EXCLUDED(rwlock)
> {
>     struct tnl_port *tnl_port;
> @@ -203,13 +216,13 @@ tnl_port_reconfigure(const struct ofport_dpif *ofport,
>     ovs_rwlock_wrlock(&rwlock);
>     tnl_port = tnl_find_ofport(ofport);
>     if (!tnl_port) {
> -        changed = tnl_port_add__(ofport, netdev, odp_port, false);
> +        changed = tnl_port_add__(ofport, netdev, odp_port, false, name);
>     } else if (tnl_port->netdev != netdev
>                || tnl_port->match.odp_port != odp_port
>                || tnl_port->change_seq != seq_read(connectivity_seq_get())) {
>         VLOG_DBG("reconfiguring %s", tnl_port_get_name(tnl_port));
>         tnl_port_del__(ofport);
> -        tnl_port_add__(ofport, netdev, odp_port, true);
> +        tnl_port_add__(ofport, netdev, odp_port, true, name);
>         changed = true;
>     }
>     ovs_rwlock_unlock(&rwlock);
> @@ -227,8 +240,11 @@ tnl_port_del__(const struct ofport_dpif *ofport) OVS_REQ_WRLOCK(rwlock)
> 
>     tnl_port = tnl_find_ofport(ofport);
>     if (tnl_port) {
> +        const struct netdev_tunnel_config *cfg =
> +            netdev_get_tunnel_config(tnl_port->netdev);
>         struct hmap **map;
> 
> +        tnl_port_delete(tnl_port->match.ip_dst, cfg->dst_port);
>         tnl_port_mod_log(tnl_port, "removing");
>         map = tnl_match_map(&tnl_port->match);
>         hmap_remove(*map, &tnl_port->match_node);
> @@ -639,3 +655,54 @@ tnl_port_get_name(const struct tnl_port *tnl_port) OVS_REQ_RDLOCK(rwlock)
> {
>     return netdev_get_name(tnl_port->netdev);
> }
> +
> +int
> +tnl_port_build_header(const struct ofport_dpif *ofport, ovs_be32 router_ip,
> +                      const struct flow *tnl_flow,
> +                      const char br_name[], uint8_t smac[ETH_ADDR_LEN],
> +                      ovs_be32 ip_src, struct ofpbuf *header,
> +                      bool *send_arp)
> +{
> +    struct tnl_port *tnl_port;
> +    uint8_t dmac[ETH_ADDR_LEN];
> +    struct eth_header *eth;
> +    struct ip_header *ip;
> +    int res;
> +
> +    *send_arp = false;
> +    ovs_rwlock_rdlock(&rwlock);
> +    tnl_port = tnl_find_ofport(ofport);
> +    ovs_assert(tnl_port);
> +
> +    res = tnl_arp_lookup(br_name, router_ip, dmac);
> +    if (res) {
> +        *send_arp = true;
> +        goto out;
> +    }
> +
> +    /* Build Ethernet and IP headers. */
> +    ofpbuf_clear(header);
> +
> +    eth = ofpbuf_put_uninit(header, sizeof *eth);
> +    memcpy(eth->eth_dst, dmac, ETH_ADDR_LEN);
> +    memcpy(eth->eth_src, smac, ETH_ADDR_LEN);
> +    eth->eth_type = htons(ETH_TYPE_IP);
> +
> +    ip = ofpbuf_put_zeros(header, sizeof *ip);
> +    ip->ip_ihl_ver = IP_IHL_VER(5, 4);
> +    ip->ip_tos = tnl_flow->tunnel.ip_tos;
> +    ip->ip_ttl = tnl_flow->tunnel.ip_ttl;
> +    ip->ip_proto = IPPROTO_UDP;
> +    ip->ip_frag_off = (tnl_flow->tunnel.flags & FLOW_TNL_F_DONT_FRAGMENT) ?
> +                      htons(IP_DF) : 0;
> +
> +    put_16aligned_be32(&ip->ip_src, ip_src);
> +    put_16aligned_be32(&ip->ip_dst, tnl_flow->tunnel.ip_dst);
> +    ip->ip_csum = 0;
> +
> +    netdev_build_header(tnl_port->netdev, header);
> +out:
> +    ovs_rwlock_unlock(&rwlock);
> +
> +    return res;
> +}
> diff --git a/ofproto/tunnel.h b/ofproto/tunnel.h
> index 27a2f7d..611c139 100644
> --- a/ofproto/tunnel.h
> +++ b/ofproto/tunnel.h
> @@ -29,10 +29,10 @@ struct ofport_dpif;
> struct netdev;
> 
> bool tnl_port_reconfigure(const struct ofport_dpif *, const struct netdev *,
> -                          odp_port_t);
> +                          odp_port_t, const char name[]);
> 
> void tnl_port_add(const struct ofport_dpif *, const struct netdev *,
> -                  odp_port_t odp_port);
> +                  odp_port_t odp_port, const char name[]);
> void tnl_port_del(const struct ofport_dpif *);
> 
> const struct ofport_dpif *tnl_port_receive(const struct flow *);
> @@ -48,4 +48,10 @@ tnl_port_should_receive(const struct flow *flow)
>     return flow->tunnel.ip_dst != 0;
> }
> 
> +int tnl_port_build_header(const struct ofport_dpif *ofport, ovs_be32 router_ip,
> +                          const struct flow *tnl_flow,
> +                          const char br_name[], uint8_t smac[ETH_ADDR_LEN],
> +                          ovs_be32 ip_src, struct ofpbuf *header,
> +                          bool *send_arp);
> +
> #endif /* tunnel.h */
> diff --git a/rhel/openvswitch.spec.in b/rhel/openvswitch.spec.in
> index 56b7404..b63f00c 100644
> --- a/rhel/openvswitch.spec.in
> +++ b/rhel/openvswitch.spec.in
> @@ -175,6 +175,6 @@ exit 0
> /usr/share/openvswitch/vswitch.ovsschema
> /usr/share/openvswitch/vtep.ovsschema
> %doc COPYING DESIGN INSTALL.SSL NOTICE README.md WHY-OVS FAQ NEWS
> -%doc INSTALL.DPDK rhel/README.RHEL
> +%doc INSTALL.DPDK rhel/README.RHEL README-Userspace-Tunneling
> /var/lib/openvswitch
> /var/log/openvswitch
> diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
> index 6e4929c..56e4a91 100644
> --- a/vswitchd/ovs-vswitchd.8.in
> +++ b/vswitchd/ovs-vswitchd.8.in
> @@ -82,6 +82,10 @@ Some systems do not support \fBmlockall()\fR at all, and other systems
> only allow privileged users, such as the superuser, to use it.
> \fBovs\-vswitchd\fR emits a log message if \fBmlockall()\fR is
> unavailable or unsuccessful.
> +.IP "\fB\-\-userspace-tunneling\fR"
> +Enable userspace tunneling in \fBovs\-vswitchd\fR for userspace datapath. User
> +need to do special configuration to make it work. For more info refer to
> +README-Userspace-Tunneling.
> .
> .SS "DPDK Options"
> .IP "\fB\-\-dpdk\fR"
> diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
> index 3c82f0f..b90a592 100644
> --- a/vswitchd/ovs-vswitchd.c
> +++ b/vswitchd/ovs-vswitchd.c
> @@ -149,6 +149,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
>         OPT_DISABLE_SYSTEM,
>         DAEMON_OPTION_ENUMS,
>         OPT_DPDK,
> +        OPT_ENABLE_USER_TUNNEL,
>     };
>     static const struct option long_options[] = {
>         {"help",        no_argument, NULL, 'h'},
> @@ -163,6 +164,7 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
>         {"enable-dummy", optional_argument, NULL, OPT_ENABLE_DUMMY},
>         {"disable-system", no_argument, NULL, OPT_DISABLE_SYSTEM},
>         {"dpdk", required_argument, NULL, OPT_DPDK},
> +        {"userspace-tunneling", no_argument, NULL, OPT_ENABLE_USER_TUNNEL},
>         {NULL, 0, NULL, 0},
>     };
>     char *short_options = long_options_to_short_options(long_options);
> @@ -211,6 +213,10 @@ parse_options(int argc, char *argv[], char **unixctl_pathp)
>             dp_blacklist_provider("system");
>             break;
> 
> +        case OPT_ENABLE_USER_TUNNEL:
> +            dpif_enable_userspace_tunneling();
> +            break;
> +

IMO we should use probing instead of explicit configuration parameter.

   Jarno


>         case '?':
>             exit(EXIT_FAILURE);
> 
> -- 
> 1.9.3
> 
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev




More information about the dev mailing list