[ovs-dev] OVS Micro Summit Notes
Jesse Gross
jesse at nicira.com
Thu Oct 16 15:57:40 UTC 2014
Yesterday, we held a very productive OVS meeting as part of the Linux
Plumbers Conference. Below are the notes that were taken to record
the meeting. Thanks to all who participated!
============================================================
OVS Micro Summit 2014, Oct 15, 2014, Düsseldorf
Attendees:
Johann Tönsing (Netronome), Simon Horman (Netronome), Rob Truesdell
(Netronome), John Fastabend (Intel), Or Gerlitz (Mellanox),
Lori Jakab (Cisco), Jiri Pirko (Red Hat), Alexei Starovoitov
(Plumgrid), Zoltan Lajos Kis (Ericsson), Justin Pettit (VMware), Jesse
Gross (VMware), Thomas Graf (Noiro/Cisco), Daniel Borkmann (Red Hat),
Jiri Benc (Red Hat), Dan Dumitriu (Midokura), Guillermo Ontanon
(Midokura), Thomas Bachman (Noiro/Cisco), Roni (Mellanox)
Agenda
What problems do you face today?
Datapath and vanilla kernel out of sync
Where to put the primary datapath repo?
Copy both ovs-dev and netdev on datapath patch proposals
=> Many agree with this suggestion - no brainer / at least do this
Patches need to be on appropriate repo otherwise people won't look at them
Do development of the kernel part in consultation primarily with
netdev? Red Hat prefers this approach - RHEL can't take code from
other repos
May be problematic due to userspace being so large in OVS
However, is the userspace part critical when proposing patches, or is
the API to userspace sufficient to evaluate changes?
OVS repo contains many backports - keep them there?
Many prefer compat framework for maintaining the backports
What are the gaps currently?
MPLS waiting for net-next opening, ready to be merged
LISP not ready
Negative feedback received on some patches
Multiple user space implementations exist, including proprietary ones
Official userspace datapath on openvswitch.org is used on BSD, DPDK
and for testing
Intel OVDK merging with ovs.org
Proprietary user space datapaths remain
Hardware offload
No interest from VMware to maintain ABI - would be based on source
For Hyper-V this would be aligned with netlink
Offload on flattened flows should be a configurable operation
Partial / selective offload is a must as limited hardware is a given
Where to put the logic to decide what can be offloaded?
User space seems a logical choice, to avoid overloading the kernel with complexity
Even those who only use OVS kernel code agree too much kernel
complexity is not advisable
Perhaps deploy a new entity interfacing to acceleration - different to
OVS kernel + OVS user space - to accommodate those with different
userspace?
(Some even are considering hosting this on the controller)
Ability must be provided to fall back to both kernel and user space
Offload must be possible even in future setups where hardware and
software datapath do not share a common model (e.g. current cached
flow table vs. P4)
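The "partial / selective offload" requirement above can be sketched as a toy model: given the match fields a device reports it supports, userspace offloads only the flows the device can fully express and keeps the rest in the software datapath. The field names and flow/capability shapes here are hypothetical, not any proposed API.

```python
# Toy model of selective offload policy in user space. HW_CAPS stands in
# for whatever capability set the device would report; it is made up.
HW_CAPS = {"in_port", "eth_dst", "ipv4_dst"}

def partition_flows(flows, caps=HW_CAPS):
    """Split flows into (offloadable, software-only) by match-field support."""
    hw, sw = [], []
    for flow in flows:
        # Offload only if every match field is supported by the hardware.
        (hw if set(flow["match"]) <= caps else sw).append(flow)
    return hw, sw

flows = [
    {"match": {"in_port": 1, "eth_dst": "aa:bb:cc:dd:ee:ff"}, "action": "output:2"},
    {"match": {"ipv6_src": "fe80::1"}, "action": "drop"},  # unsupported field
]
hw, sw = partition_flows(flows)
```

The key point from the discussion is that this policy lives above the kernel: the kernel only needs to report capabilities and accept flow insertions, while fallback to software remains available for the `sw` partition.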
Proposals on a kernel offload API:
Jiri Pirko: SWDEV to abstract representation of hardware switches as
net devices with additional NDOs to allow offloading flows
Offload decision is hooked into the OVS kernel datapath, i.e. OVS
calls into the NDO hooks directly.
John Fastabend (Intel): Move the policy decision to user space and
provide a Netlink interface to export hardware capabilities to user
space and allow user space to inject flows into the hardware using a
common API
How do you model the hardware? Capabilities exported as graphs
Patches posted to John's github page
https://github.com/jrfastab/flow-net-next
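The idea of exporting hardware capabilities "as graphs" can be illustrated with a minimal sketch: the device describes its tables as nodes and the legal table-to-table transitions as edges, and userspace checks whether a desired pipeline maps onto that graph. Table names and structure here are invented for illustration, not taken from the posted patches.

```python
# Hypothetical capability graph: nodes are hardware tables, edge lists are
# the transitions the device reports to user space.
CAP_GRAPH = {
    "parser":   ["l2_table"],
    "l2_table": ["l3_table", "egress"],
    "l3_table": ["egress"],
    "egress":   [],
}

def pipeline_supported(chain, graph=CAP_GRAPH):
    """True if every consecutive table pair in the chain is an edge in the graph."""
    return all(b in graph.get(a, []) for a, b in zip(chain, chain[1:]))

ok = pipeline_supported(["parser", "l2_table", "egress"])
```

A graph representation lets userspace reason about which pipeline layouts the hardware can host, rather than the kernel hard-coding that policy.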
Netronome: implemented three acceleration options (seeing use cases
requiring each of these)
usermode / accelerator with entire datapath offloaded (ofproto level hooks)
usermode / accelerator with traffic sent for fallback processing to
usermode (ofproto level hooks)
usermode / kernel / accelerator with traffic sent for fallback
processing to kernel then to usermode (kernel hooks for flow
insertion/deletion and vport add/remove)
Agreement on merging Jiri's and John's proposal into a single generic
Netlink based offload API
Netlink as-is likely too slow to handle both jumbo frames to user
space and high volume flow updates
Memory mapped netlink has just been removed
May need an async message option
Netronome implemented control message based transport from userspace
directly to acceleration hardware - async control msg model
Intel developed a thin netlink layer mapping to messages close to
those required by hardware
Difficult to encode capabilities of diverse hardware platforms
May need to encode capabilities as pluggable code, not just data
Options to model + proposed sequence in which to implement
Easiest: model each hardware / software entity as separate virtual
switch, connect these by internal ports over which packets (without
OpenFlow metadata) flow
More difficult: split at table level - as well as QoS / similar size
major blocks; each table implementable by different hardware /
software instances - need to convey OpenFlow metadata
Hardest: permit each action in action list to be implemented by
different entity - difficult to e.g. hand off OpenFlow / OVS register
contents etc.
Resistance to extending the kernel packet structure with additional metadata
How will userspace know which table fields / match options (exact vs.
wildcard) / actions etc. will be used - so that it can choose the most
efficient model whose semantics the hardware supports?
Table features vs. TTPs vs. OVSDB / OpenStack etc extension etc.
Unclear how it will know - assume for our purposes that it will - but
may need to backtrack if it got this wrong
Security Updates for OVS
New mailing list for 0day incidents
Status updates (see slides in Dropbox)
https://www.dropbox.com/s/t1ikm6ij06z80ex/LoriJakab_OVSMicroConference.pdf?dl=0
LISP (Lori Jakab)
LISP implemented by border routers
Avoids each leaf node in system needing to be in each router
RFCs specify use of LISP for overlays (see slides for details)
One use case is maintaining connectivity for mobile hosts
No OpenFlow support - potential relevant tickets:
EXT-112 (making good progress) + EXT-382 (not guaranteed to proceed -
prototyping stalled + controversial, might be replaced with a different
protocol independent layer)
Working on separate LISP kernel module - analogous to existing
GRE/VXLAN modules to which OVS now interfaces
Was not accepted because it is another route cache
More generic encap mechanism - Generic Protocol Extension - can be
leveraged for LISP
Dislike different next protocol field - would prefer just Ethertype
GENEVE could also be used (only L2 inner has been implemented but
could be L3 according to draft)
By setting various GPE bits e.g. flags to zero a valid LISP packet results
Conntrack / NAT / Crypto
http://openvswitch.org/slides/OpenStack-140513.pdf
Conntrack
No capability to track/share state between packets, i.e. no connection tracking
The existing method to implement reflexive ACLs was to use the learn action
to learn a flow in the opposite direction - minimal security guarantees
and a performance issue
More recently, support for tcp_flags matching was added, allowing matching
on ACK bits - this still does not handle out-of-TCP-window packets
The idea is to use the existing netfilter connection tracking instead
and allow storing/retrieving state
The feature includes a new conntrack() action, allowing packets to be fed
to conntrack, and a conn_state field to match on connection state
Patch supplied for feedback - then integrated zone support from Thomas G.
Still need to enhance the code to send fragmented packets through frag
handling code
Upper performance bound around 6 Gbit/s when going through netfilter
New vendor extension action specifies zone and whether or not to recirculate
Interest exists to supply a compatible userspace implementation (e.g.
PF based or different) - even potentially accelerated - however
interface needs to be clean enough for this (not just mirror Linux's
interface)
Metadata handling needs to be considered
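The conntrack()/conn_state idea above can be illustrated with a toy connection tracker: committing a packet's 5-tuple and classifying later packets as new, reply, or established. The real patches hook into netfilter conntrack; the class and field names below are purely illustrative.

```python
# Toy connection tracker, illustrating the conntrack()/conn_state concept.
# Not OVS's implementation - the real one delegates to netfilter.
class ConnTracker:
    def __init__(self):
        self.table = set()

    def conntrack(self, pkt):
        """Commit the packet's 5-tuple and return its connection state."""
        fwd = (pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"])
        rev = (pkt["dst"], pkt["src"], pkt["proto"], pkt["dport"], pkt["sport"])
        if fwd in self.table:
            return "established"
        if rev in self.table:
            return "reply"  # reverse direction of a known connection
        self.table.add(fwd)
        return "new"

ct = ConnTracker()
pkt = {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": 6, "sport": 1234, "dport": 80}
reply = {"src": "10.0.0.2", "dst": "10.0.0.1", "proto": 6, "sport": 80, "dport": 1234}
state = ct.conntrack(pkt)
```

A reflexive ACL then reduces to matching on this state: permit "new" only from the trusted side, and permit "reply"/"established" traffic back through - exactly what the learn-action workaround could not guarantee.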
NAT
Thomas posted patch to add NAT action assuming connection tracking
state already exists
Would support stateful NAT (translate L4 ports)
Tricky to handle bidirectional traffic
Easiest is to position on one "side" of the switch e.g. at egress to
public port / at ingress from public port - to ensure OpenFlow always
sees private addresses
Initially no need to expose table contents to OpenFlow therefore OK to
expose as NAT + un-NAT actions deployable at separate places in packet
processing pipeline
Expose as synchronized tables once controller vendors want to see
contents of NAT tables
Again Linux kernel datapath only
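The bidirectional-traffic difficulty above can be sketched as a toy stateful NAT: an egress step that rewrites the private source address/port at the public side, and an ingress step that reverses the mapping for replies. Addresses, port ranges, and method names are invented for illustration.

```python
# Toy stateful NAT translating L4 source ports at the "public" side of the
# switch, so OpenFlow always sees private addresses. Illustration only.
class Nat:
    def __init__(self, public_ip, base_port=30000):
        self.public_ip = public_ip
        self.next_port = base_port
        self.fwd = {}  # (priv_ip, priv_port) -> pub_port
        self.rev = {}  # pub_port -> (priv_ip, priv_port)

    def egress(self, pkt):
        """NAT action: rewrite source on the way to the public port."""
        key = (pkt["src"], pkt["sport"])
        if key not in self.fwd:
            self.fwd[key] = self.next_port
            self.rev[self.next_port] = key
            self.next_port += 1
        return dict(pkt, src=self.public_ip, sport=self.fwd[key])

    def ingress(self, pkt):
        """Un-NAT action: restore the private destination for replies."""
        priv_ip, priv_port = self.rev[pkt["dport"]]
        return dict(pkt, dst=priv_ip, dport=priv_port)

nat = Nat("198.51.100.1")
out = nat.egress({"src": "10.0.0.5", "dst": "93.184.216.34",
                  "sport": 1234, "dport": 80})
back = nat.ingress({"src": "93.184.216.34", "dst": "198.51.100.1",
                    "sport": 80, "dport": out["sport"]})
```

Positioning the two actions at opposite "sides" of the pipeline, as suggested above, is what keeps the translation invisible to the rest of the OpenFlow tables.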
Crypto (IPsec)
Similar to conntrack - currently kernel feature, would be missing in
userspace etc.
Prefer to have a mechanism to deploy in userspace, kernel and accelerators
Question is whether OK to keep outside OpenFlow vs. whether parts?
all? needs to be exposed to OpenFlow
eBPF based datapath (Alexei S. @ PLUMgrid)
Current focus of eBPF is on tracing
Motivation to integrate eBPF with the OVS kernel datapath
Provides additional programmability similar to P4 vs OF 1.x - some
semantics e.g. complex logic difficult to express using tables
(Separate) Use cases:
Non OpenFlow - e.g. potentially traditional networking - flexibly
deploy e.g. L2 with learning etc.
High level optimization similar to nftables
Possible approach to protocol independent parsing exposed through OF
Can act as glue between tables, does not need to replace matching in
particular tables
Could also replace / encode parsing logic
Incremental parsing (for improved performance) vs up front complete parsing
To reconcile this with option 1... can use a BPF based pre-flow-table
option (for parsing only) and post-flow-table option (further
processing)
If table is empty can further optimize this by having pre and post be
replaced with simpler unified one
Is this needed though - if we ignore PIF type usage? Could for
example use this to obtain TCP window sizes for analytics...
Q: Does this need to integrate with OVS? Why not just hook into
ingress at the netdev? A: More options are available w.r.t. where to
divert traffic when integrated.
Option 1a: Keep existing megaflow hash tables and call eBPF on flow miss
Option 1b: eBPF as an action
Consensus that this is the easiest to implement
BPF program is provided by user space (not necessarily exposed to
controller - initially not)
Could provide an easy angle for new actions without requiring to go
through the heavy process of adding a new datapath action
Not as flexible as C, which is good, as can potentially compile to
certain hardware platforms
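Options 1a and 1b can be sketched together in a toy dispatcher: a masked flow-table lookup that, on miss, falls back to a user-supplied run-to-completion program standing in for the eBPF code. The table format and function names below are hypothetical, not the OVS datapath's actual structures.

```python
# Toy option-1a dispatch: masked flow lookup first, a "program" on miss.
# A plain Python function stands in for the eBPF program here.
def lookup(flow_table, pkt):
    """flow_table: list of (match_fields, {masked_key: action}) pairs."""
    for fields, matches in flow_table:
        key = tuple(sorted((f, pkt[f]) for f in fields))
        action = matches.get(key)
        if action is not None:
            return action
    return None

def process(flow_table, miss_program, pkt):
    action = lookup(flow_table, pkt)
    return action if action is not None else miss_program(pkt)

table = [({"in_port"}, {(("in_port", 1),): "output:2"})]
drop_unknown = lambda pkt: "drop"  # the "BPF program" run on flow miss

hit = process(table, drop_unknown, {"in_port": 1, "eth_type": 0x0800})
miss = process(table, drop_unknown, {"in_port": 7})
```

Option 1b differs only in where the program hangs: instead of the miss path, the returned action itself could name a program to execute.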
Option 2: Replace full lookup & execution with BPF code
Potential Option 4: Table matches fields, additional table column
contains expression which also needs to be matched (here tables are
main control logic, expression is add on to each row)
Expressiveness: limited execution time run to completion (no loops);
can call out to functions (which could be implemented in hardware or
software)
Conceptually a program is set of connected netdevs, each with multiple
ports, which are connected in some topology; can collapse some nodes
into fewer for improved performance
Potential concerns on compatibility with existing ABI and requirement
on maintaining two parallel datapath implementations going forward
(flow lookup and BPF)
Would need to keep the old configuration ABI. Possibly provide compat
through a BPF program. Initially retain existing C code for parsing
as faster anyway.
Can't break userspace if default behavior remains unchanged as
userspace would know whether a program / which program has been
downloaded
Need to constrain which kernel functions are permissible to call -
e.g. output to port, add header, compute checksums
Especially important when permitting userspace and accelerated target
platforms too
Take care not to disrupt existing GSO checksum handling e.g. related
metadata / flags / offsets prepended to packets, or explicitly permit
these to be set
Code can be made available after a rebase onto the latest BPF changes
Exposing the idea of BPF to the controller opens a new set of questions
Conceptually need overall control flow mechanism (around say OVS,
IPsec, QoS etc), and a detailed packet manipulation mechanism - need
to decide which of these eBPF will perform (only detailed vs both...)
and how to expose this
Would determine where to hook it in and which people need to be
involved (OVS vs general Linux community vs. OpenFlow etc.)
Steps forward:
0. Add BPF program invocation to sockets
0.5 Extend cls_bpf with eBPF capability (Daniel will take care ;))
1. Add read-only BPF program as actions to OVS - used for convenience
of userspace - not exposed to OpenFlow (not even as custom action)
2. Enable programs to write to packets and forward packets to ports -
again initially not exposed to OpenFlow
3. Add ABI to handle encapsulation w/ offload
4. Add possibility to run BPF program on flow miss
M. Implement only in userspace without kernel... e.g. on DPDK (for
some value of M... depends on market demand)
Enables BPF programs to be exposed - e.g. downloaded by controller -
and running on the various available hardware / software platforms
N. Implementations for acceleration hardware
N+1. eBPF-only OVS datapath
Further discussion of the "outer control flow" in ONF Forwarding
Abstractions WG, and of Protocol Independent Forwarding part in ONF
PIF open source project
Schedule follow up discussions on next meetups
Start advertising the idea on blog
OpenFlow API for encap metadata
Geneve and other encap protocols introduce metadata (options conveyed
in packets)
The question is how to expose this metadata with OpenFlow
Considering passing through to userspace and beyond as opaque values somehow
GENEVE type space is large - would consume entire OXM space => need to
extend OXM class
Recently experimenter OXM ID space size was reduced - see
https://rs.opennetworking.org/bugs/browse/EXT-380 and ensuing
discussion
Nevertheless could use experimenter OXM encoding for this - use a
dedicated experimenter ID
Desire to handle proprietary encap protocols with metadata in a way
that allows mapping to Geneve TLVs in the future
An eBPF converter to map generic tunnel metadata to specific protocol
headers would provide sufficient flexibility
Issues are representing this within a switch, accessible via matching
/ actions, and across the network, as a lighter weight than packet
in/out but more expressive than tunnel format
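The Geneve TLV representation that proprietary metadata would map onto can be sketched with a minimal encoder/decoder following the option layout in the Geneve draft (16-bit option class, 8-bit type, 3 reserved bits plus a 5-bit length in 4-byte words). The values used are arbitrary examples; experimenter-ID allocation is out of scope here.

```python
import struct

# Minimal Geneve option TLV codec per the draft's layout. The top 3 bits
# of the length byte are reserved and left zero here.
def encode_option(opt_class, opt_type, data):
    if len(data) % 4:
        raise ValueError("option data must be a multiple of 4 bytes")
    return struct.pack("!HBB", opt_class, opt_type, (len(data) // 4) & 0x1F) + data

def decode_option(buf):
    opt_class, opt_type, length = struct.unpack("!HBB", buf[:4])
    return opt_class, opt_type, buf[4:4 + 4 * (length & 0x1F)]

# Arbitrary example: a 4-byte opaque value under a made-up option class.
opt = encode_option(0xFFFF, 0x01, b"\x00\x00\x00\x2a")
```

Exposing such TLVs to OpenFlow as opaque (class, type, data) triples is what motivates the OXM class extension discussed above.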
Zoltan - packet processors - https://rs.opennetworking.org/bugs/browse/EXT-122
Examples of issues with the existing logical port scheme: these cannot
be chained, and actions cannot vary based on e.g. whether the MTU is
exceeded
Therefore need more flexible mechanism
Can perform opaque operation in ASIC, or pipe to control processor to
perform it, and back
Invoke these via experimenter IDs
See also tasks proposal - slides attached to
https://rs.opennetworking.org/bugs/browse/EXT-494 - this refactors
action set/list, actions vs instructions, flow vs group vs egress
tables etc.
See also protocol independent forwarding - would have built-in actions
/ functions as well as externally named opaque functions which can be
invoked
Other potential features to work on
No major wish list items for OpenFlow control protocol level since
more recent OpenFlow versions have been implemented
QoS / metering issues: accuracy of implementations (easier to achieve
with hardware than software); representation in OpenFlow / OF-Config
poorly defined
John to provide RFC patchset to allow hardware offload of TBF per
queue and eventually HTB for flat hierarchy
Consider deriving abstraction from the various implementations - then
define generic way to expose to OpenFlow / OF-Config / OVSDB etc.
Consensus on organizing meetups like this again in the future
Perhaps paste wishlist items into a document