[ovs-dev] OVS Micro Summit Notes
Jesse Gross
jesse at nicira.com
Thu Oct 16 15:57:40 UTC 2014
Yesterday, we held a very productive OVS meeting as part of the Linux
Plumbers Conference. Below are the notes that were taken to record
the meeting. Thanks to all who participated!
============================================================
OVS Micro Summit 2014, Oct 15, 2014, Düsseldorf
Attendees:
Johann Tönsing (Netronome), Simon Horman (Netronome), Rob Truesdell
(Netronome), John Fastabend (Intel), Or Gerlitz (Mellanox),
Lori Jakab (Cisco), Jiri Pirko (Red Hat), Alexei Starovoitov
(Plumgrid), Zoltan Lajos Kis (Ericsson), Justin Pettit (VMware), Jesse
Gross (VMware), Thomas Graf (Noiro/Cisco), Daniel Borkmann (Red Hat),
Jiri Benc (Red Hat), Dan Dumitriu (Midokura), Guillermo Ontanon
(Midokura), Thomas Bachman (Noiro/Cisco), Roni (Mellanox)
Agenda
What problems do you face today?
Datapath and vanilla kernel out of sync
Where to put the primary datapath repo?
Copy both ovs-dev and netdev on datapath patch proposals
=> Many agree with this suggestion - no brainer / at least do this
Patches need to be on appropriate repo otherwise people won't look at them
Do development of the kernel part in consultation primarily with
netdev? Red Hat prefers this approach - RHEL can't take code from
other repos
May be problematic due to userspace being so large in OVS
However, is the userspace part critical when proposing patches, or is
the API to userspace sufficient to evaluate changes?
OVS repo contains many backports - keep them there?
Many prefer compat framework for maintaining the backports
What are the gaps currently?
MPLS waiting for net-next opening, ready to be merged
LISP not ready
Negative feedback received on some patches
Multiple user space implementations exist, including proprietary ones
Official userspace datapath on openvswitch.org is used on BSD, DPDK
and for testing
Intel OVDK merging with ovs.org
Proprietary user space datapaths remain
Hardware offload
No interest from VMware to maintain ABI - would be based on source
For Hyper-V this would be aligned with netlink
Offload on flattened flows should be a configurable operation
Partial / selective offload is a must as limited hardware is a given
Where to put the logic to decide what can be offloaded?
User space seems a logical choice, to avoid overloading the kernel with complexity
Even those who only use OVS kernel code agree too much kernel
complexity is not advisable
Perhaps deploy a new entity interfacing to acceleration - different to
OVS kernel + OVS user space - to accommodate those with different
userspace?
(Some even are considering hosting this on the controller)
Ability must be provided to fall back to both kernel and user space
Offload must be possible even in future setups where hardware and
software datapath do not share a common model (e.g. current cached
flow table vs. P4)
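The "partial / selective offload" requirement above can be sketched as a toy model: given the match fields a device reports it supports, userspace offloads only the flows the device can fully express and keeps the rest in the software datapath. The field names and flow/capability shapes here are hypothetical, not any proposed API.

```python
# Toy model of selective offload policy in user space. HW_CAPS stands in
# for whatever capability set the device would report; it is made up.
HW_CAPS = {"in_port", "eth_dst", "ipv4_dst"}

def partition_flows(flows, caps=HW_CAPS):
    """Split flows into (offloadable, software-only) by match-field support."""
    hw, sw = [], []
    for flow in flows:
        # Offload only if every match field is supported by the hardware.
        (hw if set(flow["match"]) <= caps else sw).append(flow)
    return hw, sw

flows = [
    {"match": {"in_port": 1, "eth_dst": "aa:bb:cc:dd:ee:ff"}, "action": "output:2"},
    {"match": {"ipv6_src": "fe80::1"}, "action": "drop"},  # unsupported field
]
hw, sw = partition_flows(flows)
```

The key point from the discussion is that this policy lives above the kernel: the kernel only needs to report capabilities and accept flow insertions, while fallback to software remains available for the `sw` partition.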
Proposals on a kernel offload API:
Jiri Pirko: SWDEV to abstract representation of hardware switches as
net devices with additional NDOs to allow offloading flows
Offload decision is hooked into the OVS kernel datapath, i.e. OVS
calls into the NDO hooks directly.
John Fastabend (Intel): Move the policy decision to user space and
provide a Netlink interface to export hardware capabilities to user
space and allow user space to inject flows into the hardware using a
common API
How do you model the hardware? Capabilities exported as graphs
Patches posted to John's github page
https://github.com/jrfastab/flow-net-next
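The idea of exporting hardware capabilities "as graphs" can be illustrated with a minimal sketch: the device describes its tables as nodes and the legal table-to-table transitions as edges, and userspace checks whether a desired pipeline maps onto that graph. Table names and structure here are invented for illustration, not taken from the posted patches.

```python
# Hypothetical capability graph: nodes are hardware tables, edge lists are
# the transitions the device reports to user space.
CAP_GRAPH = {
    "parser":   ["l2_table"],
    "l2_table": ["l3_table", "egress"],
    "l3_table": ["egress"],
    "egress":   [],
}

def pipeline_supported(chain, graph=CAP_GRAPH):
    """True if every consecutive table pair in the chain is an edge in the graph."""
    return all(b in graph.get(a, []) for a, b in zip(chain, chain[1:]))

ok = pipeline_supported(["parser", "l2_table", "egress"])
```

A graph representation lets userspace reason about which pipeline layouts the hardware can host, rather than the kernel hard-coding that policy.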
Netronome: implemented three acceleration options (seeing use cases
requiring each of these)
usermode / accelerator with entire datapath offloaded (ofproto level hooks)
usermode / accelerator with traffic sent for fallback processing to
usermode (ofproto level hooks)
usermode / kernel / accelerator with traffic sent for fallback
processing to kernel then to usermode (kernel hooks for flow
insertion/deletion and vport add/remove)
Agreement on merging Jiri's and John's proposal into a single generic
Netlink based offload API
Netlink as-is likely too slow to handle both jumbo frames to user
space and high volume flow updates
Memory mapped netlink has just been removed
May need an async message option
Netronome implemented control message based transport from userspace
directly to acceleration hardware - async control msg model
Intel developed a thin netlink layer mapping to messages close to
those required by hardware
Difficult to encode capabilities of diverse hardware platforms
May need to encode capabilities as pluggable code, not just data
Options to model + proposed sequence in which to implement
Easiest: model each hardware / software entity as separate virtual
switch, connect these by internal ports over which packets (without
OpenFlow metadata) flow
More difficult: split at table level - as well as QoS / similar size
major blocks; each table implementable by different hardware /
software instances - need to convey OpenFlow metadata
Hardest: permit each action in action list to be implemented by
different entity - difficult to e.g. hand off OpenFlow / OVS register
contents etc.
Resistance to extending the kernel packet structure with additional metadata
How will userspace know which table fields / match options (exact vs.
wildcard) / actions etc. will be used - so that it can choose the most
efficient model whose semantics the hardware supports?
Table features vs. TTPs vs. OVSDB / OpenStack etc extension etc.
Unclear how it will know - assume for our purposes that it will - but
may need to backtrack if it got this wrong
Security Updates for OVS
New mailing list for 0day incidents
Status updates (see slides in Dropbox)
https://www.dropbox.com/s/t1ikm6ij06z80ex/LoriJakab_OVSMicroConference.pdf?dl=0
LISP (Lori Jakab)
LISP implemented by border routers
Avoids each leaf node in system needing to be in each router
RFCs specify use of LISP for overlays (see slides for details)
One use case is maintaining connectivity for mobile hosts
No OpenFlow support - potential relevant tickets:
EXT-112 (making good progress) + EXT-382 (not guaranteed to proceed -
prototyping stalled + controversial, might be replaced with a different
protocol independent layer)
Working on separate LISP kernel module - analogous to existing
GRE/VXLAN modules to which OVS now interfaces
Was not accepted because it is another route cache
More generic encap mechanism - Generic Protocol Extension - can be
leveraged for LISP
Dislike different next protocol field - would prefer just Ethertype
GENEVE could also be used (only L2 inner has been implemented but
could be L3 according to draft)
By setting various GPE bits e.g. flags to zero a valid LISP packet results
Conntrack / NAT / Crypto
http://openvswitch.org/slides/OpenStack-140513.pdf
Conntrack
No capability to track/share state between packets, i.e. no connection tracking
The existing method to implement reflexive ACLs was to use the learn action
to learn a flow in the opposite direction - minimal security guarantees
and a performance issue
More recently, support for tcp_flags matching was added, allowing matching
on ACK bits - this still does not handle out-of-TCP-window packets
The idea is to use the existing netfilter connection tracking instead
and allow storing/retrieving state
The feature includes a new conntrack() action, allowing packets to be fed
to conntrack, and a conn_state field to match on connection state
Patch supplied for feedback - then integrated zone support from Thomas G.
Still need to enhance the code to send fragmented packets through frag
handling code
Upper performance bound around 6 Gbit/s when going through netfilter
New vendor extension action specifies zone and whether or not to recirculate
Interest exists to supply a compatible userspace implementation (e.g.
PF based or different) - even potentially accelerated - however
interface needs to be clean enough for this (not just mirror Linux's
interface)
Metadata handling needs to be considered
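The conntrack()/conn_state idea above can be illustrated with a toy connection tracker: committing a packet's 5-tuple and classifying later packets as new, reply, or established. The real patches hook into netfilter conntrack; the class and field names below are purely illustrative.

```python
# Toy connection tracker, illustrating the conntrack()/conn_state concept.
# Not OVS's implementation - the real one delegates to netfilter.
class ConnTracker:
    def __init__(self):
        self.table = set()

    def conntrack(self, pkt):
        """Commit the packet's 5-tuple and return its connection state."""
        fwd = (pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"])
        rev = (pkt["dst"], pkt["src"], pkt["proto"], pkt["dport"], pkt["sport"])
        if fwd in self.table:
            return "established"
        if rev in self.table:
            return "reply"  # reverse direction of a known connection
        self.table.add(fwd)
        return "new"

ct = ConnTracker()
pkt = {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": 6, "sport": 1234, "dport": 80}
reply = {"src": "10.0.0.2", "dst": "10.0.0.1", "proto": 6, "sport": 80, "dport": 1234}
state = ct.conntrack(pkt)
```

A reflexive ACL then reduces to matching on this state: permit "new" only from the trusted side, and permit "reply"/"established" traffic back through - exactly what the learn-action workaround could not guarantee.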
NAT
Thomas posted patch to add NAT action assuming connection tracking
state already exists
Would support stateful NAT (translate L4 ports)
Tricky to handle bidirectional traffic
Easiest is to position on one "side" of the switch e.g. at egress to
public port / at ingress from public port - to ensure OpenFlow always
sees private addresses
Initially no need to expose table contents to OpenFlow therefore OK to
expose as NAT + un-NAT actions deployable at separate places in packet
processing pipeline
Expose as synchronized tables once controller vendors want to see
contents of NAT tables
Again Linux kernel datapath only
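The bidirectional-traffic difficulty above can be sketched as a toy stateful NAT: an egress step that rewrites the private source address/port at the public side, and an ingress step that reverses the mapping for replies. Addresses, port ranges, and method names are invented for illustration.

```python
# Toy stateful NAT translating L4 source ports at the "public" side of the
# switch, so OpenFlow always sees private addresses. Illustration only.
class Nat:
    def __init__(self, public_ip, base_port=30000):
        self.public_ip = public_ip
        self.next_port = base_port
        self.fwd = {}  # (priv_ip, priv_port) -> pub_port
        self.rev = {}  # pub_port -> (priv_ip, priv_port)

    def egress(self, pkt):
        """NAT action: rewrite source on the way to the public port."""
        key = (pkt["src"], pkt["sport"])
        if key not in self.fwd:
            self.fwd[key] = self.next_port
            self.rev[self.next_port] = key
            self.next_port += 1
        return dict(pkt, src=self.public_ip, sport=self.fwd[key])

    def ingress(self, pkt):
        """Un-NAT action: restore the private destination for replies."""
        priv_ip, priv_port = self.rev[pkt["dport"]]
        return dict(pkt, dst=priv_ip, dport=priv_port)

nat = Nat("198.51.100.1")
out = nat.egress({"src": "10.0.0.5", "dst": "93.184.216.34",
                  "sport": 1234, "dport": 80})
back = nat.ingress({"src": "93.184.216.34", "dst": "198.51.100.1",
                    "sport": 80, "dport": out["sport"]})
```

Positioning the two actions at opposite "sides" of the pipeline, as suggested above, is what keeps the translation invisible to the rest of the OpenFlow tables.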
Crypto (IPsec)
Similar to conntrack - currently kernel feature, would be missing in
userspace etc.
Prefer to have a mechanism to deploy in userspace, kernel and accelerators
Question is whether OK to keep outside OpenFlow vs. whether parts?
all? needs to be exposed to OpenFlow
eBPF based datapath (Alexei S. @ PLUMgrid)
Current focus of eBPF is on tracing
Motivation to integrate eBPF with the OVS kernel datapath
Provides additional programmability similar to P4 vs OF 1.x - some
semantics e.g. complex logic difficult to express using tables
(Separate) Use cases:
Non OpenFlow - e.g. potentially traditional networking - flexibly
deploy e.g. L2 with learning etc.
High level optimization similar to nftables
Possible approach to protocol independent parsing exposed through OF
Can act as glue between tables, does not need to replace matching in
particular tables
Could also replace / encode parsing logic
Incremental parsing (for improved performance) vs up front complete parsing
To reconcile this with option 1... can use a BPF based pre-flow-table
option (for parsing only) and post-flow-table option (further
processing)
If table is empty can further optimize this by having pre and post be
replaced with simpler unified one
Is this needed though - if we ignore PIF type usage? Could for
example use this to obtain TCP window sizes for analytics...
Q: Does this need to integrate with OVS? Why not just hook into
ingress at the netdev? A: More options are available w.r.t. where to
divert traffic when integrated.
Option 1a: Keep existing megaflow hash tables and call eBPF on flow miss
Option 1b: eBPF as an action
Consensus that this is the easiest to implement
BPF program is provided by user space (not necessarily exposed to
controller - initially not)
Could provide an easy angle for new actions without requiring to go
through the heavy process of adding a new datapath action
Not as flexible as C, which is good, as can potentially compile to
certain hardware platforms
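Options 1a and 1b can be sketched together in a toy dispatcher: a masked flow-table lookup that, on miss, falls back to a user-supplied run-to-completion program standing in for the eBPF code. The table format and function names below are hypothetical, not the OVS datapath's actual structures.

```python
# Toy option-1a dispatch: masked flow lookup first, a "program" on miss.
# A plain Python function stands in for the eBPF program here.
def lookup(flow_table, pkt):
    """flow_table: list of (match_fields, {masked_key: action}) pairs."""
    for fields, matches in flow_table:
        key = tuple(sorted((f, pkt[f]) for f in fields))
        action = matches.get(key)
        if action is not None:
            return action
    return None

def process(flow_table, miss_program, pkt):
    action = lookup(flow_table, pkt)
    return action if action is not None else miss_program(pkt)

table = [({"in_port"}, {(("in_port", 1),): "output:2"})]
drop_unknown = lambda pkt: "drop"  # the "BPF program" run on flow miss

hit = process(table, drop_unknown, {"in_port": 1, "eth_type": 0x0800})
miss = process(table, drop_unknown, {"in_port": 7})
```

Option 1b differs only in where the program hangs: instead of the miss path, the returned action itself could name a program to execute.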
Option 2: Replace full lookup & execution with BPF code
Potential Option 4: Table matches fields, additional table column
contains expression which also needs to be matched (here tables are
main control logic, expression is add on to each row)
Expressiveness: limited execution time run to completion (no loops);
can call out to functions (which could be implemented in hardware or
software)
Conceptually a program is set of connected netdevs, each with multiple
ports, which are connected in some topology; can collapse some nodes
into fewer for improved performance
Potential concerns on compatibility with existing ABI and requirement
on maintaining two parallel datapath implementations going forward
(flow lookup and BPF)
Would need to keep the old configuration ABI. Possibly provide compat
through a BPF program. Initially retain existing C code for parsing
as faster anyway.
Can't break userspace if default behavior remains unchanged as
userspace would know whether a program / which program has been
downloaded
Need to constrain which kernel functions are permissible to call -
e.g. output to port, add header, compute checksums
Especially important when permitting userspace and accelerated target
platforms too
Take care not to disrupt existing GSO checksum handling e.g. related
metadata / flags / offsets prepended to packets, or explicitly permit
these to be set
Code can be made available after a rebase onto the latest BPF changes
Exposing the idea of BPF to the controller opens a new set of questions
Conceptually need overall control flow mechanism (around say OVS,
IPsec, QoS etc), and a detailed packet manipulation mechanism - need
to decide which of these eBPF will perform (only detailed vs both...)
and how to expose this
Would determine where to hook it in and which people need to be
involved (OVS vs general Linux community vs. OpenFlow etc.)
Steps forward:
0. Add BPF program invocation to sockets
0.5 Extend cls_bpf with eBPF capability (Daniel will take care ;))
1. Add read-only BPF program as actions to OVS - used for convenience
of userspace - not exposed to OpenFlow (not even as custom action)
2. Enable programs to write to packets and forward packets to ports -
again initially not exposed to OpenFlow
3. Add ABI to handle encapsulation w/ offload
4. Add possibility to run BPF program on flow miss
M. Implement only in userspace without kernel... e.g. on DPDK (for
some value of M... depends on market demand)
Enables BPF programs to be exposed - e.g. downloaded by controller -
and running on the various available hardware / software platforms
N. Implementations for acceleration hardware
N+1. eBPF-only OVS datapath
Further discussion of the "outer control flow" in ONF Forwarding
Abstractions WG, and of Protocol Independent Forwarding part in ONF
PIF open source project
Schedule follow up discussions on next meetups
Start advertising the idea on blog
OpenFlow API for encap metadata
Geneve and other encap protocols introduce metadata (options conveyed
in packets)
The question is how to expose this metadata with OpenFlow
Considering passing through to userspace and beyond as opaque values somehow
GENEVE type space is large - would consume entire OXM space => need to
extend OXM class
Recently experimenter OXM ID space size was reduced - see
https://rs.opennetworking.org/bugs/browse/EXT-380 and ensuing
discussion
Nevertheless could use experimenter OXM encoding for this - use a
dedicated experimenter ID
Desire to handle proprietary encap protocols with metadata in a way
that allows mapping to Geneve TLVs in the future
An eBPF converter to map generic tunnel metadata to specific protocol
headers would provide sufficient flexibility
Issues are representing this within a switch, accessible via matching
/ actions, and across the network, as a lighter weight than packet
in/out but more expressive than tunnel format
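The Geneve TLV representation that proprietary metadata would map onto can be sketched with a minimal encoder/decoder following the option layout in the Geneve draft (16-bit option class, 8-bit type, 3 reserved bits plus a 5-bit length in 4-byte words). The values used are arbitrary examples; experimenter-ID allocation is out of scope here.

```python
import struct

# Minimal Geneve option TLV codec per the draft's layout. The top 3 bits
# of the length byte are reserved and left zero here.
def encode_option(opt_class, opt_type, data):
    if len(data) % 4:
        raise ValueError("option data must be a multiple of 4 bytes")
    return struct.pack("!HBB", opt_class, opt_type, (len(data) // 4) & 0x1F) + data

def decode_option(buf):
    opt_class, opt_type, length = struct.unpack("!HBB", buf[:4])
    return opt_class, opt_type, buf[4:4 + 4 * (length & 0x1F)]

# Arbitrary example: a 4-byte opaque value under a made-up option class.
opt = encode_option(0xFFFF, 0x01, b"\x00\x00\x00\x2a")
```

Exposing such TLVs to OpenFlow as opaque (class, type, data) triples is what motivates the OXM class extension discussed above.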
Zoltan - packet processors - https://rs.opennetworking.org/bugs/browse/EXT-122
Examples of issues with the existing logical port scheme: these cannot
be chained, and actions cannot vary based on e.g. whether the MTU is
exceeded
Therefore need more flexible mechanism
Can perform opaque operation in ASIC, or pipe to control processor to
perform it, and back
Invoke these via experimenter IDs
See also tasks proposal - slides attached to
https://rs.opennetworking.org/bugs/browse/EXT-494 - this refactors
action set/list, actions vs instructions, flow vs group vs egress
tables etc.
See also protocol independent forwarding - would have built-in actions
/ functions as well as externally named opaque functions which can be
invoked
Other potential features to work on
No major wish list items for OpenFlow control protocol level since
more recent OpenFlow versions have been implemented
QoS / metering issues: accuracy of implementations (easier to achieve
with hardware than software); representation in OpenFlow / OF-Config
poorly defined
John to provide RFC patchset to allow hardware offload of TBF per
queue and eventually HTB for flat hierarchy
Consider deriving abstraction from the various implementations - then
define generic way to expose to OpenFlow / OF-Config / OVSDB etc.
Consensus on organizing meetups like this again in the future
Perhaps paste wishlist items into a document