[ovs-dev] [PATCH 00/25] netdev datapath vxlan offload

Eli Britstein elibr at mellanox.com
Wed Feb 5 14:30:46 UTC 2020


On 2/3/2020 7:05 PM, Sriharsha Basavapatna wrote:
> Hi Eli,
>
> Thanks for sending this patchset. I have some questions about the
> design, please see my comments below.
>
> Thanks,
> -Harsha
>
> On Mon, Jan 20, 2020 at 8:39 PM Eli Britstein <elibr at mellanox.com> wrote:
>> In the netdev datapath, packets arriving over a tunnel are processed by
>> more than one flow. For example, a packet that should be decapsulated
>> and forwarded to another port is processed by two flows:
>> 1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
>>     actions:tnl_pop(vxlan_sys_4789).
>> 2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
>>     inner-header matches, actions: vm1.
>>
>> In order to offload such a multi-flow processing path, a multi-table HW
>> model is used.
> Did you explore other ways to avoid this multi-flow processing model ?
> For example, may be support some additional logic to merge the two
> datapath flow-rules for decap into a single flow-rule in the offload
> layer ?
Yes, we did explore this approach, and our research led us to conclude 
that following the OVS multi-table SW model in HW is the more robust 
architecture.
A single-table architecture for tunnel offloads has, for example, the 
following issues:
1. Statistics – flow #1 counts bytes including the outer header. If the 
outer header is IPv6, it may carry extension headers, whose length must 
be accounted per packet. Flow #2 counts the bytes of the inner packet 
only.
2. Update rate – collapsing the multi-table pipeline into a single table 
means taking the Cartesian product of all the flows. For example, N 
flows in table #1 combined with M flows in table #2 yield N*M merged 
rules, so any change to a single flow in one table means a lot of 
removals and insertions of rules in HW.
The single-table model also breaks when offloading more complex actions 
such as CT, dp_hash and others.
>
>> Flow #1 matches are offloaded as is, but tnl_pop is
>> translated to MARK and JUMP to a tnl-port table actions.
> The response to the tnl_pop action is now different between the
> non-offload and offload datapaths, since the offload layer changes it
> into entirely different actions. The user would expect a consistent
> response to the tnl_pop action, especially if you treat this flow rule
> as an independent rule by itself (i.e., assuming there's no flow #2
> subsequently).
>
> Regarding the JUMP action:
> IMO, the offload framework should avoid anything that is too vendor
> specific, and should make use of simple/common abstractions. I'm not
> sure the JUMP action is really needed to support offload of the
> tnl_pop action. Also, it may not map to the real use case for which
> the JUMP action seems to have been defined, which is to support
> flow-grouping (see rte_flow.h).
The rte_flow API does not offer an action equivalent to the SW tnl_pop 
action, which moves the outer headers into a metadata buffer. The JUMP 
action is the equivalent of recirculating on the vport, and the MARK 
action makes sure the user gets a consistent response in SW and HW 
(again, sticking to the SW model): in case of a miss, the packet is 
recovered to the mapped vport and continues SW processing normally.
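To make this concrete, below is a minimal sketch of how flow #1's 
tnl_pop could be expressed in rte_flow terms. The mark value and group 
id are illustrative placeholders, not the values the offload layer 
actually allocates:

/* Sketch: flow #1 (tnl_pop) offloaded as MARK + JUMP. VPORT_MARK and
 * TNL_PORT_GROUP are illustrative placeholders for the per-vport mark
 * and the tnl-port table id allocated by the offload layer. */
#include <rte_byteorder.h>
#include <rte_flow.h>

#define VPORT_MARK     0x1234  /* illustrative: maps to vxlan_sys_4789 */
#define TNL_PORT_GROUP 1       /* illustrative: the tnl-port table */

static struct rte_flow *
offload_tnl_pop(uint16_t pf_port_id, struct rte_flow_error *err)
{
    struct rte_flow_attr attr = { .ingress = 1 };

    /* The outer-header matches of flow #1: eth / ipv4 / udp dst 4789. */
    struct rte_flow_item_udp udp_spec = {
        .hdr = { .dst_port = RTE_BE16(4789) },
    };
    struct rte_flow_item_udp udp_mask = {
        .hdr = { .dst_port = RTE_BE16(0xffff) },
    };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
        { .type = RTE_FLOW_ITEM_TYPE_UDP,
          .spec = &udp_spec, .mask = &udp_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    /* tnl_pop(vxlan_sys_4789) becomes MARK (so a miss can be recovered
     * to the vport in SW) + JUMP to the tnl-port table. No decap here:
     * flow #2 still needs to match the outer headers. */
    struct rte_flow_action_mark mark = { .id = VPORT_MARK };
    struct rte_flow_action_jump jump = { .group = TNL_PORT_GROUP };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
        { .type = RTE_FLOW_ACTION_TYPE_JUMP, .conf = &jump },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(pf_port_id, &attr, pattern, actions, err);
}

Note MARK is set before the JUMP, so even a packet that misses in the 
tnl-port table still carries enough context to be recovered in SW.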
>
>> Flow #2 is
>> offloaded to that tnl-port table, its tunnel matches are translated to
>> outer-header matches, followed by the inner-header ones. The actions
>> start with an implicit DECAP action followed by the flow's actions.
>> The DECAP is done in flow #2. If it were done in flow #1, the outer
>> headers of the packets would have been lost and could not have been
>> matched in flow #2.
> The only missing match field that would allow decap in flow #1 itself
> seems to be the tunnel-id. If that is provided in flow #1, then we may
> not need to match on it again in flow #2, and flow #2 could just match
> on the inner packet headers. This would also make handling a flow-miss
> simple, since you wouldn't need the recovery logic (offload layer
> popping the header, etc.). The packet would be in the right state for
> the next stage of processing in SW if there's a flow-miss after the
> first stage in HW. It also means tnl_pop can be done in flow #1,
> making the behavior consistent with non-offload processing.
>
> The other advantage of providing the tunnel-id in flow #1 is that we
> then wouldn't need to match on the entire packet, including the
> outer headers, in the second stage; only the inner header fields
> would need to be matched in flow #2.
Even if we could somehow pull the matches from flow #2 into flow #1 
(leading back to the single-flow model…) and do the DECAP in flow #1, 
we would then have to change the SW matches in flow #2 for the case 
where it is processed in SW, as the packet would no longer have the 
outer header/tunnel.
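To illustrate, here is a sketch of how flow #2 could land in the 
tnl-port table, reusing TNL_PORT_GROUP from the sketch above. The VNI 
and the destination port id are illustrative placeholders:

/* Sketch: flow #2 in the tnl-port table. The tunnel matches are
 * translated to outer-header matches, followed by the inner-header
 * ones, and the actions start with the implicit decap. */
static struct rte_flow *
offload_tnl_port_flow(uint16_t pf_port_id, struct rte_flow_error *err)
{
    struct rte_flow_attr attr = { .group = TNL_PORT_GROUP, .ingress = 1 };

    struct rte_flow_item_vxlan vxlan_spec = {
        .vni = { 0x00, 0x00, 0x2a },    /* illustrative: VNI 42 */
    };
    struct rte_flow_item_vxlan vxlan_mask = {
        .vni = { 0xff, 0xff, 0xff },
    };
    struct rte_flow_item pattern[] = {
        /* Outer headers, translated from the tunnel matches: */
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
        { .type = RTE_FLOW_ITEM_TYPE_UDP },
        { .type = RTE_FLOW_ITEM_TYPE_VXLAN,
          .spec = &vxlan_spec, .mask = &vxlan_mask },
        /* Inner headers, the flow's original matches: */
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    /* Implicit DECAP first, then the flow's own actions (vm1 here,
     * with an illustrative port id). */
    struct rte_flow_action_port_id vm1 = { .id = 3 };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_VXLAN_DECAP },
        { .type = RTE_FLOW_ACTION_TYPE_PORT_ID, .conf = &vm1 },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(pf_port_id, &attr, pattern, actions, err);
}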
>
>> Furthermore, once the HW processing path is composed of multiple
>> tables, there is a need to handle the case where processing starts
>> in HW but is not completed in HW. For example, only flow #1 is
>> offloaded while flow #2 doesn't exist. For that, the MARK value set
>> in the tnl_pop action is received in SW, the packet is recovered by
>> the mapped vport, and processing continues in SW.
> If it is not possible to provide the tunnel-id in flow #1, then the
> other alternative might be to defer offloading until flow #2 and
> offload only flow #2. You are anyway matching almost the entire outer
> header and the inner header in flow #2, and the decap action is also
> added implicitly to flow #2. Why not avoid offloading flow #1 entirely?
> That way, none of the miss-handling code is required.
See the first argument about following the SW model.
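In rough terms, the SW side of the recovery looks like the sketch 
below. vport_from_mark() and recover_to_vport() are hypothetical 
helpers standing in for the mark-to-miss-context mapping the series 
introduces:

/* Sketch: HW miss recovery. A packet that hit only flow #1 in HW
 * arrives in SW carrying the MARK, with its outer headers intact.
 * Mapping the mark back to the vport lets SW continue exactly where
 * HW stopped, as if tnl_pop had run in SW. */
#include <rte_mbuf.h>

struct netdev;                                         /* OVS vport */
extern struct netdev *vport_from_mark(uint32_t mark);  /* hypothetical */
extern int recover_to_vport(struct netdev *vport, struct rte_mbuf *m);

static int
hw_miss_recover(struct rte_mbuf *m)
{
    uint32_t mark;

    if (!(m->ol_flags & PKT_RX_FDIR_ID)) {
        return -1;           /* no mark: continue normal SW processing */
    }
    mark = m->hash.fdir.hi;  /* the MARK value set by flow #1 */

    return recover_to_vport(vport_from_mark(mark), m);
}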
>
>> As vports are virtual and cannot have rte_flows, offloading a vport
>> generates rte_flows for every PF in the system, as we cannot know
>> which one the packets arrive from.
> Is it possible to track the "original in-port" through this two-stage
> flow processing and generate the rte_flow only on the specific PF that
> received the packet?
We could “track” it only if we tried to merge both flow #1 and flow #2. 
See the first argument about following the SW model.
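In other words, the offload layer ends up doing roughly the following; 
port_is_uplink() is a hypothetical predicate standing in for the 
uplink-query API the series adds:

/* Sketch: a vport is virtual and has no rte_flow support of its own,
 * so the same rte_flow is replicated on every PF (uplink) port. */
#include <rte_ethdev.h>
#include <rte_flow.h>
#include <stdbool.h>

extern bool port_is_uplink(uint16_t port_id);  /* hypothetical */

static int
offload_on_all_pfs(const struct rte_flow_attr *attr,
                   const struct rte_flow_item pattern[],
                   const struct rte_flow_action actions[])
{
    uint16_t pf_port;
    int n_created = 0;

    RTE_ETH_FOREACH_DEV(pf_port) {
        struct rte_flow_error err;

        if (!port_is_uplink(pf_port)) {
            continue;
        }
        /* One rte_flow per PF; all of them are tracked under the
         * same UFID so they can be destroyed together. */
        if (rte_flow_create(pf_port, attr, pattern, actions, &err)) {
            n_created++;
        }
    }
    return n_created;
}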
>
> Thanks,
> -Harsha
>> This series adds support for IPv6, tunnel push actions, and the
>> multi-table model for the decap processing.
>>
>> v1 Travis passed: https://travis-ci.org/elibritstein/OVS/builds/639484233
>>
>> Eli Britstein (14):
>>    dpif-netdev: Add mega ufid in flow add log
>>    netdev-offload-dpdk: Remove pre-validate of patterns function
>>    netdev-offload-dpdk: Add IPv6 pattern matching
>>    netdev-offload-dpdk: Support offload of clone tnl_push/output actions
>>    netdev-offload-dpdk: Support tnl/push using vxlan encap attribute
>>    netdev-offload: Don't use zero flow mark
>>    netdev-offload-dpdk: Introduce map APIs for table id and miss context
>>    netdev-offload-dpdk: Implement HW miss packet recover for vport
>>    netdev-offload-dpdk: Refactor disassociate ufid function
>>    netdev-offload-dpdk: Support tunnel pop action
>>    netdev-offload-dpdk: Implement flow dump create/destroy APIs
>>    netdev-dpdk: Getter function for dpdk devargs API
>>    netdev-dpdk: Introduce get netdev by devargs function
>>    netdev-dpdk-offload: Add vxlan pattern matching function
>>
>> Ilya Maximets (4):
>>    netdev: Allow storing dpif type into netdev structure.
>>    netdev-offload: Use dpif type instead of class.
>>    netdev-offload: Allow offloading to netdev without ifindex.
>>    netdev-offload: Disallow offloading to unrelated tunneling vports.
>>
>> Ophir Munk (6):
>>    dpif-netdev: Make mark allocation public API
>>    dpif-netdev: Add HW miss packet state recover logic
>>    netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
>>      port
>>    netdev-dpdk-offload: Infrastructure for multiple rte_flows per UFID
>>    netdev-dpdk: Add flow_api support for netdev vxlan vports
>>    netdev-offload-dpdk: Support offload for vxlan vport
>>
>> Oz Shlomo (1):
>>    netdev-offload: Add HW miss packet state recover API
>>
>>   Documentation/howto/dpdk.rst  |    5 +-
>>   NEWS                          |    6 +-
>>   lib/dpif-netdev.c             |   70 +--
>>   lib/dpif-netlink.c            |   23 +-
>>   lib/dpif.c                    |   21 +-
>>   lib/netdev-dpdk.c             |   68 +++
>>   lib/netdev-dpdk.h             |    6 +
>>   lib/netdev-offload-dpdk.c     | 1275 +++++++++++++++++++++++++++++++++++------
>>   lib/netdev-offload-provider.h |    6 +
>>   lib/netdev-offload-tc.c       |   11 +-
>>   lib/netdev-offload.c          |  115 ++--
>>   lib/netdev-offload.h          |   23 +-
>>   lib/netdev-provider.h         |    3 +-
>>   lib/netdev.c                  |   16 +
>>   lib/netdev.h                  |    2 +
>>   ofproto/ofproto-dpif-upcall.c |    5 +-
>>   tests/dpif-netdev.at          |   14 +-
>>   tests/ofproto-macros.at       |    3 +-
>>   18 files changed, 1394 insertions(+), 278 deletions(-)
>>
>> --
>> 2.14.5
>>
>> _______________________________________________
>> dev mailing list
>> dev at openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

