[ovs-dev] [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow

Chandran, Sugesh sugesh.chandran at intel.com
Wed Jan 10 14:53:19 UTC 2018


Hi All,

Based on the comments below, I would like to call a meeting to decide what further needs to be done to handle different kinds of hardware acceleration in OVS-DPDK. We have a POC implementation that partially handles some of the cases, but it only works with some Intel hardware acceleration products. I also gave a presentation about that proposal at the last OVS conference. I feel it is time to discuss this now and come up with a design that works for all the different hardware acceleration devices and possible acceleration modes.
If everyone agrees with the above suggestion, I can send out a meeting invite for sometime next week (perhaps Jan 15th, Monday, 5:00 PM GMT). Let me know if this timeslot is OK for everyone.

I am ccing the 'dev' mailing list as well, in case anyone else wants to join the call and provide input.


Regards
_Sugesh


> -----Original Message-----
> From: Chandran, Sugesh
> Sent: Wednesday, January 10, 2018 2:38 PM
> To: 'Yuanhan Liu' <yliu at fridaylinux.org>
> Cc: Finn Christensen <fc at napatech.com>; dev at openvswitch.org; Darrell Ball
> <dball at vmware.com>; Simon Horman <simon.horman at netronome.com>;
> Stokes, Ian <ian.stokes at intel.com>
> Subject: RE: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow
> 
> Hi Yuanhan,
> 
> Thank you for your reply! ...and for the wishes too :)
> 
> 
> At a very high level, my comments are on the approach taken for enabling
> hardware acceleration in OVS-DPDK. In general, the proposal works very well for
> the specific partial offload use case, i.e. alleviating the miniflow extract and
> optimizing the megaflow lookup. I totally agree on the performance advantage
> aspect of it. However, I expect hardware acceleration in OVS-DPDK to have a much
> broader scope than just flow lookup optimization. I raised a few comments around
> these points earlier, when the previous versions of the patches were released.
> 
> Once again I am sharing my thoughts here to initiate a discussion on this area.
> In my view, hardware acceleration in OVS-DPDK:
> 1)  Must understand the hardware and its capabilities. I would prefer proactive
> hardware management over a reactive model. This approach lets the software
> collect the relevant hardware information in advance, before interacting with
> it. It will have an impact on all port-specific operations, including port init,
> device programming, etc.
> 2)  Should define hardware acceleration modes to support different acceleration
> methods. This will allow OVS to work with various hardware devices based on
> their capabilities.
> 3)  Would ideally use a device + port model. This is helpful when enumerating
> device-bound characteristics (such as supported protocols, TCAM size, etc.). It
> also helps when handling flows that span different hardware acceleration
> devices.
> 4)  Could use the device model mentioned in (3) to distribute hardware flow
> installation across multiple threads, rather than the single thread in the
> current implementation. It may be possible to have one flow thread per device
> to avoid contention.
> 5)  Needs device-centric action translation and flow validation, so a single
> set of functions may not work for every use case. Consider another partial
> hardware acceleration approach, 'zero copy', where packets are copied directly
> to the VM queues by the NIC. For zero-copy acceleration, the actions cannot be
> 'mark and RSS'; instead it will be a specific queue action. Similarly, each
> acceleration mode will have its own validations and set of actions. It may be
> good to have function hooks for these operations based on the acceleration
> mode; a rough sketch follows below.
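> 
> To illustrate (5), here is a rough sketch of what per-mode hooks could look
> like. All of the names below are hypothetical and only meant to show the idea
> of mode-specific validate/translate callbacks, not an actual OVS API:
> 
>     #include <stddef.h>
> 
>     /* Forward declarations of existing OVS types, to keep the sketch standalone. */
>     struct netdev;
>     struct match;
>     struct nlattr;
> 
>     /* Hypothetical acceleration modes; none of these names exist in OVS today. */
>     enum hw_accel_mode {
>         HW_ACCEL_NONE,
>         HW_ACCEL_PARTIAL_MARK_RSS,   /* mark + RSS partial offload, as in this series */
>         HW_ACCEL_PARTIAL_ZERO_COPY,  /* NIC delivers packets straight to VM queues */
>         HW_ACCEL_FULL,
>     };
> 
>     /* Hypothetical per-mode hooks, picked per device (e.g. at port init). */
>     struct hw_accel_ops {
>         /* Can this device/mode handle the given match + actions? */
>         int (*validate)(struct netdev *, const struct match *,
>                         const struct nlattr *actions, size_t actions_len);
>         /* Translate the OVS match/actions into device-specific programming. */
>         int (*flow_put)(struct netdev *, const struct match *,
>                         const struct nlattr *actions, size_t actions_len);
>         int (*flow_del)(struct netdev *, const struct match *);
>     };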
> 
> Considering there is no DPDK support available for handling most of the above
> points, I don't mind pushing this patch series into OVS mainline. However, I
> would expect a refactoring of that code in the future to make it generic enough
> to support different use cases and modes. With a generic approach it is possible
> to support various partial and full acceleration modes based on the available
> hardware.
> 
> Regards
> _Sugesh
> 
> > -----Original Message-----
> > From: Yuanhan Liu [mailto:yliu at fridaylinux.org]
> > Sent: Monday, December 25, 2017 6:41 AM
> > To: Chandran, Sugesh <sugesh.chandran at intel.com>
> > Cc: Finn Christensen <fc at napatech.com>; dev at openvswitch.org; Darrell
> > Ball <dball at vmware.com>; Simon Horman <simon.horman at netronome.com>;
> > Stokes, Ian <ian.stokes at intel.com>
> > Subject: Re: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow
> >
> > Hi Sugesh,
> >
> > On Fri, Dec 22, 2017 at 06:02:16PM +0000, Chandran, Sugesh wrote:
> > > Hi Yuanhan,
> > >
> > > Thank you for working on this.
> > > I will be reviewing this patch when I am back office after my
> > > Xmas/New Year
> > vacation.
> >
> > Thank you for that. And enjoy your new year vacation!
> >
> > > I do have some comments as well as questions on how this proposal is
> > > going to work with different hardware, full acceleration, and other partial
> > > offload options. Let's have that conversation after the holidays and see
> > > what's the best way to get it to work for most of the cases.
> >
> > I think most of this patchset is generic enough, including translating
> > "match" to rte flow patterns, doing offload in another thread.
> >
> > What is not really generic enough is the flow mark part. Firstly, it's
> > a partial offload. Secondly, it may not work well with Intel NICs (and
> > other NICs) due to the MARK + RSS action combination: some NICs might
> > only support one of the two actions. Also, the code path for full
> > acceleration should be different. But I think none of these will be a
> > real problem. Say, for full acceleration, there would be no MARK
> > attached to the mbuf, so the flow mark code path is bypassed entirely,
> > while the flow offload installation path is re-used.
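> >
> > For reference, here is a rough sketch of the MARK + RSS action pair as it
> > could be built with rte_flow. It uses the rte_flow_action_rss layout from
> > later DPDK releases (the exact struct fields differ in DPDK 17.11), and
> > flow_mark, port_id, attr and patterns are assumed to be set up elsewhere:
> >
> >     #include <rte_flow.h>
> >
> >     uint16_t queues[] = { 0, 1 };   /* all Rx queues of the port */
> >     struct rte_flow_action_mark mark = { .id = flow_mark };
> >     struct rte_flow_action_rss rss = {
> >         .func = RTE_ETH_HASH_FUNCTION_DEFAULT,
> >         .types = 0,                 /* 0: keep the port's configured RSS types */
> >         .queue_num = 2,
> >         .queue = queues,
> >     };
> >     struct rte_flow_action actions[] = {
> >         { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
> >         { .type = RTE_FLOW_ACTION_TYPE_RSS,  .conf = &rss  },
> >         { .type = RTE_FLOW_ACTION_TYPE_END },
> >     };
> >     struct rte_flow_error error;
> >     struct rte_flow *flow = rte_flow_create(port_id, &attr, patterns,
> >                                             actions, &error);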
> >
> > Probably what we need to do is determine when to do a full offload or a
> > partial offload (say, the flow mark) somewhere (i.e. netdev-dpdk.c).
> > Based on that decision, we choose the proper flow actions for the
> > different kinds of offload.
> >
> > I think we will know that when we are dealing with full offload later.
> > And I also think this patchset is a good start for the OVS-DPDK flow offload.
> >
> > 	--yliu
> > >
> > >
> > >
> > >
> > >
> > > Regards
> > > _Sugesh
> > >
> > >
> > > > -----Original Message-----
> > > > From: Finn Christensen [mailto:fc at napatech.com]
> > > > Sent: Friday, December 22, 2017 11:29 AM
> > > > To: Yuanhan Liu <yliu at fridaylinux.org>; dev at openvswitch.org
> > > > Cc: Darrell Ball <dball at vmware.com>; Chandran, Sugesh
> > > > <sugesh.chandran at intel.com>; Simon Horman
> > > > <simon.horman at netronome.com>; Stokes, Ian <ian.stokes at intel.com>
> > > > Subject: RE: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow
> > > >
> > > > Hi,
> > > >
> > > > The offload thread is a great addition. Furthermore, the RSS action
> > > > usage for target queues now works well with our NICs; I just had to do
> > > > some additional driver work to make it all work.
> > > >
> > > > The support for RSS removed the problem with rte_flow queue
> > > > selection, but RSS in general added an additional challenge, as now
> > > > multiple PMDs may request the same flow offload (as Yuanhan
> > > > pointed out). The solution has worked without issues for me. I
> > > > have run tests with 1 and 2 queues in a PV and PVP setup going over virtio.
> > > > Yuanhan mentioned tests in a P2P setup with high throughput
> > > > acceleration. I have focused on a PVP scenario, which, in addition,
> > > > crosses virtio and, most importantly, has acceleration in the Rx
> > > > direction only (as seen from the VM). It still gives fairly good
> > > > performance improvements.
> > > >
> > > > Here are my initial test results.
> > > >
> > > > Percentage improvements:
> > > > 1 Rx/Tx queue:
> > > > 1    megaflow  - 10K flows             PV: 73%
> > > > 10   megaflows - 100K flows  PVP: 50%  PV: 56%
> > > > 100  megaflows - 1M flows    PVP: 42%  PV: 49%
> > > > 1000 megaflows - 10M flows   PVP: 40%  PV: 42%
> > > >
> > > > With RSS: 2 Rx/Tx queue pairs:
> > > > 1    megaflow  - 10K flows             PV: 55%
> > > > 10   megaflows - 100K flows  PVP: 36%  PV: 38%
> > > > 100  megaflows - 1M flows    PVP: 27%  PV: 24%
> > > > 1000 megaflows - 10M flows   PVP: 21%  PV: 24%
> > > >
> > > > PVP for 1 megaflow offload is omitted because my VM-app imposed a
> > > > bottleneck.
> > > >
> > > > Regards,
> > > > Finn
> > > >
> > > >     -----Original Message-----
> > > >     From: Yuanhan Liu [mailto:yliu at fridaylinux.org]
> > > >     Sent: 20. december 2017 15:45
> > > >     To: dev at openvswitch.org
> > > >     Cc: Finn Christensen <fc at napatech.com>; Darrell Ball
> > > >     <dball at vmware.com>; Chandran Sugesh <sugesh.chandran at intel.com>;
> > > >     Simon Horman <simon.horman at netronome.com>; Ian Stokes
> > > >     <ian.stokes at intel.com>; Yuanhan Liu <yliu at fridaylinux.org>
> > > >     Subject: [PATCH v5 0/5] OVS-DPDK flow offload with rte_flow
> > > >
> > > >     Hi,
> > > >
> > > >     Here is a joint work from Mellanox and Napatech, to enable the flow hw
> > > >     offload with the DPDK generic flow interface (rte_flow).
> > > >
> > > >     The basic idea is to associate the flow with a mark id (a uint32_t number).
> > > >     Later, we then get the flow directly from the mark id, which can bypass
> > > >     some heavy CPU operations, including but not limited to miniflow extract,
> > > >     emc lookup, dpcls lookup, etc.
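> > > >
> > > >     (As a rough illustration of the fast path this enables; the mbuf fields
> > > >     are per DPDK 17.11, while the lookup helper name and the pmd/flow
> > > >     variables are only placeholders:)
> > > >
> > > >         if (mbuf->ol_flags & PKT_RX_FDIR_ID) {
> > > >             uint32_t mark = mbuf->hash.fdir.hi;  /* value set by the MARK action */
> > > >             /* Map the mark straight back to the megaflow, skipping
> > > >              * miniflow extract, EMC and dpcls lookups. */
> > > >             flow = mark_to_flow_find(pmd, mark);
> > > >         }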
> > > >
> > > >     The association is done with CMAP in patch 1. The CPU workload bypassing
> > > >     is done in patch 2. The flow offload is done in patch 3, which mainly does
> > > >     two things:
> > > >
> > > >     - translate the ovs match to DPDK rte flow patterns
> > > >     - bind those patterns with a RSS + MARK action.
> > > >
> > > >     Patch 5 makes the offload work happen in another thread, for leaving the
> > > >     datapath as light as possible.
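> > > >
> > > >     (Roughly, the hand-off looks like the sketch below; the item, list,
> > > >     mutex and condvar names are illustrative rather than the exact ones in
> > > >     the patch. The PMD thread only enqueues a request, and the offload
> > > >     thread drains the queue and calls into rte_flow:)
> > > >
> > > >         /* PMD side: queue the request and wake the offload thread. */
> > > >         ovs_mutex_lock(&offload_mutex);
> > > >         ovs_list_push_back(&offload_list, &item->node);
> > > >         xpthread_cond_signal(&offload_cond);
> > > >         ovs_mutex_unlock(&offload_mutex);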
> > > >
> > > >     A PHY-PHY forwarding with 1000 mega flows (udp,tp_src=1000-1999) and 1
> > > >     million streams (tp_src=1000-1999, tp_dst=2000-2999) shows more than a
> > > >     260% performance boost.
> > > >
> > > >     Note that it's disabled by default, which can be enabled by:
> > > >
> > > >         $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
> > > >
> > > >
> > > >     v5: - fixed an issue that it took too long if we do flow add/remove
> > > >           repeatedly.
> > > >         - removed an unused mutex lock
> > > >         - turned most of the log level to DBG
> > > >         - rebased on top of the latest code
> > > >
> > > >     v4: - use RSS action instead of QUEUE action with MARK
> > > >         - make it work with multiple queue (see patch 1)
> > > >         - rebased on top of latest code
> > > >
> > > >     v3: - The mark and id association is done with array instead of CMAP.
> > > >         - Added a thread to do hw offload operations
> > > >         - Removed macros completely
> > > >         - dropped the patch to set FDIR_CONF, which is a workaround for some
> > > >           Intel NICs.
> > > >         - Added a debug patch to show all flow patterns we have created.
> > > >         - Misc fixes
> > > >
> > > >     v2: - workaround the queue action issue
> > > >         - fixed the tcp_flags being skipped issue, which also fixed the
> > > >           build warnings
> > > >         - fixed L2 patterns for Intel NICs
> > > >         - Converted some macros to functions
> > > >         - did not hardcode the max number of flow/action
> > > >         - rebased on top of the latest code
> > > >
> > > >     Thanks.
> > > >
> > > >         --yliu
> > > >
> > > >
> > > >     ---
> > > >     Finn Christensen (1):
> > > >       netdev-dpdk: implement flow offload with rte flow
> > > >
> > > >     Yuanhan Liu (4):
> > > >       dpif-netdev: associate flow with a mark id
> > > >       dpif-netdev: retrieve flow directly from the flow mark
> > > >       netdev-dpdk: add debug for rte flow patterns
> > > >       dpif-netdev: do hw flow offload in a thread
> > > >
> > > >      lib/dp-packet.h   |  13 +
> > > >      lib/dpif-netdev.c | 473 ++++++++++++++++++++++++++++++++++-
> > > >      lib/flow.c        | 155 +++++++++---
> > > >      lib/flow.h        |   1 +
> > > >      lib/netdev-dpdk.c | 732 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
> > > >      lib/netdev.h      |   6 +
> > > >      6 files changed, 1341 insertions(+), 39 deletions(-)
> > > >
> > > >     --
> > > >     2.7.4
> > > >

