[ovs-dev] FW: OVS-DPDK full offload RFC proposal discussion

Chandran, Sugesh sugesh.chandran at intel.com
Mon Jan 29 09:11:38 UTC 2018


Initial discussion on the OVS-DPDK RFC approach.
We will be discussing the following points in the meeting today.


Regards
_Sugesh

From: Finn Christensen [mailto:fc at napatech.com]
Sent: Friday, January 26, 2018 1:41 PM
To: Chandran, Sugesh <sugesh.chandran at intel.com>; Loftus, Ciara <ciara.loftus at intel.com>; Doherty, Declan <declan.doherty at intel.com>
Subject: RE: OVS-DPDK full offload RFC proposal discussion

Thanks Sugesh,

See my comments below.

I'll be on the conf call on Monday.

Regards,
Finn


From: Chandran, Sugesh [mailto:sugesh.chandran at intel.com]
Sent: 25 January 2018 21:33
To: Finn Christensen <fc at napatech.com>; Loftus, Ciara <ciara.loftus at intel.com>; Doherty, Declan <declan.doherty at intel.com>
Subject: RE: OVS-DPDK full offload RFC proposal discussion

Hi Finn,

Once again thank you for putting these up.
Please find my comments inline below.


Regards
_Sugesh

From: Finn Christensen [mailto:fc at napatech.com]
Sent: Tuesday, January 23, 2018 11:42 AM
To: Chandran, Sugesh <sugesh.chandran at intel.com>; Loftus, Ciara <ciara.loftus at intel.com>; Doherty, Declan <declan.doherty at intel.com>
Subject: OVS-DPDK full offload RFC proposal discussion

Hi Sugesh,

My apologies for not sending this earlier.
As discussed in the meeting, here is a semi-detailed description of how we see the next step towards OVS-DPDK hw full offload.
Please add all the Intel people who want to participate in this email thread.


Proposal: OVS changes for full offload, as an addition to the partial offload currently proposed.

Generally, let the hw-offloaded flow match+action functionality be a slave of the megaflow cache. Offload a flow seamlessly when applicable (when all of its actions are within the range of supported, implemented actions). Otherwise fall back to partial offload, and if that also fails, use normal SW switching.
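To make the intended fallback order concrete, here is a minimal C sketch. The netdev_try_full_offload()/netdev_try_partial_offload() helpers and the request struct are hypothetical names used only for illustration; they are not functions from OVS or this patchset.

struct flow_offload_request;                 /* placeholder for the flow match + actions */

/* Hypothetical helpers, named here only for illustration. */
extern int netdev_try_full_offload(struct flow_offload_request *);
extern int netdev_try_partial_offload(struct flow_offload_request *);

/* Try full offload first; fall back to partial offload (MARK + RSS),
 * and finally leave the flow to normal SW switching. */
static void
try_hw_offload(struct flow_offload_request *req)
{
    if (netdev_try_full_offload(req) == 0) {
        return;                              /* match + all actions handled in HW */
    }
    if (netdev_try_partial_offload(req) == 0) {
        return;                              /* HW classification only (MARK + RSS) */
    }
    /* No offload possible: the flow stays in the userspace megaflow cache. */
}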

1)      Handle OUTPUT action:
Map odp_port_no to a DPDK port_id, so that an OVS_ACTION_ATTR_OUTPUT can be converted into a port_id known to a netdev_dpdk device. If the port is not found in dpdk_list, or the specific DPDK device does not support hw-offloading, do partial offload instead (use no actions besides the MARK and RSS added by partial offload).
If multiple OUTPUT actions are specified (e.g. in case of flooding), do not use full offload.
a.      Register the ODP port number in netdev_dpdk on DPDK_DEV_ETH instances (put odp_port_no in the netdev_dpdk structure).
b.      In the netdev_dpdk_add_rte_flow_offload() function, catch OVS_ACTION_ATTR_OUTPUT and find the dpdk_dev in dpdk_list that matches its odp_port_no. Then set up a RTE_FLOW_ACTION_TYPE_ETHDEV_PORT containing the DPDK port_id of the target port (a sketch follows below).
[Sugesh] Yes, that makes sense. Do you think the representor ports can also be defined as normal DPDK ports? We are experiencing some difficulties when trying to overload the same DPDK port for representor ports/accelerated ports. More comments below.
[Finn] Yes. But you are right, if you need special OVSDB settings to configure a vport, then you will need a new DPDK type. However, initially, we do not necessarily need this. I do not see this as a huge issue, and if we need it I think we can add that also to the patchset.
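A rough sketch of 1)b, assuming the proposed RTE_FLOW_ACTION_TYPE_ETHDEV_PORT action and a conf structure for it; neither exists in upstream DPDK. The dpdk_dev_by_odp_port(), netdev_dpdk_port_id() and netdev_dpdk_hw_offload_capable() helpers stand in for walking dpdk_list and are likewise hypothetical.

#include <stdbool.h>
#include <stdint.h>
#include <rte_flow.h>

typedef uint32_t odp_port_t;                 /* simplified; OVS has its own definition */

/* Proposed action type and conf structure (assumption, not upstream DPDK). */
#define RTE_FLOW_ACTION_TYPE_ETHDEV_PORT ((enum rte_flow_action_type) 0x1000)
struct rte_flow_action_ethdev_port {
    uint16_t port_id;                        /* DPDK ethdev port id of the target port */
};

struct netdev_dpdk;                          /* opaque here */
extern struct netdev_dpdk *dpdk_dev_by_odp_port(odp_port_t);   /* hypothetical dpdk_list lookup */
extern uint16_t netdev_dpdk_port_id(struct netdev_dpdk *);
extern bool netdev_dpdk_hw_offload_capable(struct netdev_dpdk *);

/* Translate an OVS_ACTION_ATTR_OUTPUT target into the proposed rte_flow
 * output action.  Returns false when full offload is not possible, in
 * which case the caller falls back to partial offload (MARK + RSS). */
static bool
set_output_action(odp_port_t out_port, struct rte_flow_action *action,
                  struct rte_flow_action_ethdev_port *conf)
{
    struct netdev_dpdk *dev = dpdk_dev_by_odp_port(out_port);

    if (!dev || !netdev_dpdk_hw_offload_capable(dev)) {
        return false;
    }
    conf->port_id = netdev_dpdk_port_id(dev);
    action->type = RTE_FLOW_ACTION_TYPE_ETHDEV_PORT;
    action->conf = conf;
    return true;
}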

2)      Handle statistics:
Keep separate registration/mapping of partially offloaded flows and fully offloaded flows, and query statistics for the fully offloaded flows at a specific interval, updating the userspace datapath megaflow cache with these statistics. This is done using rte_flow_query and includes the packet count (hits), byte count and seen tcp_flags.
a.      When a fully offloaded flow has been successfully added, add that rte_flow to a separate hw-offload map containing only fully offloaded flows.
[Sugesh] Yes. We are also following the same method.
b.      Add RTE_FLOW_ACTION_TYPE_COUNT to the fully offloaded flows, so that statistics can be retrieved later for that rte_flow.
[Sugesh] Makes sense.
c.      Add a timed task to the hw-offload thread, so that all fully offloaded flows can be stat-queried using the rte_flow_query() function, with an interval of maybe 1 or 2 seconds. Call dp_netdev_flow_used() with the result.
[Sugesh] OK, so we might need to use the stats in the revalidator to expire the flows?
Just a note: some hardware may be able to evict flows by itself after the idle timeout. The rte_flow_query logic should account for that as well when polling the stats.
[Finn] Yes, good point. Let the flow query also indicate if a flow has been canceled, and remove it from the flow map accordingly.

d.      tcp_flags should also be retrieved by rte_flow_query(). This will need an extension to the current rte_flow_query_count structure.
[Sugesh] Ok
e.      Use the flow_get function in the DPDK_FLOW_OFFLOAD_API to implement the rte_flow_query call and convert the result to dpif_flow_stats (a sketch follows after this list).
[Sugesh] Yes.
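A minimal sketch of 2)b-e, assuming a recent rte_flow_query() signature (it has changed between DPDK releases) and a local stand-in for OVS's struct dpif_flow_stats; the tcp_flags part of the count result is the extension proposed in 2)d and does not exist in stock DPDK, so it is left as a comment.

#include <stdint.h>
#include <string.h>
#include <rte_flow.h>

/* Local stand-in mirroring the fields of OVS's struct dpif_flow_stats. */
struct dpif_flow_stats {
    uint64_t n_packets;
    uint64_t n_bytes;
    long long int used;                      /* last-used time, in ms */
    uint16_t tcp_flags;
};

/* Query the COUNT action of one fully offloaded rte_flow and convert the
 * result to dpif_flow_stats.  A non-zero return tells the caller that the
 * query failed, e.g. because the HW evicted the flow after its idle
 * timeout, so it should be dropped from the hw-offload map. */
static int
offloaded_flow_stats(uint16_t dpdk_port_id, struct rte_flow *flow,
                     long long int now_ms, struct dpif_flow_stats *stats)
{
    struct rte_flow_query_count count;
    const struct rte_flow_action action = {
        .type = RTE_FLOW_ACTION_TYPE_COUNT,
    };
    struct rte_flow_error error;

    memset(&count, 0, sizeof count);
    if (rte_flow_query(dpdk_port_id, flow, &action, &count, &error)) {
        return -1;
    }

    memset(stats, 0, sizeof *stats);
    stats->n_packets = count.hits_set ? count.hits : 0;
    stats->n_bytes = count.bytes_set ? count.bytes : 0;
    stats->used = stats->n_packets ? now_ms : 0;
    /* stats->tcp_flags = count.tcp_flags;      proposed extension only */
    return 0;
}

The timed task in 2)c would call something like this for every entry in the full-offload map every 1-2 seconds and feed the result into dp_netdev_flow_used().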

3)      OVS with hw-offloaded virtual ports:
NIC virtual ports (VFs or PMD local queue(s)) should not be specifically known by OVS, other than through a "normal" dpdk port (type=dpdk). This port is then a representor port, like any other phy port.
I know Intel has made a representor port proposal to DPDK, but IMO how a PMD registers virtual ports should be a DPDK PMD implementation matter; OVS should therefore be ignorant of it, and it should not be part of this RFC proposal.
[Sugesh] This is a bit tricky, because it is very likely that the type/mode of virtual port depends solely on the hardware; it can be VFs, vhost ports or anything else. Representing these ports as normal DPDK ports in OVS is a bit confusing.
We also need to look at how these ports are managed from an orchestrator, and how that differs from a normal port.
[Finn] Well, the orchestrator should definitely know that these ports are virtual hw ports, so that it can orchestrate these resources accordingly, but does OVS need to know? If so, that's fine with me. We can discuss this issue further on the conf call.

This proposal needs a few extensions to the DPDK rte_flow API.
a.      An RTE_FLOW_ACTION_TYPE_ETHDEV_PORT action - to specify a target port (as proposed earlier by Intel on the DPDK ML).
[Sugesh] There is already work going on for an OUTPUT action.
b.      Extend rte_flow_query to also return tcp_flags and potentially handle bulk requests (a sketch of the extended count structure follows below).
[Sugesh] We need to work on this. We also need to think about how to handle it when the hardware cannot support the query.
[Finn] If tcp_flags are not in the stats, then the question is whether it will still be valid to fully offload. Does anybody have a comment on that? Should that be known to an orchestrator?
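Purely as illustration of what the extension in b) could look like; the struct and field names below are hypothetical and nothing like this exists in upstream rte_flow.h.

#include <stdint.h>
#include <rte_flow.h>

/* Hypothetical extended count query result: the existing hits/bytes
 * counters plus the OR of all TCP flags the hardware has seen for the flow. */
struct rte_flow_query_count_ext {
    struct rte_flow_query_count count;       /* existing hits/bytes result */
    uint32_t tcp_flags_set:1;                /* validity bit, mirroring hits_set/bytes_set */
    uint16_t tcp_flags;                      /* OR of TCP flags observed in HW */
};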

Furthermore, this is based on a trial-and-error approach without specific device capability knowledge, mostly because no device capability feature is available yet. I think it can be added later as a separate issue, not necessarily hard-bound to the full offload functionality.
[Sugesh] Totally agree with that.

I thought this was a good level of detail to start out with.
[Sugesh] I am setting up a call for the coming Monday (the meeting invite follows). I hope you can make it, so that we can discuss these points in detail.


Thanks,
Finn

Disclaimer: This email and any files transmitted with it may contain confidential information intended for the addressee(s) only. The information is not to be surrendered or copied to unauthorized persons. If you have received this communication in error, please notify the sender immediately and delete this e-mail from your system.

