[ovs-dev] [RFC V2] netdev-rte-offloads: HW offload virtio-forwarder

Ilya Maximets i.maximets at samsung.com
Wed May 15 11:24:46 UTC 2019


On 06.05.2019 13:43, Roni Bar Yanai wrote:
> Background
> ==========
> The OVS HW offload solution consists of forwarding and control. The HW implements
> an embedded switch that connects SR-IOV VFs and forwards packets according to
> dynamically configured HW rules (packets can be altered by HW rules). Packets
> that have no forwarding rule, called exception packets, are sent to the control
> path (OVS SW). OVS SW handles the exception packet (just like in SW-only
> mode), namely performing an upcall if no DP flow exists. OVS SW uses a port
> representor for representing the VF. See:
> 
>     (https://doc.dpdk.org/guides/prog_guide/switch_representation.html)
> 
> Packets sent from the VF get to the port representor, and packets
> sent to the port representor get to the VF. Once OVS SW generates a
> data-plane flow, a new HW rule is configured in the embedded switch.
> Subsequent packets on the same flow are then directed by HW only: they
> go directly from VF (or uplink) to VF without reaching SW.
> 
> Some HW architectures support only the SR-IOV HW offload architecture
> presented above. The SR-IOV architecture requires that the guest install a
> driver that is specific to the underlying HW. A HW-specific driver
> introduces two main problems for virtualization:
> 
> 1. It breaks virtualization in some sense (the VM is aware of the HW).
> 2. Live migration is supported less naturally.
> 
> Using a virtio interface solves both problems (at the expense of some loss in
> functionality and performance). However, for some HW offloads, working directly
> with virtio cannot be supported.
> 
> HW offload for virtio architecture
> ====================================
> We suggest an architecture for HW offload of the virtio interface that
> adds another component, called virtio-forwarder, on top of the current
> architecture. The forwarder is a software or hardware (for vDPA) component that
> connects the VF with a matching virtio interface, as shown below:
> 
>        | PR1              -----------
>      --|--               |           |
>     |     |              | forwarder |
>     | OVS |              |           |
>     |     |              -------------         ---------
>      -----                | VF1   | virtio1   |         |
>        | uplink           |       |           |  guest  |
>   -----------------       |       \ ----------|         |
> |                  |----- /                    ---------
> |      e-switch    |
> |                  |
>  ------------------
> 
> The forwarder's role is to function as a wire between the VF and the virtio
> interface: it reads packets from the rx-queue and sends them to the peer
> tx-queue (and vice versa), as in the sketch after this paragraph. Since the
> function in this case is reduced to forwarding packets without inspecting
> them, a single core can push a very high number of PPS (near DPDK forwarding
> performance).
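> 
> To make the data path concrete, here is a minimal sketch of one forwarder
> iteration using the DPDK ethdev and vhost APIs. It is illustrative only:
> vf_port, vf_queue, vhost_vid and mp are assumed names standing in for the
> VF port/queue, the vhost-user device id and an mbuf pool.
> 
>     #include <rte_ethdev.h>
>     #include <rte_mbuf.h>
>     #include <rte_vhost.h>
> 
>     #define FWD_BURST 32
>     /* Virtqueue indices as seen from the host side. */
>     #define GUEST_RXQ 0   /* host enqueues here  */
>     #define GUEST_TXQ 1   /* host dequeues here  */
> 
>     static void
>     forwarder_iteration(uint16_t vf_port, uint16_t vf_queue, int vhost_vid,
>                         struct rte_mempool *mp)
>     {
>         struct rte_mbuf *pkts[FWD_BURST];
>         uint16_t nb_rx, nb_tx;
> 
>         /* VF -> virtio: move the burst without touching the packets. */
>         nb_rx = rte_eth_rx_burst(vf_port, vf_queue, pkts, FWD_BURST);
>         nb_tx = nb_rx ? rte_vhost_enqueue_burst(vhost_vid, GUEST_RXQ,
>                                                 pkts, nb_rx) : 0;
>         while (nb_tx < nb_rx) {
>             rte_pktmbuf_free(pkts[nb_tx++]);   /* drop what did not fit */
>         }
> 
>         /* virtio -> VF. */
>         nb_rx = rte_vhost_dequeue_burst(vhost_vid, GUEST_TXQ, mp,
>                                         pkts, FWD_BURST);
>         nb_tx = nb_rx ? rte_eth_tx_burst(vf_port, vf_queue, pkts, nb_rx) : 0;
>         while (nb_tx < nb_rx) {
>             rte_pktmbuf_free(pkts[nb_tx++]);
>         }
>     }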
> 
> There are 3 sub use cases.
> 
> OVS-dpdk
> --------
> This is the basic use case that was just described. In this use case we have
> a port representor, a VF and a virtio interface (forwarding should be done
> between the VF and the virtio interface).
> 
> vDPA
> ----
> vDPA enables the HW to place packets directly into the VM's virtio queues. In
> this case the forwarding is done in HW, but it requires some SW to handle the
> control: configuring the queues and adjusting the configuration according to
> vhost updates, as sketched below.
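> 
> On the control side, a minimal sketch of how the SW could subscribe to vhost
> updates with the DPDK vhost library is shown here. The callback bodies and the
> actual HW queue programming are vendor specific and only hinted at in
> comments; all non-DPDK names are illustrative.
> 
>     #include <rte_vhost.h>
> 
>     /* Illustrative callbacks: on these vhost events the control path would
>      * (re)program the HW virtqueues; the vendor-specific part is omitted. */
>     static int
>     cb_new_device(int vid)
>     {
>         (void)vid;              /* bind the HW queues to this vhost device */
>         return 0;
>     }
> 
>     static void
>     cb_destroy_device(int vid)
>     {
>         (void)vid;              /* tear the HW queues down */
>     }
> 
>     static int
>     cb_vring_state_changed(int vid, uint16_t queue_id, int enable)
>     {
>         (void)vid; (void)queue_id; (void)enable;  /* enable/disable a HW queue */
>         return 0;
>     }
> 
>     static const struct vhost_device_ops vdpa_ctrl_ops = {
>         .new_device = cb_new_device,
>         .destroy_device = cb_destroy_device,
>         .vring_state_changed = cb_vring_state_changed,
>     };
> 
>     int
>     vdpa_control_setup(const char *sock_path)   /* the vhost socket path */
>     {
>         if (rte_vhost_driver_register(sock_path, 0)
>             || rte_vhost_driver_callback_register(sock_path, &vdpa_ctrl_ops)) {
>             return -1;
>         }
>         return rte_vhost_driver_start(sock_path);
>     }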
> 
> OVS-kernel
> ----------
> OVS-kernel HW offload has the same limitation. However, when it comes to just
> forwarding packets, DPDK has a great performance advantage over the kernel.
> It would be good to also add this use case, weighing the implementation
> effort against the performance gain.
> 
> Why not just use a standalone application (DPDK testpmd)?
> ----------------------------------------------------------
> 1. When HW offload is running, we expect most of the traffic to be
>    handled by HW, so the PMD thread will be mostly idle. We don't want to burn
>    another core on the forwarding.
> 
> 2. A standalone application is another application with all the additional
>    overheads: startup, configuration, monitoring, etc. Besides, it is another
>    project, which means another dependency.
> 
> 3. We reuse the already existing OVS load balancing and NUMA awareness.
>    Forwarding shows exactly the same symptoms of an unbalanced workload as a
>    regular rx-queue.
> 
> 4. We might need some prioritization: exception packets are more
>    important than forwarded ones. Being in the same domain makes it possible
>    to add such prioritization while keeping the CPU requirement to a minimum.
> 
> OVS virtio-forwarder
> ====================
> The suggestion is to put the wire and control functionality in the hw-offload
> module. Looking at the forwarder functionality, we have control and data.
> The control is the configuration: virtio/VF matching (and type), queue
> configuration (defined when the VM is initialized, and can change later), etc.
> The data is the actual forwarding, which needs a context to run in. As
> explained, forwarding is reduced to a simple rx-burst and tx-burst where
> everything can be predefined after the configuration.
> 
> We add the forwarding layer to the hw-offload module and configure it
> separately. For example:
> 
> ovs-appctl hw-offload/set-fw
>            vhost-server-path=/tmp/dpdkvhostvm1:rxq=2 dpdk-devargs=0000:08:00.0
>            type=pr:[1]
> 
> Once configured, we attach the context according to the user configuration. In
> the basic use case, we hook into the port representor scheduling; this way we
> can use the OVS scheduler. When the port representor rx-queue is polled, we
> forward the packets for it and account the cycles on the port representor
> (rx-queue), so OVS can rebalance if needed. This way we use the PMD thread's
> empty cycles; a rough sketch follows this paragraph.
> If no port representor is added, we hook into the scheduler as a generic call:
> every scheduling cycle we call the HW virtio-forwarder, limiting the quota to
> avoid starvation of rx-queues. Although we cannot use the OVS scheduling
> features in this case, we still reuse most of the forwarder code and solve
> the problem for kernel-OVS with minor additional effort.
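> 
> The hook itself could look roughly like the sketch below. Everything here is
> hypothetical, not existing OVS symbols: fwd_ctx is an assumed per-representor
> forwarder context, forwarder_iteration() is the loop body sketched earlier,
> and representor_rxq_recv() stands for the existing receive path of the
> representor port.
> 
>     #include <rte_mbuf.h>
> 
>     struct fwd_ctx {                 /* per-representor forwarder context */
>         uint16_t vf_port, vf_queue;  /* VF DPDK port/queue being wired    */
>         int vhost_vid;               /* matching vhost-user device id     */
>         struct rte_mempool *mp;      /* pool for vhost dequeue            */
>     };
> 
>     /* Sketched earlier / existing receive path (placeholder names). */
>     void forwarder_iteration(uint16_t, uint16_t, int, struct rte_mempool *);
>     uint16_t representor_rxq_recv(void *rxq, struct rte_mbuf **out, uint16_t n);
> 
>     static uint16_t
>     pr_rxq_recv_hook(void *rxq, struct fwd_ctx *fwd,
>                      struct rte_mbuf **out, uint16_t n)
>     {
>         /* Run the attached forwarder first; its cycles are charged to this
>          * rx-queue, so the PMD load balancer sees the real load and can
>          * move the queue to another PMD thread if needed. */
>         if (fwd) {
>             forwarder_iteration(fwd->vf_port, fwd->vf_queue,
>                                 fwd->vhost_vid, fwd->mp);
>         }
> 
>         /* Then receive exception packets from the representor as usual. */
>         return representor_rxq_recv(rxq, out, n);
>     }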
> 
> From the OVS perspective this is HW offload functionality; no ports are added
> to OVS. The functionality and statistics can only be accessed through the
> hw-offload module, and only a minimal code change is needed in OVS, mainly
> for hooking the calling context.

Can we just create a new netdev type like dpdkvdpa?

Let me explain.
IIUC, in order to make vhost acceleration work we need 3 components:

1. vhost-user socket
2. vdpa device: real vdpa device or a SmartNIC VF.
3. representor of "vdpa device".

So, let's create a new OVS netdev 'dpdkvdpa'. It will have 3 mandatory
arguments:

1. vhost-server-path (ex.: /tmp/dpdkvhostvm1)
2. vhost-accelerator-devargs (ex.: "<vdpa pci id>,vdpa=1", or "<VF pci id>")
3. dpdk-devargs (ex.: "<vdpa pci id>,representor=[id]", or "<PF pci id>,representor=[id]")

And one optional config "accelerator-type=(hw|sw)".

In the case of a real vDPA device:
----------------------------------
ovs-vsctl add-port vdpa0 br0 -- \
      set interface vdpa0 type=dpdkvdpa \
                          vhost-server-path=/tmp/dpdkvhostvm1 \
                          vhost-accelerator-devargs="<vdpa pci id>,vdpa=1" \
                          dpdk-devargs="<vdpa pci id>,representor=[id]" \
                          accelerator-type=hw

On this command OVS will create a new netdev that will:
1. Register vhost-user-client device with rte_vhost_driver_register().
2. Attach VDPA device to vhost-user with rte_vhost_driver_attach_vdpa_device().
3. Open and configure representor dpdk port.

netdev_rxq_recv() will just receive packets from the representor.
netdev_send() will just send packets to the representor.
HW offloading will install flows to the representor.
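
Roughly, the construction for this case could look like the sketch below
(DPDK 19.x-era vhost API). It is only an illustration: sock_path, vdpa_did and
repr_port stand for the vhost-server-path, the vdpa device id resolved from
vhost-accelerator-devargs and the representor's ethdev port id, and error
handling / queue setup are simplified.

    #include <rte_ethdev.h>
    #include <rte_vhost.h>

    static int
    dpdkvdpa_construct_hw(const char *sock_path, int vdpa_did,
                          uint16_t repr_port)
    {
        /* 1. Register the vhost-user-client device. */
        if (rte_vhost_driver_register(sock_path, RTE_VHOST_USER_CLIENT)) {
            return -1;
        }

        /* 2. Attach the vdpa device so the HW serves the virtqueues. */
        if (rte_vhost_driver_attach_vdpa_device(sock_path, vdpa_did)
            || rte_vhost_driver_start(sock_path)) {
            return -1;
        }

        /* 3. Open and configure the representor port, used only for
         *    exception traffic and for installing offloaded flows. */
        struct rte_eth_conf conf = {0};
        return rte_eth_dev_configure(repr_port, 1, 1, &conf) ? -1 : 0;
    }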


In the case of a VF pretending to be a vDPA device:
---------------------------------------------------
ovs-vsctl add-port vdpa0 br0 -- \
      set interface vdpa0 type=dpdkvdpa \
                          vhost-server-path=/tmp/dpdkvhostvm1 \
                          vhost-accelerator-devargs="<VF pci id>" \
                          dpdk-devargs="<VF pci id>,representor=[id]" \
                          accelerator-type=sw

On this command OVS will create a new netdev that will:
1. Register vhost-user-client device with rte_vhost_driver_register().
2. Open and configure VF dpdk port.
3. Open and configure representor dpdk port.

netdev_rxq_recv() will:
* Receive packets from VF and push to vhost-user.
* Receive packets from vhost-user and push to VF.
* Receive packets from representor and return them to the caller.

netdev_send() will just send packets to the representor.
HW offloading will install flows to the representor.
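
A minimal sketch of that rxq_recv logic is below, using plain mbuf bursts for
brevity; in OVS it would of course live in netdev-dpdk.c and work on dp_packet
batches. vf_port, repr_port, vhost_vid and mp are assumed handles for the VF
port, the representor port, the vhost device and an mbuf pool.

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_vhost.h>

    #define BURST 32

    static uint16_t
    dpdkvdpa_rxq_recv_sw(uint16_t vf_port, uint16_t repr_port, int vhost_vid,
                         struct rte_mempool *mp, struct rte_mbuf **out)
    {
        struct rte_mbuf *relay[BURST];
        uint16_t n, sent;

        /* VF -> vhost-user (virtqueue 0 is the guest rx queue). */
        n = rte_eth_rx_burst(vf_port, 0, relay, BURST);
        sent = n ? rte_vhost_enqueue_burst(vhost_vid, 0, relay, n) : 0;
        while (sent < n) {
            rte_pktmbuf_free(relay[sent++]);
        }

        /* vhost-user -> VF (virtqueue 1 is the guest tx queue). */
        n = rte_vhost_dequeue_burst(vhost_vid, 1, mp, relay, BURST);
        sent = n ? rte_eth_tx_burst(vf_port, 0, relay, n) : 0;
        while (sent < n) {
            rte_pktmbuf_free(relay[sent++]);
        }

        /* Exception path: representor packets go back to the caller,
         * i.e. into the normal OVS datapath. */
        return rte_eth_rx_burst(repr_port, 0, out, BURST);
    }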


The above approach will allow us to not have any hooks/dirty hacks/separate
appctls. Also, there will be no dangling resources in OVS that need separate
management. All the code will be localized inside netdev-dpdk.c and will
hopefully mostly reuse existing code.

What do you think?

Best regards, Ilya Maximets.

