[ovs-dev] [PATCH] RFC: netdev-afxdp: Support for XDP metadata HW hints.

William Tu u9012063 at gmail.com
Thu Mar 4 18:27:05 UTC 2021


One big problem with netdev-afxdp is that there is no metadata support
from the hardware at all.  For example, OVS netdev-afxdp has to compute
the RSS hash and TCP checksum in software, resulting in high
performance overhead.

A generic metadata type for XDP frames using BTF has been proposed[1],
and sample implementations exist[2][3].  This patch experiments with
enabling XDP metadata, also called HW hints, and shows the potential
performance improvement.  The patch uses only the rxhash value provided
by the HW, avoiding the hash calculation in lib/dpif-netdev.c:
    if (!dp_packet_rss_valid(execute->packet)) {
        dp_packet_set_rss_hash(execute->packet,
                               flow_hash_5tuple(execute->flow, 0));
    }
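
For illustration, below is a minimal, self-contained sketch (not part
of the patch) of the layout convention the RFC drivers[3] use: the
metadata struct is written immediately in front of the packet data, so
the consumer reads it back at a negative offset.  The struct fields
match the driver samples; the frame buffer and values here are made up.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Metadata layout from the driver samples[3]; a real consumer
     * would discover this via the device's BTF, not hard-code it. */
    struct xdp_md_desc {
        uint32_t flow_mark;
        uint32_t hash32;
        uint16_t vlan;
    };

    int main(void)
    {
        uint8_t frame[2048];
        uint8_t *pkt = frame + 256;    /* packet data after headroom */
        struct xdp_md_desc md = { .hash32 = 0x12345678 };

        /* The driver writes the hints right before the packet data... */
        memcpy(pkt - sizeof md, &md, sizeof md);

        /* ...and the consumer, e.g. netdev_afxdp_rxq_recv(), reads them
         * back at a negative offset.  memcpy() avoids any unaligned
         * access through a struct pointer. */
        struct xdp_md_desc hints;
        memcpy(&hints, pkt - sizeof hints, sizeof hints);
        printf("rx hash from HW hint: 0x%08x\n", hints.hash32);
        return 0;
    }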

Using '$ ovs-appctl dpif-netdev/pmd-stats-show', the 'avg processing
cycles per packet' drops from 402 to 272.  More details below.

References:
-----------
[1] https://www.kernel.org/doc/html/latest/bpf/btf.html
[2] https://netdevconf.info/0x14/pub/slides/54/[1]%20XDP%20meta%20data%20acceleration.pdf
[3] https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/log/?h=topic/xdp_metadata4

Testbed:
--------
Two Xeon E5-2620 v3 2.4GHz machines connected back-to-back using
Mellanox ConnectX-6 Dx 25GbE NICs. Before starting OVS, enable the
metadata (MD) by:
$ bpftool net xdp show
xdp:
enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(0)
enp2s0f1np1(5) md_btf_id(2) md_btf_enabled(0)
$ bpftool net xdp set dev enp2s0f0np0 md_btf on
$ bpftool net xdp
xdp:
enp2s0f0np0(4) md_btf_id(1) md_btf_enabled(1)

Limitations/TODO:
-----------------
1. Only AF_XDP native mode is supported, not zero-copy mode.
2. Only three fields (vlan, hash, and flow_mark) are supported, and
   only the receive side has XDP metadata.
3. The control plane, i.e. how to enable the feature and probe the
   structure, is not upstream yet; see the sketch after this list.
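
As a rough illustration of what that probing could look like, here is a
sketch using recent libbpf BTF helpers.  It assumes the device exports
its metadata struct to BTF under the md_btf_id shown by bpftool above,
and that the struct is named 'xdp_md_desc'; both are conventions of
this RFC, not upstream APIs.

    #include <stdint.h>
    #include <stdio.h>
    #include <bpf/btf.h>

    /* Hypothetical probe: dump the members of the device's XDP
     * metadata struct, given the md_btf_id reported by bpftool.  The
     * libbpf calls are real; the type name is an assumption. */
    static int
    probe_xdp_md(uint32_t md_btf_id)
    {
        struct btf *btf = btf__load_from_kernel_by_id(md_btf_id);
        const struct btf_type *t;
        const struct btf_member *m;
        int32_t id;
        int i;

        if (!btf) {
            return -1;
        }
        id = btf__find_by_name_kind(btf, "xdp_md_desc", BTF_KIND_STRUCT);
        if (id < 0) {
            btf__free(btf);
            return -1;
        }
        t = btf__type_by_id(btf, id);
        m = btf_members(t);
        for (i = 0; i < btf_vlen(t); i++, m++) {
            printf("member %s at bit offset %u\n",
                   btf__name_by_offset(btf, m->name_off), m->offset);
        }
        btf__free(btf);
        return 0;
    }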

OVS rxdrop without HW hints:
----------------------------
Drop rate: 4.8Mpps

pmd thread numa_id 0 core_id 3:
  packets received: 196592006
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 196592006
  smc hits: 0
  megaflow hits: 0
  avg. subtable lookups per megaflow hit: 0.00
  miss with success upcall: 0
  miss with failed upcall: 0
  avg. packets per output batch: 0.00
  idle cycles: 56009063835 (41.43%)
  processing cycles: 79164971931 (58.57%)
  avg cycles per packet: 687.59 (135174035766/196592006)
  avg processing cycles per packet: 402.69 (79164971931/196592006)

pmd thread numa_id 0 core_id 3:
  Iterations:           339607649  (0.23 us/it)
  - Used TSC cycles: 188620512777  ( 99.9 % of total cycles)
  - idle iterations:    330697002  ( 40.3 % of used cycles)
  - busy iterations:      8910647  ( 59.7 % of used cycles)
  Rx packets:           285140031  (3624 Kpps, 395 cycles/pkt)
  Datapath passes:      285140031  (1.00 passes/pkt)
  - EMC hits:           285139999  (100.0 %)
  - SMC hits:                   0  (  0.0 %)
  - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
  - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:               0  (  0.0 %)
  Tx packets:                   0

Perf report:
  17.56%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
  14.39%  pmd-c03/id:11  ovs-vswitchd        [.] dp_netdev_process_rxq_port
  14.17%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_thread_main
  10.86%  pmd-c03/id:11  [vdso]              [.] __vdso_clock_gettime
  10.19%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_end_iteration
   7.71%  pmd-c03/id:11  ovs-vswitchd        [.] time_timespec__
   5.64%  pmd-c03/id:11  ovs-vswitchd        [.] time_usec
   3.88%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_get_class
   2.95%  pmd-c03/id:11  ovs-vswitchd        [.] netdev_rxq_recv
   2.78%  pmd-c03/id:11  libbpf.so.0.2.0     [.] xsk_socket__fd
   2.74%  pmd-c03/id:11  ovs-vswitchd        [.] pmd_perf_start_iteration
   2.11%  pmd-c03/id:11  libc-2.27.so        [.] __clock_gettime
   1.32%  pmd-c03/id:11  ovs-vswitchd        [.] xsk_socket__fd at plt

OVS rxdrop with HW hints:
-------------------------
Drop rate: 4.73Mpps

pmd thread numa_id 0 core_id 7:
  packets received: 13686880
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 13686880
  smc hits: 0
  megaflow hits: 0
  avg. subtable lookups per megaflow hit: 0.00
  miss with success upcall: 0
  miss with failed upcall: 0
  avg. packets per output batch: 0.00
  idle cycles: 3182105544 (46.02%)
  processing cycles: 3732023844 (53.98%)
  avg cycles per packet: 505.16 (6914129388/13686880)
  avg processing cycles per packet: 272.67 (3732023844/13686880)

pmd thread numa_id 0 core_id 7:

  Iterations:           392909539  (0.18 us/it)
  - Used TSC cycles: 167697342678  ( 99.9 % of total cycles)
  - idle iterations:    382539861  ( 46.0 % of used cycles)
  - busy iterations:     10369678  ( 54.0 % of used cycles)
  Rx packets:           331829656  (4743 Kpps, 273 cycles/pkt)
  Datapath passes:      331829656  (1.00 passes/pkt)
  - EMC hits:           331829656  (100.0 %)
  - SMC hits:                   0  (  0.0 %)
  - Megaflow hits:              0  (  0.0 %, 0.00 subtbl lookups/hit)
  - Upcalls:                    0  (  0.0 %, 0.0 us/upcall)
  - Lost upcalls:               0  (  0.0 %)
  Tx packets:                   0

Perf record/report:
  22.96%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_afxdp_rxq_recv
  10.43%  pmd-c07/id:8  ovs-vswitchd        [.] miniflow_extract
   7.20%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_init__
   7.00%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_input__
   6.79%  pmd-c07/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
   6.62%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_thread_main
   5.65%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
   5.04%  pmd-c07/id:8  [vdso]              [.] __vdso_clock_gettime
   3.60%  pmd-c07/id:8  ovs-vswitchd        [.] time_timespec__
   3.10%  pmd-c07/id:8  ovs-vswitchd        [.] umem_elem_push
   2.74%  pmd-c07/id:8  libc-2.27.so        [.] __memcmp_avx2_movbe
   2.62%  pmd-c07/id:8  ovs-vswitchd        [.] time_usec
   2.14%  pmd-c07/id:8  ovs-vswitchd        [.] dp_packet_use_afxdp
   1.58%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_rxq_recv
   1.47%  pmd-c07/id:8  ovs-vswitchd        [.] netdev_get_class
   1.34%  pmd-c07/id:8  ovs-vswitchd        [.] pmd_perf_start_iteration

Signed-off-by: William Tu <u9012063 at gmail.com>
---
 lib/netdev-afxdp.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index 482400d8d135..49881a8cc0cb 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -169,6 +169,17 @@ struct netdev_afxdp_tx_lock {
     );
 };
 
+/* FIXME:
+ * This should be done dynamically by querying the device's
+ * XDP metadata structure, e.g.:
+ *   $ bpftool net xdp md_btf cstyle dev enp2s0f0np0
+ */
+struct xdp_md_desc {
+    uint32_t flow_mark;
+    uint32_t hash32;
+    uint16_t vlan;
+};
+
 #ifdef HAVE_XDP_NEED_WAKEUP
 static inline void
 xsk_rx_wakeup_if_needed(struct xsk_umem_info *umem,
@@ -849,6 +860,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
         struct dp_packet_afxdp *xpacket;
         const struct xdp_desc *desc;
         struct dp_packet *packet;
+        struct xdp_md_desc *md;
         uint64_t addr, index;
         uint32_t len;
         char *pkt;
@@ -858,6 +870,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
         len = desc->len;
 
         pkt = xsk_umem__get_data(umem->buffer, addr);
+        md = (struct xdp_md_desc *) (pkt - sizeof *md);
         index = addr >> FRAME_SHIFT;
         xpacket = &umem->xpool.array[index];
         packet = &xpacket->packet;
@@ -868,6 +881,12 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
                             OVS_XDP_HEADROOM);
         dp_packet_set_size(packet, len);
 
+        /* FIXME: This should be done only after detecting
+         * whether XDP MD is enabled on the device, e.g.:
+         * $ bpftool net xdp set dev enp2s0f0np0 md_btf on
+         */
+        dp_packet_set_rss_hash(packet, md->hash32);
+
         /* Add packet into batch, increase batch->count. */
         dp_packet_batch_add(batch, packet);
 
-- 
2.7.4


