[ovs-dev] [PATCHv14 2/2] netdev-afxdp: add new netdev type for AF_XDP.

Ilya Maximets i.maximets at samsung.com
Fri Jul 5 15:32:42 UTC 2019


On 01.07.2019 19:08, William Tu wrote:
> The patch introduces experimental AF_XDP support for OVS netdev.
> AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
> type built upon the eBPF and XDP technology.  It aims to have comparable
> performance to DPDK but cooperate better with the existing kernel's
> networking stack.  An AF_XDP socket receives and sends packets from an
> eBPF/XDP program attached to the netdev, bypassing a couple of the Linux
> kernel's subsystems.  As a result, an AF_XDP socket shows much better
> performance than AF_PACKET.  For more details about AF_XDP, please see
> the Linux kernel's Documentation/networking/af_xdp.rst.  Note that by
> default, this feature is not compiled in.
> 
> Signed-off-by: William Tu <u9012063 at gmail.com>
> ---
> 
> v13-v14
> - Mainly address the issue reported by Ilya
>   https://patchwork.ozlabs.org/patch/1118972/
>   when doing 'make check-afxdp'
> - Fix xdp frame headroom issue
> - Fix vlan test cases by disabling txvlan offload
> - Skip cvlan
> - Document TCP limitation (currently all TCP tests fail due to
>   the kernel veth driver)

Maybe it's better to skip them too? An easy way could be:

diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at
index 5ee2ceb1a..f0683c0a9 100644
--- a/tests/system-afxdp-macros.at
+++ b/tests/system-afxdp-macros.at
@@ -30,3 +30,10 @@ m4_define([CONFIGURE_VETH_OFFLOADS],
      AT_CHECK([ethtool -K $1 txvlan off], [0], [ignore], [ignore])
     ]
 )
+
+# OVS_START_L7([namespace], [protocol])
+#
+# AF_XDP doesn't work with TCP over virtual interfaces for now.
+#
+m4_define([OVS_START_L7],
+   [AT_SKIP_IF([:])])
---
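
(In Autotest, AT_SKIP_IF skips the test when its shell condition succeeds,
and ':' always returns 0, so this would skip every L7 test unconditionally.)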

BTW, documentation should stay.

> - Fix tunnel test cases due to --disable-system (another patch)
> - Switch to use pthread_spin_lock, suggested by Ben
> - Add coverage counter for debugging
> - Fix buffer starvation issue in batch_send reported by Eelco
>   when using a tap device with type=afxdp
> - TODO:
>   'make check-afxdp' still does not fully pass.
>   IP fragmentation expiry test is not fixed yet; need to implement
>   deferred memory freeing, something like dpdk_mp_sweep.  Currently
>   hitting some missing umem descs when reclaiming.
> ---

Instead of inline comments I'll suggest an incremental patch (below) with
the following explanations:

* You still need to return EAGAIN in rxq_recv if the socket is not ready.
  Otherwise, the upper layers will count the call as successful.

* sendto() in SKB mode will send at most 16 packets, but our batch might
  contain up to 32.  If the batch size is larger than 16, the TX ring will
  be quickly exhausted and transmission will be blocked.  We have to kick
  the kernel up to n_packets / 16 times to be sure that all packets are
  transmitted (see the worked example after the patch).  This fixes my
  issue of zero traffic in a test with iperf between 2 namespaces while
  more than 2 traffic flows are involved.

* EAGAIN is a very likely case for sendto().

* Suggesting to reclaim all CONS_NUM_DESCS addresses from the completion
  queue each time instead of BATCH_SIZE addresses.  This guarantees that
  we'll have enough cq entries in all cases.  This is important because
  we may have many more than BATCH_SIZE packets in outstanding_tx.

* No need to kick_tx inside afxdp_complete_tx, since we now retry enough
  times while kicking on transmit.

* Suggesting a warning on an exhausted memory pool.  This is a bad case
  and should be reported.

* 'concurrent_txq' is a likely case, because we can't configure TX queues
  separately from RX queues, so we'll likely have fewer TX queues than
  threads --> 'concurrent_txq' will likely be true.
  I think we just need to remove the 'OVS_UNLIKELY' macro there.

* The worst case for the mpool size is larger than just the size of the
  rings.  There might be more packets currently under processing in OVS,
  stored in output batches in the datapath.  It's hard to estimate a good
  size.  Suggesting to just double it for now.
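
  For reference, a quick sizing check (assuming libbpf's defaults, where
  both XSK_RING_PROD__DEFAULT_NUM_DESCS and XSK_RING_CONS__DEFAULT_NUM_DESCS
  are 2048):

      NUM_FRAMES = 4 * (PROD_NUM_DESCS + CONS_NUM_DESCS)
                 = 4 * (2048 + 2048)
                 = 16384 frames

  i.e. twice the previous 8192, leaving half of the umem for packets queued
  in datapath output batches.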


diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c
index 44eb278e7..d5992f033 100644
--- a/lib/netdev-afxdp.c
+++ b/lib/netdev-afxdp.c
@@ -571,7 +571,7 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
 
     xsk_info = dev->xsks[qid];
     if (!xsk_info || !xsk_info->xsk) {
-        return 0;
+        return EAGAIN;
     }
 
     prepare_fill_queue(xsk_info);
@@ -628,21 +628,33 @@ netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
 }
 
 static inline int
-kick_tx(struct xsk_socket_info *xsk_info)
+kick_tx(struct xsk_socket_info *xsk_info, int xdpmode)
 {
-    int ret;
-
+    int ret, retries;
+    static const int KERNEL_TX_BATCH_SIZE = 16;
+
+    /* In SKB_MODE packet transmission is synchronous, and the kernel xmits
+     * only TX_BATCH_SIZE(16) packets for a single sendmsg syscall.
+     * So, we have to kick the kernel (n_packets / 16) times to be sure that
+     * all packets are transmitted. */
+    retries = (xdpmode == XDP_COPY)
+              ? xsk_info->outstanding_tx / KERNEL_TX_BATCH_SIZE
+              : 0;
+kick_retry:
     /* This causes system call into kernel's xsk_sendmsg, and
      * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
      */
     ret = sendto(xsk_socket__fd(xsk_info->xsk), NULL, 0, MSG_DONTWAIT,
                                 NULL, 0);
-    if (OVS_UNLIKELY(ret < 0)) {
+    if (ret < 0) {
+        if (retries-- && errno == EAGAIN) {
+            goto kick_retry;
+        }
         if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) {
             return errno;
         }
     }
-    /* no error, or EBUSY or EAGAIN */
+    /* No error, or EBUSY, or too many retries on EAGAIN. */
     return 0;
 }
 
@@ -715,7 +727,7 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)
     int tx_done, j;
 
     umem = xsk_info->umem;
-    tx_done = xsk_ring_cons__peek(&umem->cq, BATCH_SIZE, &idx_cq);
+    tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq);
 
     /* Recycle back to umem pool */
     for (j = 0; j < tx_done; j++) {
@@ -733,19 +745,19 @@ afxdp_complete_tx(struct xsk_socket_info *xsk_info)
         elems_push[tx_to_free] = elem;
         *addr = UINT64_MAX; /* Mark as pushed */
         tx_to_free++;
-    }
 
-    umem_elem_push_n(&umem->mpool, tx_to_free, (void **)elems_push);
+        if (tx_to_free == BATCH_SIZE || j == tx_done - 1) {
+            umem_elem_push_n(&umem->mpool, tx_to_free, (void **)elems_push);
+            xsk_info->outstanding_tx -= tx_to_free;
+            tx_to_free = 0;
+        }
+    }
 
     if (tx_done > 0) {
         xsk_ring_cons__release(&umem->cq, tx_done);
     } else {
         COVERAGE_INC(afxdp_cq_empty);
-        if (xsk_info->outstanding_tx) {
-            kick_tx(xsk_info);
-        }
     }
-    xsk_info->outstanding_tx -= tx_to_free;
 }
 
 static inline int
@@ -775,6 +787,8 @@ __netdev_afxdp_batch_send(struct netdev *netdev, int qid,
     ret = umem_elem_pop_n(&umem->mpool, batch->count, (void **)elems_pop);
     if (OVS_UNLIKELY(ret)) {
         xsk_info->tx_dropped += batch->count;
+        VLOG_WARN_RL(&rl, "%s: send failed due to exhausted memory pool",
+                     netdev_get_name(netdev));
         error = ENOMEM;
         goto out;
     }
@@ -784,9 +798,9 @@ __netdev_afxdp_batch_send(struct netdev *netdev, int qid,
     if (OVS_UNLIKELY(ret == 0)) {
         umem_elem_push_n(&umem->mpool, batch->count, (void **)elems_pop);
         xsk_info->tx_dropped += batch->count;
-        afxdp_complete_tx(xsk_info);
         COVERAGE_INC(afxdp_tx_full);
-        kick_tx(xsk_info);
+        afxdp_complete_tx(xsk_info);
+        kick_tx(xsk_info, dev->xdpmode);
         error = ENOMEM;
         goto out;
     }
@@ -810,10 +824,10 @@ __netdev_afxdp_batch_send(struct netdev *netdev, int qid,
     xsk_ring_prod__submit(&xsk_info->tx, batch->count);
     xsk_info->outstanding_tx += batch->count;
 
-    ret = kick_tx(xsk_info);
+    ret = kick_tx(xsk_info, dev->xdpmode);
     if (OVS_UNLIKELY(ret)) {
-        VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s",
-                     ovs_strerror(ret));
+        VLOG_WARN_RL(&rl, "%s: error sending AF_XDP packet: %s",
+                     netdev_get_name(netdev), ovs_strerror(ret));
     }
 
 out:
@@ -834,7 +848,7 @@ netdev_afxdp_batch_send(struct netdev *netdev, int qid,
     struct netdev_linux *dev;
     int ret;
 
-    if (OVS_UNLIKELY(concurrent_txq)) {
+    if (concurrent_txq) {
         dev = netdev_linux_cast(netdev);
         qid = qid % dev->up.n_txq;
 
diff --git a/lib/xdpsock.h b/lib/xdpsock.h
index 78759961d..88e9163ea 100644
--- a/lib/xdpsock.h
+++ b/lib/xdpsock.h
@@ -38,16 +38,19 @@
 #define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
 #define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
 
-/* The worst case is all 4 queues TX/CQ/RX/FILL are full.
- * Setting NUM_FRAMES to this makes sure umem_pop always successes.
+/* The worst case is all 4 queues TX/CQ/RX/FILL are full + some packets
+ * still on processing in threads. Number of packets currently in OVS
+ * processing is hard to estimate because it depends on number of ports.
+ * Setting NUM_FRAMES twice as large than total of ring sizes should be
+ * enough for most corner cases.
  */
-#define NUM_FRAMES      (2 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
+#define NUM_FRAMES      (4 * (PROD_NUM_DESCS + CONS_NUM_DESCS))
 
 #define BATCH_SIZE      NETDEV_MAX_BURST
 
 BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES));
 BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS);
-BUILD_ASSERT_DECL(NUM_FRAMES == 2 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
+BUILD_ASSERT_DECL(NUM_FRAMES == 4 * (PROD_NUM_DESCS + CONS_NUM_DESCS));
 
 /* LIFO ptr_array */
 struct umem_pool {
---
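
To put concrete numbers on the kick_tx() change above: for a full 32-packet
batch (BATCH_SIZE == NETDEV_MAX_BURST) in SKB mode, and assuming no other
outstanding descriptors, retries = 32 / 16 = 2, so we make at most 3
sendto() calls in total -- the initial one plus up to 2 retries on EAGAIN.
In driver mode retries stays 0 and a single kick is enough.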


A few additional questions/requests:

* Why do we need dev->xdp_flags and dev->xdp_bind_flags?  They are not
  used and mostly just duplicate xdpmode.

* Please don't declare two structure fields on the same line:

  int xdpmode, requested_xdpmode;
  int xdp_flags, xdp_bind_flags;
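
  i.e. one field per line, something like this (and the same treatment for
  xdp_flags/xdp_bind_flags, if they survive at all):

  int xdpmode;
  int requested_xdpmode;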


Best regards, Ilya Maximets.

