DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH RFC 0/3] AF_XDP Preferred Busy Polling
@ 2021-02-18  9:23 Ciara Loftus
  2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 1/3] net/af_xdp: Increase max batch size to 512 Ciara Loftus
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-02-18  9:23 UTC (permalink / raw)
  To: dev

Single-core performance of AF_XDP at high loads can be poor because
a heavily loaded NAPI context will never enter or allow for busy-polling.

1C testpmd rxonly (both IRQs and PMD on core 0):
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 -- --forward-mode=rxonly
0.088Mpps

In order to achieve decent performance at high loads, it is currently
recommended ensure the IRQs for the netdev and the core running the PMD are
different.

2C testpmd rxonly (IRQs on core 0, PMD on core 1):
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=0 -- \
--forward-mode=rxonly
19.26Mpps

However using an extra core is of course not ideal. The SO_PREFER_BUSY_POLL
socket option was introduced in kernel v5.11 to help improve 1C performance.
See [1].

This series sets this socket option on xsks created with DPDK (ie. instances of
the AF_XDP PMD) unless explicitly disabled or not supported by the kernel. It
was chosen to be enabled by default in order to bring the AF_XDP PMD in line
with most other PMDs which execute on a single core.

The following system and netdev settings are recommended in conjunction with
busy polling:
echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout

With the RFC these must be manually configured, but for the v1 these may be
set from the PMD since the performance is tightly coupled with these settings.

Re-running the 1C test with busy polling support and the above settings:
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 -- --forward-mode=rxonly
10.45Mpps

A new vdev arg is introduced called 'busy_budget' whose default value is 64.
busy_budget is the value supplied to the kernel with the SO_BUSY_POLL_BUDGET
socket option and represents the busy-polling NAPI budget ie. the number of
packets the kernel will attempt to process in the netdev's NAPI context. If set
to 0, preferred busy polling is disabled.

To set the busy budget to 256:
./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=256
14.06Mpps

To set the busy budget to 512:
./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=512
14.32Mpps

To disable preferred busy polling:
./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=0

[1] https://lwn.net/Articles/837010/


Ciara Loftus (3):
  net/af_xdp: Increase max batch size to 512
  net/af_xdp: Use recvfrom() instead of poll()
  net/af_xdp: preferred busy polling

 doc/guides/nics/af_xdp.rst          | 38 +++++++++++-
 drivers/net/af_xdp/compat.h         | 13 ++++
 drivers/net/af_xdp/rte_eth_af_xdp.c | 95 ++++++++++++++++++++++-------
 3 files changed, 124 insertions(+), 22 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH RFC 1/3] net/af_xdp: Increase max batch size to 512
  2021-02-18  9:23 [dpdk-dev] [PATCH RFC 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
@ 2021-02-18  9:23 ` Ciara Loftus
  2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-02-18  9:23 UTC (permalink / raw)
  To: dev

Prior to this the max size was 32 which was unnecessarily
small. Also enforce the max batch size for TX for both
copy and zero copy modes. Prior to this only copy mode
enforced the max size.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index b8d5ad0d91..b51db90204 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -66,8 +66,8 @@ RTE_LOG_REGISTER(af_xdp_logtype, pmd.net.af_xdp, NOTICE);
 #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
 #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
 
-#define ETH_AF_XDP_RX_BATCH_SIZE	32
-#define ETH_AF_XDP_TX_BATCH_SIZE	32
+#define ETH_AF_XDP_RX_BATCH_SIZE	512
+#define ETH_AF_XDP_TX_BATCH_SIZE	512
 
 
 struct xsk_umem_info {
@@ -535,8 +535,6 @@ af_xdp_tx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	uint32_t idx_tx;
 	struct xsk_ring_cons *cq = &txq->pair->cq;
 
-	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_TX_BATCH_SIZE);
-
 	pull_umem_cq(umem, nb_pkts, cq);
 
 	nb_pkts = rte_ring_dequeue_bulk(umem->buf_ring, addrs,
@@ -580,6 +578,8 @@ af_xdp_tx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 static uint16_t
 eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
+	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_TX_BATCH_SIZE);
+
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
 	return af_xdp_tx_zc(queue, bufs, nb_pkts);
 #else
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH RFC 2/3] net/af_xdp: Use recvfrom() instead of poll()
  2021-02-18  9:23 [dpdk-dev] [PATCH RFC 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 1/3] net/af_xdp: Increase max batch size to 512 Ciara Loftus
@ 2021-02-18  9:23 ` Ciara Loftus
  2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 3/3] net/af_xdp: preferred busy polling Ciara Loftus
  2021-02-24 11:18 ` [dpdk-dev] [PATCH 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  3 siblings, 0 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-02-18  9:23 UTC (permalink / raw)
  To: dev

poll() is more expensive and requires more tuning
when used with the upcoming busy polling functionality.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index b51db90204..34b15aa3d0 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -263,9 +263,9 @@ af_xdp_rx_zc(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	if (nb_pkts == 0) {
 #if defined(XDP_USE_NEED_WAKEUP)
 		if (xsk_ring_prod__needs_wakeup(fq))
-			(void)poll(rxq->fds, 1, 1000);
+			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
+				MSG_DONTWAIT, NULL, NULL);
 #endif
-
 		return 0;
 	}
 
@@ -336,7 +336,8 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	if (nb_pkts == 0) {
 #if defined(XDP_USE_NEED_WAKEUP)
 		if (xsk_ring_prod__needs_wakeup(fq))
-			(void)poll(rxq->fds, 1, 1000);
+			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
+				MSG_DONTWAIT, NULL, NULL);
 #endif
 		return 0;
 	}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH RFC 3/3] net/af_xdp: preferred busy polling
  2021-02-18  9:23 [dpdk-dev] [PATCH RFC 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 1/3] net/af_xdp: Increase max batch size to 512 Ciara Loftus
  2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
@ 2021-02-18  9:23 ` Ciara Loftus
  2021-02-24 11:18 ` [dpdk-dev] [PATCH 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  3 siblings, 0 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-02-18  9:23 UTC (permalink / raw)
  To: dev

This commit introduces support for preferred busy polling
to the AF_XDP PMD. This feature aims to improve single-core
performance for AF_XDP sockets under heavy load.

A new vdev arg is introduced called 'busy_budget' whose default
value is 64. busy_budget is the value supplied to the kernel
with the SO_BUSY_POLL_BUDGET socket option and represents the
busy-polling NAPI budget. To set the budget to a different value
eg. 256:

--vdev=net_af_xdp0,iface=eth0,busy_budget=256

Preferred busy polling is enabled by default provided a kernel with
version >= v5.11 is in use. To disable it, set the budget to zero.

The following settings are also strongly recommended to be used in
conjunction with this feature:

echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout

.. where eth0 is the interface being used by the PMD.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 doc/guides/nics/af_xdp.rst          | 38 +++++++++++++-
 drivers/net/af_xdp/compat.h         | 13 +++++
 drivers/net/af_xdp/rte_eth_af_xdp.c | 80 ++++++++++++++++++++++++-----
 3 files changed, 116 insertions(+), 15 deletions(-)

diff --git a/doc/guides/nics/af_xdp.rst b/doc/guides/nics/af_xdp.rst
index 5ed24374f8..8bf40b5f0f 100644
--- a/doc/guides/nics/af_xdp.rst
+++ b/doc/guides/nics/af_xdp.rst
@@ -35,6 +35,7 @@ The following options can be provided to set up an af_xdp port in DPDK.
 *   ``shared_umem`` - PMD will attempt to share UMEM with others (optional,
     default 0);
 *   ``xdp_prog`` - path to custom xdp program (optional, default none);
+*   ``busy_budget`` - busy polling budget (optional, default 64);
 
 Prerequisites
 -------------
@@ -51,6 +52,7 @@ This is a Linux-specific PMD, thus the following prerequisites apply:
 *  For shared_umem, it requires kernel version v5.10 or later and libbpf version
    v0.2.0 or later.
 *  For 32-bit OS, a kernel with version 5.4 or later is required.
+*  For busy polling, kernel version v5.11 or later is required.
 
 Set up an af_xdp interface
 -----------------------------
@@ -107,4 +109,38 @@ Limitations
   .. code-block:: console
 
     --vdev net_af_xdp0,iface=ens786f1,shared_umem=1 \
-    --vdev net_af_xdp1,iface=ens786f2,shared_umem=1 \
\ No newline at end of file
+    --vdev net_af_xdp1,iface=ens786f2,shared_umem=1 \
+
+- **Preferred Busy Polling**
+
+  The SO_PREFER_BUSY_POLL socket option was introduced in kernel v5.11. It can
+  deliver a performance improvement for sockets with heavy traffic loads and
+  can significantly improve single-core performance in this context.
+
+  The feature is enabled by default in the AF_XDP PMD. To disable it, set the
+  'busy_budget' vdevarg to zero:
+
+  .. code-block:: console
+
+    --vdev net_af_xdp0,iface=ens786f1,busy_budget=0
+
+  The default 'busy_budget' is 64 and it represents the number of packets the
+  kernel will attempt to process in the netdev's NAPI context. You can change
+  the value for example to 256 like so:
+
+  .. code-block:: console
+
+    --vdev net_af_xdp0,iface=ens786f1,busy_budget=256
+
+  It is also strongly recommended to set the following for optimal performance:
+
+  .. code-block:: console
+
+    echo 2 | sudo tee /sys/class/net/ens786f1/napi_defer_hard_irqs
+    echo 200000 | sudo tee /sys/class/net/ens786f1/gro_flush_timeout
+
+  The above defers interrupts for interface ens786f1 and instead schedules its
+  NAPI context from a watchdog timer instead of from softirqs. More information
+  on this feature can be found at [1].
+
+  [1] https://lwn.net/Articles/837010/
\ No newline at end of file
diff --git a/drivers/net/af_xdp/compat.h b/drivers/net/af_xdp/compat.h
index 7aa40d522e..1d247a50b2 100644
--- a/drivers/net/af_xdp/compat.h
+++ b/drivers/net/af_xdp/compat.h
@@ -39,3 +39,16 @@ create_shared_socket(struct xsk_socket **xsk_ptr __rte_unused,
 	return -1;
 }
 #endif
+
+#ifdef XDP_USE_NEED_WAKEUP
+static int
+syscall_needed(struct xsk_ring_prod *q, uint32_t busy_budget)
+{
+	return xsk_ring_prod__needs_wakeup(q) | busy_budget;
+}
+#else
+syscall_needed(struct xsk_ring_prod *q __rte_unused, uint32_t busy_budget)
+{
+	return busy_budget;
+}
+#endif
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 34b15aa3d0..c586f3042a 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -65,6 +65,8 @@ RTE_LOG_REGISTER(af_xdp_logtype, pmd.net.af_xdp, NOTICE);
 #define ETH_AF_XDP_DFLT_NUM_DESCS	XSK_RING_CONS__DEFAULT_NUM_DESCS
 #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
 #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
+#define ETH_AF_XDP_DFLT_BUSY_BUDGET	64
+#define ETH_AF_XDP_DFLT_BUSY_TIMEOUT	20
 
 #define ETH_AF_XDP_RX_BATCH_SIZE	512
 #define ETH_AF_XDP_TX_BATCH_SIZE	512
@@ -100,6 +102,7 @@ struct pkt_rx_queue {
 	struct pkt_tx_queue *pair;
 	struct pollfd fds[1];
 	int xsk_queue_idx;
+	uint32_t busy_budget;
 };
 
 struct tx_stats {
@@ -140,6 +143,7 @@ struct pmd_internals {
 #define ETH_AF_XDP_QUEUE_COUNT_ARG		"queue_count"
 #define ETH_AF_XDP_SHARED_UMEM_ARG		"shared_umem"
 #define ETH_AF_XDP_PROG_ARG			"xdp_prog"
+#define ETH_AF_XDP_BUDGET_ARG			"busy_budget"
 
 static const char * const valid_arguments[] = {
 	ETH_AF_XDP_IFACE_ARG,
@@ -147,6 +151,7 @@ static const char * const valid_arguments[] = {
 	ETH_AF_XDP_QUEUE_COUNT_ARG,
 	ETH_AF_XDP_SHARED_UMEM_ARG,
 	ETH_AF_XDP_PROG_ARG,
+	ETH_AF_XDP_BUDGET_ARG,
 	NULL
 };
 
@@ -261,11 +266,9 @@ af_xdp_rx_zc(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	nb_pkts = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);
 
 	if (nb_pkts == 0) {
-#if defined(XDP_USE_NEED_WAKEUP)
-		if (xsk_ring_prod__needs_wakeup(fq))
+		if (syscall_needed(&rxq->fq, rxq->busy_budget))
 			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
 				MSG_DONTWAIT, NULL, NULL);
-#endif
 		return 0;
 	}
 
@@ -334,11 +337,9 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 
 	nb_pkts = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);
 	if (nb_pkts == 0) {
-#if defined(XDP_USE_NEED_WAKEUP)
-		if (xsk_ring_prod__needs_wakeup(fq))
+		if (syscall_needed(&rxq->fq, rxq->busy_budget))
 			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
 				MSG_DONTWAIT, NULL, NULL);
-#endif
 		return 0;
 	}
 
@@ -422,9 +423,7 @@ kick_tx(struct pkt_tx_queue *txq, struct xsk_ring_cons *cq)
 
 	pull_umem_cq(umem, XSK_RING_CONS__DEFAULT_NUM_DESCS, cq);
 
-#if defined(XDP_USE_NEED_WAKEUP)
-	if (xsk_ring_prod__needs_wakeup(&txq->tx))
-#endif
+	if (syscall_needed(&txq->tx, txq->pair->busy_budget)) {
 		while (send(xsk_socket__fd(txq->pair->xsk), NULL,
 			    0, MSG_DONTWAIT) < 0) {
 			/* some thing unexpected */
@@ -437,6 +436,7 @@ kick_tx(struct pkt_tx_queue *txq, struct xsk_ring_cons *cq)
 					     XSK_RING_CONS__DEFAULT_NUM_DESCS,
 					     cq);
 		}
+	}
 }
 
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
@@ -1145,6 +1145,34 @@ xsk_configure(struct pmd_internals *internals, struct pkt_rx_queue *rxq,
 		goto err;
 	}
 
+#ifdef SO_PREFER_BUSY_POLL
+	int sock_opt = 1;
+	int fd = xsk_socket__fd(rxq->xsk);
+
+	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL, (void *)&sock_opt,
+			sizeof(sock_opt)) < 0) {
+		AF_XDP_LOG(ERR, "Failed to set SO_PREFER_BUSY_POLL\n");
+		goto err;
+	}
+
+	sock_opt = ETH_AF_XDP_DFLT_BUSY_TIMEOUT;
+	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void *)&sock_opt,
+			sizeof(sock_opt)) < 0) {
+		AF_XDP_LOG(ERR, "Failed to set SO_BUSY_POLL\n");
+		goto err;
+	}
+
+	sock_opt = rxq->busy_budget;
+	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET, (void *)&sock_opt,
+			sizeof(sock_opt)) < 0) {
+		AF_XDP_LOG(ERR, "Failed to set SO_BUSY_POLL_BUDGET\n");
+		goto err;
+	} else {
+		AF_XDP_LOG(INFO, "Busy polling budget set to: %u\n",
+					rxq->busy_budget);
+	}
+#endif
+
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
 	if (rte_pktmbuf_alloc_bulk(rxq->umem->mb_pool, fq_bufs, reserve_size)) {
 		AF_XDP_LOG(DEBUG, "Failed to get enough buffers for fq.\n");
@@ -1416,7 +1444,8 @@ xdp_get_channels_info(const char *if_name, int *max_queues,
 
 static int
 parse_parameters(struct rte_kvargs *kvlist, char *if_name, int *start_queue,
-			int *queue_cnt, int *shared_umem, char *prog_path)
+			int *queue_cnt, int *shared_umem, char *prog_path,
+			int *busy_budget)
 {
 	int ret;
 
@@ -1447,6 +1476,11 @@ parse_parameters(struct rte_kvargs *kvlist, char *if_name, int *start_queue,
 	if (ret < 0)
 		goto free_kvlist;
 
+	ret = rte_kvargs_process(kvlist, ETH_AF_XDP_BUDGET_ARG,
+				&parse_integer_arg, busy_budget);
+	if (ret < 0)
+		goto free_kvlist;
+
 free_kvlist:
 	rte_kvargs_free(kvlist);
 	return ret;
@@ -1485,7 +1519,7 @@ get_iface_info(const char *if_name,
 static struct rte_eth_dev *
 init_internals(struct rte_vdev_device *dev, const char *if_name,
 		int start_queue_idx, int queue_cnt, int shared_umem,
-		const char *prog_path)
+		const char *prog_path, int busy_budget)
 {
 	const char *name = rte_vdev_device_name(dev);
 	const unsigned int numa_node = dev->device.numa_node;
@@ -1546,6 +1580,7 @@ init_internals(struct rte_vdev_device *dev, const char *if_name,
 		internals->rx_queues[i].pair = &internals->tx_queues[i];
 		internals->rx_queues[i].xsk_queue_idx = start_queue_idx + i;
 		internals->tx_queues[i].xsk_queue_idx = start_queue_idx + i;
+		internals->rx_queues[i].busy_budget = busy_budget;
 	}
 
 	ret = get_iface_info(if_name, &internals->eth_addr,
@@ -1589,6 +1624,7 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 	int xsk_queue_cnt = ETH_AF_XDP_DFLT_QUEUE_COUNT;
 	int shared_umem = 0;
 	char prog_path[PATH_MAX] = {'\0'};
+	int busy_budget = -1;
 	struct rte_eth_dev *eth_dev = NULL;
 	const char *name;
 
@@ -1618,7 +1654,8 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 		dev->device.numa_node = rte_socket_id();
 
 	if (parse_parameters(kvlist, if_name, &xsk_start_queue_idx,
-			     &xsk_queue_cnt, &shared_umem, prog_path) < 0) {
+			     &xsk_queue_cnt, &shared_umem, prog_path,
+			     &busy_budget) < 0) {
 		AF_XDP_LOG(ERR, "Invalid kvargs value\n");
 		return -EINVAL;
 	}
@@ -1628,8 +1665,22 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 		return -EINVAL;
 	}
 
+#ifdef SO_PREFER_BUSY_POLL
+	busy_budget = busy_budget == -1 ? ETH_AF_XDP_DFLT_BUSY_BUDGET :
+					busy_budget;
+	if (!busy_budget)
+		AF_XDP_LOG(ERR, "Preferred busy polling disabled\n");
+#else
+	if (busy_budget > 0) {
+		AF_XDP_LOG(ERR, "Kernel does not support SO_PREFER_BUSY_POLL\n");
+		return -ENOTSUP;
+	}
+	busy_budget = 0;
+#endif
+
 	eth_dev = init_internals(dev, if_name, xsk_start_queue_idx,
-					xsk_queue_cnt, shared_umem, prog_path);
+					xsk_queue_cnt, shared_umem, prog_path,
+					busy_budget);
 	if (eth_dev == NULL) {
 		AF_XDP_LOG(ERR, "Failed to init internals\n");
 		return -1;
@@ -1674,4 +1725,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_af_xdp,
 			      "start_queue=<int> "
 			      "queue_count=<int> "
 			      "shared_umem=<int> "
-			      "xdp_prog=<string> ");
+			      "xdp_prog=<string> "
+			      "busy_budget=<int>");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH 0/3] AF_XDP Preferred Busy Polling
  2021-02-18  9:23 [dpdk-dev] [PATCH RFC 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
                   ` (2 preceding siblings ...)
  2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 3/3] net/af_xdp: preferred busy polling Ciara Loftus
@ 2021-02-24 11:18 ` Ciara Loftus
  2021-02-24 11:18   ` [dpdk-dev] [PATCH 1/3] net/af_xdp: Increase max batch size to 512 Ciara Loftus
                     ` (3 more replies)
  3 siblings, 4 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-02-24 11:18 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

Single-core performance of AF_XDP at high loads can be poor because
a heavily loaded NAPI context will never enter or allow for busy-polling.

1C testpmd rxonly (both IRQs and PMD on core 0):
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=1 -- \
--forward-mode=rxonly
0.088Mpps

In order to achieve decent performance at high loads, it is currently
recommended ensure the IRQs for the netdev queue and the core running
the PMD are different.

2C testpmd rxonly (IRQs on core 0, PMD on core 1):
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=0 -- \
--forward-mode=rxonly
19.26Mpps

However using an extra core is of course not ideal. The SO_PREFER_BUSY_POLL
socket option was introduced in kernel v5.11 to help improve 1C performance.
See [1].

This series sets this socket option on xsks created with DPDK (ie. instances of
the AF_XDP PMD) unless explicitly disabled or not supported by the kernel. It
was chosen to be enabled by default in order to bring the AF_XDP PMD in line
with most other PMDs which execute on a single core.

The following system and netdev settings are recommended in conjunction with
busy polling:
echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout

Re-running the 1C test with busy polling support and the above settings:
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=1 -- \
--forward-mode=rxonly
10.45Mpps

A new vdev arg is introduced called 'busy_budget' whose default value is 64.
busy_budget is the value supplied to the kernel with the SO_BUSY_POLL_BUDGET
socket option and represents the busy-polling NAPI budget ie. the number of
packets the kernel will attempt to process in the netdev's NAPI context.

To set the busy budget to 256:
./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=256
14.06Mpps

If you still wish to run using 2 cores (one for PMD once for IRQs) it is
recommended to disable busy polling to achieve optimal 2C performance:
./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=0
19.09Mpps

RFC->v1:
* Fixed behaviour of busy_budget=0
* Ensure we bail out if any of the new setsockopts fail

[1] https://lwn.net/Articles/837010/

Ciara Loftus (3):
  net/af_xdp: Increase max batch size to 512
  net/af_xdp: Use recvfrom() instead of poll()
  net/af_xdp: preferred busy polling

 doc/guides/nics/af_xdp.rst          |  38 ++++++++++-
 drivers/net/af_xdp/compat.h         |  13 ++++
 drivers/net/af_xdp/rte_eth_af_xdp.c | 100 ++++++++++++++++++++++------
 3 files changed, 129 insertions(+), 22 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH 1/3] net/af_xdp: Increase max batch size to 512
  2021-02-24 11:18 ` [dpdk-dev] [PATCH 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
@ 2021-02-24 11:18   ` Ciara Loftus
  2021-03-01 16:31     ` Ferruh Yigit
  2021-02-24 11:18   ` [dpdk-dev] [PATCH 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Ciara Loftus @ 2021-02-24 11:18 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

Prior to this the max size was 32 which was unnecessarily
small. Also enforce the max batch size for TX for both
copy and zero copy modes. Prior to this only copy mode
enforced the max size.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index b8d5ad0d91..b51db90204 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -66,8 +66,8 @@ RTE_LOG_REGISTER(af_xdp_logtype, pmd.net.af_xdp, NOTICE);
 #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
 #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
 
-#define ETH_AF_XDP_RX_BATCH_SIZE	32
-#define ETH_AF_XDP_TX_BATCH_SIZE	32
+#define ETH_AF_XDP_RX_BATCH_SIZE	512
+#define ETH_AF_XDP_TX_BATCH_SIZE	512
 
 
 struct xsk_umem_info {
@@ -535,8 +535,6 @@ af_xdp_tx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	uint32_t idx_tx;
 	struct xsk_ring_cons *cq = &txq->pair->cq;
 
-	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_TX_BATCH_SIZE);
-
 	pull_umem_cq(umem, nb_pkts, cq);
 
 	nb_pkts = rte_ring_dequeue_bulk(umem->buf_ring, addrs,
@@ -580,6 +578,8 @@ af_xdp_tx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 static uint16_t
 eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
+	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_TX_BATCH_SIZE);
+
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
 	return af_xdp_tx_zc(queue, bufs, nb_pkts);
 #else
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH 2/3] net/af_xdp: Use recvfrom() instead of poll()
  2021-02-24 11:18 ` [dpdk-dev] [PATCH 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  2021-02-24 11:18   ` [dpdk-dev] [PATCH 1/3] net/af_xdp: Increase max batch size to 512 Ciara Loftus
@ 2021-02-24 11:18   ` Ciara Loftus
  2021-02-24 11:18   ` [dpdk-dev] [PATCH 3/3] net/af_xdp: preferred busy polling Ciara Loftus
  2021-03-09 10:19   ` [dpdk-dev] [PATCH v2 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  3 siblings, 0 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-02-24 11:18 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

poll() is more expensive and requires more tuning
when used with the upcoming busy polling functionality.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index b51db90204..34b15aa3d0 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -263,9 +263,9 @@ af_xdp_rx_zc(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	if (nb_pkts == 0) {
 #if defined(XDP_USE_NEED_WAKEUP)
 		if (xsk_ring_prod__needs_wakeup(fq))
-			(void)poll(rxq->fds, 1, 1000);
+			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
+				MSG_DONTWAIT, NULL, NULL);
 #endif
-
 		return 0;
 	}
 
@@ -336,7 +336,8 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	if (nb_pkts == 0) {
 #if defined(XDP_USE_NEED_WAKEUP)
 		if (xsk_ring_prod__needs_wakeup(fq))
-			(void)poll(rxq->fds, 1, 1000);
+			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
+				MSG_DONTWAIT, NULL, NULL);
 #endif
 		return 0;
 	}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH 3/3] net/af_xdp: preferred busy polling
  2021-02-24 11:18 ` [dpdk-dev] [PATCH 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  2021-02-24 11:18   ` [dpdk-dev] [PATCH 1/3] net/af_xdp: Increase max batch size to 512 Ciara Loftus
  2021-02-24 11:18   ` [dpdk-dev] [PATCH 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
@ 2021-02-24 11:18   ` Ciara Loftus
  2021-03-01 16:32     ` Ferruh Yigit
  2021-03-09 10:19   ` [dpdk-dev] [PATCH v2 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  3 siblings, 1 reply; 27+ messages in thread
From: Ciara Loftus @ 2021-02-24 11:18 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

This commit introduces support for preferred busy polling
to the AF_XDP PMD. This feature aims to improve single-core
performance for AF_XDP sockets under heavy load.

A new vdev arg is introduced called 'busy_budget' whose default
value is 64. busy_budget is the value supplied to the kernel
with the SO_BUSY_POLL_BUDGET socket option and represents the
busy-polling NAPI budget. To set the budget to a different value
eg. 256:

--vdev=net_af_xdp0,iface=eth0,busy_budget=256

Preferred busy polling is enabled by default provided a kernel with
version >= v5.11 is in use. To disable it, set the budget to zero.

The following settings are also strongly recommended to be used in
conjunction with this feature:

echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout

.. where eth0 is the interface being used by the PMD.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 doc/guides/nics/af_xdp.rst          | 38 ++++++++++++-
 drivers/net/af_xdp/compat.h         | 13 +++++
 drivers/net/af_xdp/rte_eth_af_xdp.c | 85 ++++++++++++++++++++++++-----
 3 files changed, 121 insertions(+), 15 deletions(-)

diff --git a/doc/guides/nics/af_xdp.rst b/doc/guides/nics/af_xdp.rst
index 5ed24374f8..8bf40b5f0f 100644
--- a/doc/guides/nics/af_xdp.rst
+++ b/doc/guides/nics/af_xdp.rst
@@ -35,6 +35,7 @@ The following options can be provided to set up an af_xdp port in DPDK.
 *   ``shared_umem`` - PMD will attempt to share UMEM with others (optional,
     default 0);
 *   ``xdp_prog`` - path to custom xdp program (optional, default none);
+*   ``busy_budget`` - busy polling budget (optional, default 64);
 
 Prerequisites
 -------------
@@ -51,6 +52,7 @@ This is a Linux-specific PMD, thus the following prerequisites apply:
 *  For shared_umem, it requires kernel version v5.10 or later and libbpf version
    v0.2.0 or later.
 *  For 32-bit OS, a kernel with version 5.4 or later is required.
+*  For busy polling, kernel version v5.11 or later is required.
 
 Set up an af_xdp interface
 -----------------------------
@@ -107,4 +109,38 @@ Limitations
   .. code-block:: console
 
     --vdev net_af_xdp0,iface=ens786f1,shared_umem=1 \
-    --vdev net_af_xdp1,iface=ens786f2,shared_umem=1 \
\ No newline at end of file
+    --vdev net_af_xdp1,iface=ens786f2,shared_umem=1 \
+
+- **Preferred Busy Polling**
+
+  The SO_PREFER_BUSY_POLL socket option was introduced in kernel v5.11. It can
+  deliver a performance improvement for sockets with heavy traffic loads and
+  can significantly improve single-core performance in this context.
+
+  The feature is enabled by default in the AF_XDP PMD. To disable it, set the
+  'busy_budget' vdevarg to zero:
+
+  .. code-block:: console
+
+    --vdev net_af_xdp0,iface=ens786f1,busy_budget=0
+
+  The default 'busy_budget' is 64 and it represents the number of packets the
+  kernel will attempt to process in the netdev's NAPI context. You can change
+  the value for example to 256 like so:
+
+  .. code-block:: console
+
+    --vdev net_af_xdp0,iface=ens786f1,busy_budget=256
+
+  It is also strongly recommended to set the following for optimal performance:
+
+  .. code-block:: console
+
+    echo 2 | sudo tee /sys/class/net/ens786f1/napi_defer_hard_irqs
+    echo 200000 | sudo tee /sys/class/net/ens786f1/gro_flush_timeout
+
+  The above defers interrupts for interface ens786f1 and instead schedules its
+  NAPI context from a watchdog timer instead of from softirqs. More information
+  on this feature can be found at [1].
+
+  [1] https://lwn.net/Articles/837010/
\ No newline at end of file
diff --git a/drivers/net/af_xdp/compat.h b/drivers/net/af_xdp/compat.h
index 7aa40d522e..1d247a50b2 100644
--- a/drivers/net/af_xdp/compat.h
+++ b/drivers/net/af_xdp/compat.h
@@ -39,3 +39,16 @@ create_shared_socket(struct xsk_socket **xsk_ptr __rte_unused,
 	return -1;
 }
 #endif
+
+#ifdef XDP_USE_NEED_WAKEUP
+static int
+syscall_needed(struct xsk_ring_prod *q, uint32_t busy_budget)
+{
+	return xsk_ring_prod__needs_wakeup(q) | busy_budget;
+}
+#else
+syscall_needed(struct xsk_ring_prod *q __rte_unused, uint32_t busy_budget)
+{
+	return busy_budget;
+}
+#endif
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 34b15aa3d0..5091239b38 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -65,6 +65,8 @@ RTE_LOG_REGISTER(af_xdp_logtype, pmd.net.af_xdp, NOTICE);
 #define ETH_AF_XDP_DFLT_NUM_DESCS	XSK_RING_CONS__DEFAULT_NUM_DESCS
 #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
 #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
+#define ETH_AF_XDP_DFLT_BUSY_BUDGET	64
+#define ETH_AF_XDP_DFLT_BUSY_TIMEOUT	20
 
 #define ETH_AF_XDP_RX_BATCH_SIZE	512
 #define ETH_AF_XDP_TX_BATCH_SIZE	512
@@ -100,6 +102,7 @@ struct pkt_rx_queue {
 	struct pkt_tx_queue *pair;
 	struct pollfd fds[1];
 	int xsk_queue_idx;
+	uint32_t busy_budget;
 };
 
 struct tx_stats {
@@ -140,6 +143,7 @@ struct pmd_internals {
 #define ETH_AF_XDP_QUEUE_COUNT_ARG		"queue_count"
 #define ETH_AF_XDP_SHARED_UMEM_ARG		"shared_umem"
 #define ETH_AF_XDP_PROG_ARG			"xdp_prog"
+#define ETH_AF_XDP_BUDGET_ARG			"busy_budget"
 
 static const char * const valid_arguments[] = {
 	ETH_AF_XDP_IFACE_ARG,
@@ -147,6 +151,7 @@ static const char * const valid_arguments[] = {
 	ETH_AF_XDP_QUEUE_COUNT_ARG,
 	ETH_AF_XDP_SHARED_UMEM_ARG,
 	ETH_AF_XDP_PROG_ARG,
+	ETH_AF_XDP_BUDGET_ARG,
 	NULL
 };
 
@@ -261,11 +266,9 @@ af_xdp_rx_zc(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	nb_pkts = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);
 
 	if (nb_pkts == 0) {
-#if defined(XDP_USE_NEED_WAKEUP)
-		if (xsk_ring_prod__needs_wakeup(fq))
+		if (syscall_needed(&rxq->fq, rxq->busy_budget))
 			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
 				MSG_DONTWAIT, NULL, NULL);
-#endif
 		return 0;
 	}
 
@@ -334,11 +337,9 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 
 	nb_pkts = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);
 	if (nb_pkts == 0) {
-#if defined(XDP_USE_NEED_WAKEUP)
-		if (xsk_ring_prod__needs_wakeup(fq))
+		if (syscall_needed(&rxq->fq, rxq->busy_budget))
 			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
 				MSG_DONTWAIT, NULL, NULL);
-#endif
 		return 0;
 	}
 
@@ -422,9 +423,7 @@ kick_tx(struct pkt_tx_queue *txq, struct xsk_ring_cons *cq)
 
 	pull_umem_cq(umem, XSK_RING_CONS__DEFAULT_NUM_DESCS, cq);
 
-#if defined(XDP_USE_NEED_WAKEUP)
-	if (xsk_ring_prod__needs_wakeup(&txq->tx))
-#endif
+	if (syscall_needed(&txq->tx, txq->pair->busy_budget)) {
 		while (send(xsk_socket__fd(txq->pair->xsk), NULL,
 			    0, MSG_DONTWAIT) < 0) {
 			/* some thing unexpected */
@@ -437,6 +436,7 @@ kick_tx(struct pkt_tx_queue *txq, struct xsk_ring_cons *cq)
 					     XSK_RING_CONS__DEFAULT_NUM_DESCS,
 					     cq);
 		}
+	}
 }
 
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
@@ -1145,6 +1145,39 @@ xsk_configure(struct pmd_internals *internals, struct pkt_rx_queue *rxq,
 		goto err;
 	}
 
+#ifdef SO_PREFER_BUSY_POLL
+	if (rxq->busy_budget) {
+		int sock_opt = 1;
+		int fd = xsk_socket__fd(rxq->xsk);
+
+		ret = setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
+				(void *)&sock_opt, sizeof(sock_opt));
+		if (ret < 0) {
+			AF_XDP_LOG(ERR, "Failed to set SO_PREFER_BUSY_POLL\n");
+			goto err;
+		}
+
+		sock_opt = ETH_AF_XDP_DFLT_BUSY_TIMEOUT;
+		ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void *)&sock_opt,
+				sizeof(sock_opt));
+		if (ret < 0) {
+			AF_XDP_LOG(ERR, "Failed to set SO_BUSY_POLL\n");
+			goto err;
+		}
+
+		sock_opt = rxq->busy_budget;
+		ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
+				(void *)&sock_opt, sizeof(sock_opt));
+		if (ret < 0) {
+			AF_XDP_LOG(ERR, "Failed to set SO_BUSY_POLL_BUDGET\n");
+			goto err;
+		} else {
+			AF_XDP_LOG(INFO, "Busy polling budget set to: %u\n",
+						rxq->busy_budget);
+		}
+	}
+#endif
+
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
 	if (rte_pktmbuf_alloc_bulk(rxq->umem->mb_pool, fq_bufs, reserve_size)) {
 		AF_XDP_LOG(DEBUG, "Failed to get enough buffers for fq.\n");
@@ -1416,7 +1449,8 @@ xdp_get_channels_info(const char *if_name, int *max_queues,
 
 static int
 parse_parameters(struct rte_kvargs *kvlist, char *if_name, int *start_queue,
-			int *queue_cnt, int *shared_umem, char *prog_path)
+			int *queue_cnt, int *shared_umem, char *prog_path,
+			int *busy_budget)
 {
 	int ret;
 
@@ -1447,6 +1481,11 @@ parse_parameters(struct rte_kvargs *kvlist, char *if_name, int *start_queue,
 	if (ret < 0)
 		goto free_kvlist;
 
+	ret = rte_kvargs_process(kvlist, ETH_AF_XDP_BUDGET_ARG,
+				&parse_integer_arg, busy_budget);
+	if (ret < 0)
+		goto free_kvlist;
+
 free_kvlist:
 	rte_kvargs_free(kvlist);
 	return ret;
@@ -1485,7 +1524,7 @@ get_iface_info(const char *if_name,
 static struct rte_eth_dev *
 init_internals(struct rte_vdev_device *dev, const char *if_name,
 		int start_queue_idx, int queue_cnt, int shared_umem,
-		const char *prog_path)
+		const char *prog_path, int busy_budget)
 {
 	const char *name = rte_vdev_device_name(dev);
 	const unsigned int numa_node = dev->device.numa_node;
@@ -1546,6 +1585,7 @@ init_internals(struct rte_vdev_device *dev, const char *if_name,
 		internals->rx_queues[i].pair = &internals->tx_queues[i];
 		internals->rx_queues[i].xsk_queue_idx = start_queue_idx + i;
 		internals->tx_queues[i].xsk_queue_idx = start_queue_idx + i;
+		internals->rx_queues[i].busy_budget = busy_budget;
 	}
 
 	ret = get_iface_info(if_name, &internals->eth_addr,
@@ -1589,6 +1629,7 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 	int xsk_queue_cnt = ETH_AF_XDP_DFLT_QUEUE_COUNT;
 	int shared_umem = 0;
 	char prog_path[PATH_MAX] = {'\0'};
+	int busy_budget = -1;
 	struct rte_eth_dev *eth_dev = NULL;
 	const char *name;
 
@@ -1618,7 +1659,8 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 		dev->device.numa_node = rte_socket_id();
 
 	if (parse_parameters(kvlist, if_name, &xsk_start_queue_idx,
-			     &xsk_queue_cnt, &shared_umem, prog_path) < 0) {
+			     &xsk_queue_cnt, &shared_umem, prog_path,
+			     &busy_budget) < 0) {
 		AF_XDP_LOG(ERR, "Invalid kvargs value\n");
 		return -EINVAL;
 	}
@@ -1628,8 +1670,22 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 		return -EINVAL;
 	}
 
+#ifdef SO_PREFER_BUSY_POLL
+	busy_budget = busy_budget == -1 ? ETH_AF_XDP_DFLT_BUSY_BUDGET :
+					busy_budget;
+	if (!busy_budget)
+		AF_XDP_LOG(ERR, "Preferred busy polling disabled\n");
+#else
+	if (busy_budget > 0) {
+		AF_XDP_LOG(ERR, "Kernel does not support SO_PREFER_BUSY_POLL\n");
+		return -ENOTSUP;
+	}
+	busy_budget = 0;
+#endif
+
 	eth_dev = init_internals(dev, if_name, xsk_start_queue_idx,
-					xsk_queue_cnt, shared_umem, prog_path);
+					xsk_queue_cnt, shared_umem, prog_path,
+					busy_budget);
 	if (eth_dev == NULL) {
 		AF_XDP_LOG(ERR, "Failed to init internals\n");
 		return -1;
@@ -1674,4 +1730,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_af_xdp,
 			      "start_queue=<int> "
 			      "queue_count=<int> "
 			      "shared_umem=<int> "
-			      "xdp_prog=<string> ");
+			      "xdp_prog=<string> "
+			      "busy_budget=<int>");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] net/af_xdp: Increase max batch size to 512
  2021-02-24 11:18   ` [dpdk-dev] [PATCH 1/3] net/af_xdp: Increase max batch size to 512 Ciara Loftus
@ 2021-03-01 16:31     ` Ferruh Yigit
  2021-03-03 15:07       ` Loftus, Ciara
  0 siblings, 1 reply; 27+ messages in thread
From: Ferruh Yigit @ 2021-03-01 16:31 UTC (permalink / raw)
  To: Ciara Loftus, dev

On 2/24/2021 11:18 AM, Ciara Loftus wrote:
> Prior to this the max size was 32 which was unnecessarily
> small. 

Can you please describe the impact? Why changed from 32 to 512?
I assume this is to improve the performance but can you please explicitly 
document it in the commit log?

> Also enforce the max batch size for TX for both
> copy and zero copy modes. Prior to this only copy mode
> enforced the max size.
> 

By enforcing, the PMD ignores the user provided burst value if it is more than 
PMS supported MAX, and this ignoring is done in silent. Also there is no way to 
discover this MAX value without checking the code.

Overall, why this max values are required at all? After quick check I can see 
they are used for some bulk operations, which I assume can be eliminated, what 
do you think?

> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>

<...>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 3/3] net/af_xdp: preferred busy polling
  2021-02-24 11:18   ` [dpdk-dev] [PATCH 3/3] net/af_xdp: preferred busy polling Ciara Loftus
@ 2021-03-01 16:32     ` Ferruh Yigit
  2021-03-04 12:26       ` Loftus, Ciara
  0 siblings, 1 reply; 27+ messages in thread
From: Ferruh Yigit @ 2021-03-01 16:32 UTC (permalink / raw)
  To: Ciara Loftus, dev

On 2/24/2021 11:18 AM, Ciara Loftus wrote:
> This commit introduces support for preferred busy polling
> to the AF_XDP PMD. This feature aims to improve single-core
> performance for AF_XDP sockets under heavy load.
> 
> A new vdev arg is introduced called 'busy_budget' whose default
> value is 64. busy_budget is the value supplied to the kernel
> with the SO_BUSY_POLL_BUDGET socket option and represents the
> busy-polling NAPI budget. To set the budget to a different value
> eg. 256:
> 
> --vdev=net_af_xdp0,iface=eth0,busy_budget=256
> 
> Preferred busy polling is enabled by default provided a kernel with
> version >= v5.11 is in use. To disable it, set the budget to zero.
> 
> The following settings are also strongly recommended to be used in
> conjunction with this feature:
> 
> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> 
> .. where eth0 is the interface being used by the PMD.
> 
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> ---
>   doc/guides/nics/af_xdp.rst          | 38 ++++++++++++-
>   drivers/net/af_xdp/compat.h         | 13 +++++
>   drivers/net/af_xdp/rte_eth_af_xdp.c | 85 ++++++++++++++++++++++++-----
>   3 files changed, 121 insertions(+), 15 deletions(-)

Can you please update the release notes too to announce the feature?

<...>


> @@ -39,3 +39,16 @@ create_shared_socket(struct xsk_socket **xsk_ptr __rte_unused,
>   	return -1;
>   }
>   #endif
> +
> +#ifdef XDP_USE_NEED_WAKEUP
> +static int
> +syscall_needed(struct xsk_ring_prod *q, uint32_t busy_budget)
> +{
> +	return xsk_ring_prod__needs_wakeup(q) | busy_budget;
> +}
> +#else
> +syscall_needed(struct xsk_ring_prod *q __rte_unused, uint32_t busy_budget)
> +{
> +	return busy_budget;
> +}

Is the return type missing in the definition?

Also for the case when both 'XDP_USE_NEED_WAKEUP' & 'SO_PREFER_BUSY_POLL' this 
function will always return '0', but current implementation doesn't know this in 
the compile time and compiler can't optimize for it, do you think does it make 
sense to do this optimization?

<...>

> @@ -1628,8 +1670,22 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
>   		return -EINVAL;
>   	}
>   
> +#ifdef SO_PREFER_BUSY_POLL
> +	busy_budget = busy_budget == -1 ? ETH_AF_XDP_DFLT_BUSY_BUDGET :
> +					busy_budget;
> +	if (!busy_budget)
> +		AF_XDP_LOG(ERR, "Preferred busy polling disabled\n");

Is this an error case? What do you think changing the log level to DEBUG or INFO?

Also how these compile time flags will work if the compiled environment and run 
environment kernel version are different and incompatible?

Overall can it be possible to detect the support on runtime via 'setsockopt()' 
without compile time macros and eliminate the compile time flags? Does it make 
sense?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] net/af_xdp: Increase max batch size to 512
  2021-03-01 16:31     ` Ferruh Yigit
@ 2021-03-03 15:07       ` Loftus, Ciara
  2021-03-03 15:38         ` Ferruh Yigit
  0 siblings, 1 reply; 27+ messages in thread
From: Loftus, Ciara @ 2021-03-03 15:07 UTC (permalink / raw)
  To: Yigit, Ferruh; +Cc: dev

> 
> On 2/24/2021 11:18 AM, Ciara Loftus wrote:
> > Prior to this the max size was 32 which was unnecessarily
> > small.
> 
> Can you please describe the impact? Why changed from 32 to 512?
> I assume this is to improve the performance but can you please explicitly
> document it in the commit log?

Indeed - improved performance due to bulk operations and fewer ring accesses and syscalls.
The value 512 was arbitrary. I will change this to the default ring size as defined by libbpf (2048) in v2.
Will update the commit log with this info.

> 
> > Also enforce the max batch size for TX for both
> > copy and zero copy modes. Prior to this only copy mode
> > enforced the max size.
> >
> 
> By enforcing, the PMD ignores the user provided burst value if it is more than
> PMS supported MAX, and this ignoring is done in silent. Also there is no way
> to
> discover this MAX value without checking the code.
> 
> Overall, why this max values are required at all? After quick check I can see
> they are used for some bulk operations, which I assume can be eliminated,
> what
> do you think?

We need to size some arrays at compile time with this max value.

Instead of removing the bulk operations which may impact performance, how about taking an approach where we split batches that are > 2048 into smaller batches and still handle all the packets instead of discarding those > 2048. Something like what's done in ixgbe for example:
http://code.dpdk.org/dpdk/v21.02/source/drivers/net/ixgbe/ixgbe_rxtx.c#L318

Thanks,
Ciara

> 
> > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> 
> <...>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] net/af_xdp: Increase max batch size to 512
  2021-03-03 15:07       ` Loftus, Ciara
@ 2021-03-03 15:38         ` Ferruh Yigit
  0 siblings, 0 replies; 27+ messages in thread
From: Ferruh Yigit @ 2021-03-03 15:38 UTC (permalink / raw)
  To: Loftus, Ciara; +Cc: dev

On 3/3/2021 3:07 PM, Loftus, Ciara wrote:
>>
>> On 2/24/2021 11:18 AM, Ciara Loftus wrote:
>>> Prior to this the max size was 32 which was unnecessarily
>>> small.
>>
>> Can you please describe the impact? Why changed from 32 to 512?
>> I assume this is to improve the performance but can you please explicitly
>> document it in the commit log?
> 
> Indeed - improved performance due to bulk operations and fewer ring accesses and syscalls.
> The value 512 was arbitrary. I will change this to the default ring size as defined by libbpf (2048) in v2.
> Will update the commit log with this info.
> 
>>
>>> Also enforce the max batch size for TX for both
>>> copy and zero copy modes. Prior to this only copy mode
>>> enforced the max size.
>>>
>>
>> By enforcing, the PMD ignores the user provided burst value if it is more than
>> PMS supported MAX, and this ignoring is done in silent. Also there is no way
>> to
>> discover this MAX value without checking the code.
>>
>> Overall, why this max values are required at all? After quick check I can see
>> they are used for some bulk operations, which I assume can be eliminated,
>> what
>> do you think?
> 
> We need to size some arrays at compile time with this max value.
> 
> Instead of removing the bulk operations which may impact performance, how about taking an approach where we split batches that are > 2048 into smaller batches and still handle all the packets instead of discarding those > 2048. Something like what's done in ixgbe for example:
> http://code.dpdk.org/dpdk/v21.02/source/drivers/net/ixgbe/ixgbe_rxtx.c#L318
>` 

If there is no reasonable way to eliminate the fix sized arrays, above 
suggestion looks good.

Thanks,
ferruh

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 3/3] net/af_xdp: preferred busy polling
  2021-03-01 16:32     ` Ferruh Yigit
@ 2021-03-04 12:26       ` Loftus, Ciara
  2021-03-08 15:54         ` Loftus, Ciara
  0 siblings, 1 reply; 27+ messages in thread
From: Loftus, Ciara @ 2021-03-04 12:26 UTC (permalink / raw)
  To: Yigit, Ferruh; +Cc: dev

> 
> On 2/24/2021 11:18 AM, Ciara Loftus wrote:
> > This commit introduces support for preferred busy polling
> > to the AF_XDP PMD. This feature aims to improve single-core
> > performance for AF_XDP sockets under heavy load.
> >
> > A new vdev arg is introduced called 'busy_budget' whose default
> > value is 64. busy_budget is the value supplied to the kernel
> > with the SO_BUSY_POLL_BUDGET socket option and represents the
> > busy-polling NAPI budget. To set the budget to a different value
> > eg. 256:
> >
> > --vdev=net_af_xdp0,iface=eth0,busy_budget=256
> >
> > Preferred busy polling is enabled by default provided a kernel with
> > version >= v5.11 is in use. To disable it, set the budget to zero.
> >
> > The following settings are also strongly recommended to be used in
> > conjunction with this feature:
> >
> > echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >
> > .. where eth0 is the interface being used by the PMD.
> >
> > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> > ---
> >   doc/guides/nics/af_xdp.rst          | 38 ++++++++++++-
> >   drivers/net/af_xdp/compat.h         | 13 +++++
> >   drivers/net/af_xdp/rte_eth_af_xdp.c | 85
> ++++++++++++++++++++++++-----
> >   3 files changed, 121 insertions(+), 15 deletions(-)
> 
> Can you please update the release notes too to announce the feature?

Will do.

> 
> <...>
> 
> 
> > @@ -39,3 +39,16 @@ create_shared_socket(struct xsk_socket **xsk_ptr
> __rte_unused,
> >   	return -1;
> >   }
> >   #endif
> > +
> > +#ifdef XDP_USE_NEED_WAKEUP
> > +static int
> > +syscall_needed(struct xsk_ring_prod *q, uint32_t busy_budget)
> > +{
> > +	return xsk_ring_prod__needs_wakeup(q) | busy_budget;
> > +}
> > +#else
> > +syscall_needed(struct xsk_ring_prod *q __rte_unused, uint32_t
> busy_budget)
> > +{
> > +	return busy_budget;
> > +}
> 
> Is the return type missing in the definition?

Yes. Thanks for spotting this.

> 
> Also for the case when both 'XDP_USE_NEED_WAKEUP' &
> 'SO_PREFER_BUSY_POLL' this
> function will always return '0', but current implementation doesn't know this
> in
> the compile time and compiler can't optimize for it, do you think does it make
> sense to do this optimization?

It makes sense assuming the compile environment and run environment are the same.
However you make a valid point below. If the environments are different, we can't make this optimization because we can’t rely on the presence of the flags alone to tell us if these features are supported. More below.

> 
> <...>
> 
> > @@ -1628,8 +1670,22 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device
> *dev)
> >   		return -EINVAL;
> >   	}
> >
> > +#ifdef SO_PREFER_BUSY_POLL
> > +	busy_budget = busy_budget == -1 ?
> ETH_AF_XDP_DFLT_BUSY_BUDGET :
> > +					busy_budget;
> > +	if (!busy_budget)
> > +		AF_XDP_LOG(ERR, "Preferred busy polling disabled\n");
> 
> Is this an error case? What do you think changing the log level to DEBUG or
> INFO?

+1 for INFO

> 
> Also how these compile time flags will work if the compiled environment and
> run
> environment kernel version are different and incompatible?

This is a valid point. Right now if XDP_USE_NEED_WAKEUP is defined we assume the functionality is available in the kernel. If it's not, socket creation will fail and we abort. Perhaps we should retry socket creation without the flag if we get this failure. And record if support is available in a runtime variable. I'll look at adding this as another patch to the v2 series.

> 
> Overall can it be possible to detect the support on runtime via 'setsockopt()'
> without compile time macros and eliminate the compile time flags? Does it
> make
> sense?

I think this can be done. It should allow applications compiled on older kernels without SO_PREFER_BUSY_POLL run on newer kernels with the feature.
I will tackle this in the v2.

Thanks for your feedback!

Ciara


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH 3/3] net/af_xdp: preferred busy polling
  2021-03-04 12:26       ` Loftus, Ciara
@ 2021-03-08 15:54         ` Loftus, Ciara
  0 siblings, 0 replies; 27+ messages in thread
From: Loftus, Ciara @ 2021-03-08 15:54 UTC (permalink / raw)
  To: Loftus, Ciara, Yigit, Ferruh; +Cc: dev

> >
> > On 2/24/2021 11:18 AM, Ciara Loftus wrote:
> > > This commit introduces support for preferred busy polling
> > > to the AF_XDP PMD. This feature aims to improve single-core
> > > performance for AF_XDP sockets under heavy load.
> > >
> > > A new vdev arg is introduced called 'busy_budget' whose default
> > > value is 64. busy_budget is the value supplied to the kernel
> > > with the SO_BUSY_POLL_BUDGET socket option and represents the
> > > busy-polling NAPI budget. To set the budget to a different value
> > > eg. 256:
> > >
> > > --vdev=net_af_xdp0,iface=eth0,busy_budget=256
> > >
> > > Preferred busy polling is enabled by default provided a kernel with
> > > version >= v5.11 is in use. To disable it, set the budget to zero.
> > >
> > > The following settings are also strongly recommended to be used in
> > > conjunction with this feature:
> > >
> > > echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > > echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> > >
> > > .. where eth0 is the interface being used by the PMD.
> > >
> > > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> > > ---
> > >   doc/guides/nics/af_xdp.rst          | 38 ++++++++++++-
> > >   drivers/net/af_xdp/compat.h         | 13 +++++
> > >   drivers/net/af_xdp/rte_eth_af_xdp.c | 85
> > ++++++++++++++++++++++++-----
> > >   3 files changed, 121 insertions(+), 15 deletions(-)
> >
> > Can you please update the release notes too to announce the feature?
> 
> Will do.
> 
> >
> > <...>
> >
> >
> > > @@ -39,3 +39,16 @@ create_shared_socket(struct xsk_socket **xsk_ptr
> > __rte_unused,
> > >   	return -1;
> > >   }
> > >   #endif
> > > +
> > > +#ifdef XDP_USE_NEED_WAKEUP
> > > +static int
> > > +syscall_needed(struct xsk_ring_prod *q, uint32_t busy_budget)
> > > +{
> > > +	return xsk_ring_prod__needs_wakeup(q) | busy_budget;
> > > +}
> > > +#else
> > > +syscall_needed(struct xsk_ring_prod *q __rte_unused, uint32_t
> > busy_budget)
> > > +{
> > > +	return busy_budget;
> > > +}
> >
> > Is the return type missing in the definition?
> 
> Yes. Thanks for spotting this.
> 
> >
> > Also for the case when both 'XDP_USE_NEED_WAKEUP' &
> > 'SO_PREFER_BUSY_POLL' this
> > function will always return '0', but current implementation doesn't know
> this
> > in
> > the compile time and compiler can't optimize for it, do you think does it
> make
> > sense to do this optimization?
> 
> It makes sense assuming the compile environment and run environment are
> the same.
> However you make a valid point below. If the environments are different,
> we can't make this optimization because we can’t rely on the presence of the
> flags alone to tell us if these features are supported. More below.
> 
> >
> > <...>
> >
> > > @@ -1628,8 +1670,22 @@ rte_pmd_af_xdp_probe(struct
> rte_vdev_device
> > *dev)
> > >   		return -EINVAL;
> > >   	}
> > >
> > > +#ifdef SO_PREFER_BUSY_POLL
> > > +	busy_budget = busy_budget == -1 ?
> > ETH_AF_XDP_DFLT_BUSY_BUDGET :
> > > +					busy_budget;
> > > +	if (!busy_budget)
> > > +		AF_XDP_LOG(ERR, "Preferred busy polling disabled\n");
> >
> > Is this an error case? What do you think changing the log level to DEBUG or
> > INFO?
> 
> +1 for INFO
> 
> >
> > Also how these compile time flags will work if the compiled environment
> and
> > run
> > environment kernel version are different and incompatible?
> 
> This is a valid point. Right now if XDP_USE_NEED_WAKEUP is defined we
> assume the functionality is available in the kernel. If it's not, socket creation
> will fail and we abort. Perhaps we should retry socket creation without the
> flag if we get this failure. And record if support is available in a runtime
> variable. I'll look at adding this as another patch to the v2 series.

Hi Ferruh,

I looked at this a little more. For the v2 I'll make sure busy poll can work in these environments with different compile-time and run-time kernels and use setsockopt() to detect support in the kernel.
Since it will require significant changes and validation I'll submit a separate series ensuring the same for the other existing flags (XDP_USE_NEED_WAKEUP / XDP_UMEM_UNALIGNED_CHUNK_FLAG / shared umem).

Thanks,
Ciara

> 
> >
> > Overall can it be possible to detect the support on runtime via
> 'setsockopt()'
> > without compile time macros and eliminate the compile time flags? Does it
> > make
> > sense?
> 
> I think this can be done. It should allow applications compiled on older
> kernels without SO_PREFER_BUSY_POLL run on newer kernels with the
> feature.
> I will tackle this in the v2.
> 
> Thanks for your feedback!
> 
> Ciara


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v2 0/3] AF_XDP Preferred Busy Polling
  2021-02-24 11:18 ` [dpdk-dev] [PATCH 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
                     ` (2 preceding siblings ...)
  2021-02-24 11:18   ` [dpdk-dev] [PATCH 3/3] net/af_xdp: preferred busy polling Ciara Loftus
@ 2021-03-09 10:19   ` Ciara Loftus
  2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 1/3] net/af_xdp: allow bigger batch sizes Ciara Loftus
                       ` (3 more replies)
  3 siblings, 4 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-03-09 10:19 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

Single-core performance of AF_XDP at high loads can be poor because
a heavily loaded NAPI context will never enter or allow for busy-polling.

1C testpmd rxonly (both IRQs and PMD on core 0):
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=1 -- \
--forward-mode=rxonly
0.088Mpps

In order to achieve decent performance at high loads, it is currently
recommended ensure the IRQs for the netdev queue and the core running
the PMD are different.

2C testpmd rxonly (IRQs on core 0, PMD on core 1):
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=0 -- \
--forward-mode=rxonly
19.26Mpps

However using an extra core is of course not ideal. The SO_PREFER_BUSY_POLL
socket option was introduced in kernel v5.11 to help improve 1C performance.
See [1].

This series sets this socket option on xsks created with DPDK (ie. instances of
the AF_XDP PMD) unless explicitly disabled or not supported by the kernel. It
was chosen to be enabled by default in order to bring the AF_XDP PMD in line
with most other PMDs which execute on a single core.

The following system and netdev settings are recommended in conjunction with
busy polling:
echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout

Re-running the 1C test with busy polling support and the above settings:
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=1 -- \
--forward-mode=rxonly
10.45Mpps

A new vdev arg is introduced called 'busy_budget' whose default value is 64.
busy_budget is the value supplied to the kernel with the SO_BUSY_POLL_BUDGET
socket option and represents the busy-polling NAPI budget ie. the number of
packets the kernel will attempt to process in the netdev's NAPI context.

To set the busy budget to 256:
./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=256
14.06Mpps

If you still wish to run using 2 cores (one for PMD once for IRQs) it is
recommended to disable busy polling to achieve optimal 2C performance:
./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=0
19.09Mpps

v1->v2:
* Set batch size to default size of ring (2048)
* Split batches > 2048 into multiples of 2048 or less and process all
packets in the same manner that is done for other drivers eg. ixgbe:
http://code.dpdk.org/dpdk/v21.02/source/drivers/net/ixgbe/ixgbe_rxtx.c#L318
* Update commit log with reasoning behing batching changes
* Update release notes with note on busy polling support
* Fix return type for sycall_needed function when the wakeup flag is not
present
* Apprpriate log leveling
* Set default_*xportconf burst sizes to the default busy budget size (64)
* Detect support for busy polling via setsockopt instead of using the presence
of the flag

RFC->v1:
* Fixed behaviour of busy_budget=0
* Ensure we bail out any of the new setsockopts fail

[1] https://lwn.net/Articles/837010/


Ciara Loftus (3):
  net/af_xdp: allow bigger batch sizes
  net/af_xdp: Use recvfrom() instead of poll()
  net/af_xdp: preferred busy polling

 doc/guides/nics/af_xdp.rst             |  38 ++++-
 doc/guides/rel_notes/release_21_05.rst |   4 +
 drivers/net/af_xdp/compat.h            |  14 ++
 drivers/net/af_xdp/rte_eth_af_xdp.c    | 190 ++++++++++++++++++++++---
 4 files changed, 222 insertions(+), 24 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v2 1/3] net/af_xdp: allow bigger batch sizes
  2021-03-09 10:19   ` [dpdk-dev] [PATCH v2 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
@ 2021-03-09 10:19     ` Ciara Loftus
  2021-03-09 16:33       ` Ferruh Yigit
  2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Ciara Loftus @ 2021-03-09 10:19 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

Prior to this commit, the maximum batch sizes for zero-copy and copy-mode
rx and copy-mode tx were set to 32. Apart from zero-copy tx, the user
could never rx/tx any more than 32 packets at a time and without inspecting
the code the user wouldn't be aware of this.

This commit removes these upper limits placed on the user and instead
sets an internal batch size equal to the default ring size (2048). Batches
larger than this are still processed, however they are split into smaller
batches similar to how it's done in other drivers. This is necessary
because some arrays used during rx/tx need to be sized at compile-time.

Allowing a larger batch size allows for fewer batches and thus larger bulk
operations, fewer ring accesses and fewer syscalls which should yield
improved performance.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 67 ++++++++++++++++++++++++-----
 1 file changed, 57 insertions(+), 10 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 3957227bf0..be524e4784 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -66,8 +66,8 @@ RTE_LOG_REGISTER(af_xdp_logtype, pmd.net.af_xdp, NOTICE);
 #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
 #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
 
-#define ETH_AF_XDP_RX_BATCH_SIZE	32
-#define ETH_AF_XDP_TX_BATCH_SIZE	32
+#define ETH_AF_XDP_RX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
+#define ETH_AF_XDP_TX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
 
 
 struct xsk_umem_info {
@@ -329,8 +329,7 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	struct rte_mbuf *mbufs[ETH_AF_XDP_RX_BATCH_SIZE];
 
 	if (xsk_prod_nb_free(fq, free_thresh) >= free_thresh)
-		(void)reserve_fill_queue(umem, ETH_AF_XDP_RX_BATCH_SIZE,
-					 NULL, fq);
+		(void)reserve_fill_queue(umem, nb_pkts, NULL, fq);
 
 	nb_pkts = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);
 	if (nb_pkts == 0) {
@@ -379,10 +378,8 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 
 static uint16_t
-eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
-	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_RX_BATCH_SIZE);
-
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
 	return af_xdp_rx_zc(queue, bufs, nb_pkts);
 #else
@@ -390,6 +387,32 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 }
 
+static uint16_t
+eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	uint16_t nb_rx;
+
+	if (likely(nb_pkts <= ETH_AF_XDP_RX_BATCH_SIZE))
+		return af_xdp_rx(queue, bufs, nb_pkts);
+
+	/* Split larger batch into smaller batches of size
+	 * ETH_AF_XDP_RX_BATCH_SIZE or less.
+	 */
+	nb_rx = 0;
+	while (nb_pkts) {
+		uint16_t ret, n;
+
+		n = (uint16_t)RTE_MIN(nb_pkts, ETH_AF_XDP_RX_BATCH_SIZE);
+		ret = af_xdp_rx(queue, &bufs[nb_rx], n);
+		nb_rx = (uint16_t)(nb_rx + ret);
+		nb_pkts = (uint16_t)(nb_pkts - ret);
+		if (ret < n)
+			break;
+	}
+
+	return nb_rx;
+}
+
 static void
 pull_umem_cq(struct xsk_umem_info *umem, int size, struct xsk_ring_cons *cq)
 {
@@ -535,8 +558,6 @@ af_xdp_tx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	uint32_t idx_tx;
 	struct xsk_ring_cons *cq = &txq->pair->cq;
 
-	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_TX_BATCH_SIZE);
-
 	pull_umem_cq(umem, nb_pkts, cq);
 
 	nb_pkts = rte_ring_dequeue_bulk(umem->buf_ring, addrs,
@@ -575,6 +596,32 @@ af_xdp_tx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 
 	return nb_pkts;
 }
+
+static uint16_t
+af_xdp_tx_cp_batch(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	uint16_t nb_tx;
+
+	if (likely(nb_pkts <= ETH_AF_XDP_TX_BATCH_SIZE))
+		return af_xdp_tx_cp(queue, bufs, nb_pkts);
+
+	nb_tx = 0;
+	while (nb_pkts) {
+		uint16_t ret, n;
+
+		/* Split larger batch into smaller batches of size
+		 * ETH_AF_XDP_TX_BATCH_SIZE or less.
+		 */
+		n = (uint16_t)RTE_MIN(nb_pkts, ETH_AF_XDP_TX_BATCH_SIZE);
+		ret = af_xdp_tx_cp(queue, &bufs[nb_tx], n);
+		nb_tx = (uint16_t)(nb_tx + ret);
+		nb_pkts = (uint16_t)(nb_pkts - ret);
+		if (ret < n)
+			break;
+	}
+
+	return nb_tx;
+}
 #endif
 
 static uint16_t
@@ -583,7 +630,7 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
 	return af_xdp_tx_zc(queue, bufs, nb_pkts);
 #else
-	return af_xdp_tx_cp(queue, bufs, nb_pkts);
+	return af_xdp_tx_cp_batch(queue, bufs, nb_pkts);
 #endif
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v2 2/3] net/af_xdp: Use recvfrom() instead of poll()
  2021-03-09 10:19   ` [dpdk-dev] [PATCH v2 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 1/3] net/af_xdp: allow bigger batch sizes Ciara Loftus
@ 2021-03-09 10:19     ` Ciara Loftus
  2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 3/3] net/af_xdp: preferred busy polling Ciara Loftus
  2021-03-10  7:48     ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  3 siblings, 0 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-03-09 10:19 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

poll() is more expensive and requires more tuning
when used with the upcoming busy polling functionality.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index be524e4784..9c0e935cd3 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -263,7 +263,8 @@ af_xdp_rx_zc(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	if (nb_pkts == 0) {
 #if defined(XDP_USE_NEED_WAKEUP)
 		if (xsk_ring_prod__needs_wakeup(fq))
-			(void)poll(rxq->fds, 1, 1000);
+			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
+				MSG_DONTWAIT, NULL, NULL);
 #endif
 
 		return 0;
@@ -335,7 +336,8 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	if (nb_pkts == 0) {
 #if defined(XDP_USE_NEED_WAKEUP)
 		if (xsk_ring_prod__needs_wakeup(fq))
-			(void)poll(rxq->fds, 1, 1000);
+			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
+				MSG_DONTWAIT, NULL, NULL);
 #endif
 		return 0;
 	}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v2 3/3] net/af_xdp: preferred busy polling
  2021-03-09 10:19   ` [dpdk-dev] [PATCH v2 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 1/3] net/af_xdp: allow bigger batch sizes Ciara Loftus
  2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
@ 2021-03-09 10:19     ` Ciara Loftus
  2021-03-09 16:33       ` Ferruh Yigit
  2021-03-10  7:48     ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  3 siblings, 1 reply; 27+ messages in thread
From: Ciara Loftus @ 2021-03-09 10:19 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

This commit introduces support for preferred busy polling
to the AF_XDP PMD. This feature aims to improve single-core
performance for AF_XDP sockets under heavy load.

A new vdev arg is introduced called 'busy_budget' whose default
value is 64. busy_budget is the value supplied to the kernel
with the SO_BUSY_POLL_BUDGET socket option and represents the
busy-polling NAPI budget. To set the budget to a different value
eg. 256:

--vdev=net_af_xdp0,iface=eth0,busy_budget=256

Preferred busy polling is enabled by default provided a kernel with
version >= v5.11 is in use. To disable it, set the budget to zero.

The following settings are also strongly recommended to be used in
conjunction with this feature:

echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout

.. where eth0 is the interface being used by the PMD.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 doc/guides/nics/af_xdp.rst             |  38 +++++++-
 doc/guides/rel_notes/release_21_05.rst |   4 +
 drivers/net/af_xdp/compat.h            |  14 +++
 drivers/net/af_xdp/rte_eth_af_xdp.c    | 117 ++++++++++++++++++++++---
 4 files changed, 161 insertions(+), 12 deletions(-)

diff --git a/doc/guides/nics/af_xdp.rst b/doc/guides/nics/af_xdp.rst
index 5ed24374f8..8bf40b5f0f 100644
--- a/doc/guides/nics/af_xdp.rst
+++ b/doc/guides/nics/af_xdp.rst
@@ -35,6 +35,7 @@ The following options can be provided to set up an af_xdp port in DPDK.
 *   ``shared_umem`` - PMD will attempt to share UMEM with others (optional,
     default 0);
 *   ``xdp_prog`` - path to custom xdp program (optional, default none);
+*   ``busy_budget`` - busy polling budget (optional, default 64);
 
 Prerequisites
 -------------
@@ -51,6 +52,7 @@ This is a Linux-specific PMD, thus the following prerequisites apply:
 *  For shared_umem, it requires kernel version v5.10 or later and libbpf version
    v0.2.0 or later.
 *  For 32-bit OS, a kernel with version 5.4 or later is required.
+*  For busy polling, kernel version v5.11 or later is required.
 
 Set up an af_xdp interface
 -----------------------------
@@ -107,4 +109,38 @@ Limitations
   .. code-block:: console
 
     --vdev net_af_xdp0,iface=ens786f1,shared_umem=1 \
-    --vdev net_af_xdp1,iface=ens786f2,shared_umem=1 \
\ No newline at end of file
+    --vdev net_af_xdp1,iface=ens786f2,shared_umem=1 \
+
+- **Preferred Busy Polling**
+
+  The SO_PREFER_BUSY_POLL socket option was introduced in kernel v5.11. It can
+  deliver a performance improvement for sockets with heavy traffic loads and
+  can significantly improve single-core performance in this context.
+
+  The feature is enabled by default in the AF_XDP PMD. To disable it, set the
+  'busy_budget' vdevarg to zero:
+
+  .. code-block:: console
+
+    --vdev net_af_xdp0,iface=ens786f1,busy_budget=0
+
+  The default 'busy_budget' is 64 and it represents the number of packets the
+  kernel will attempt to process in the netdev's NAPI context. You can change
+  the value for example to 256 like so:
+
+  .. code-block:: console
+
+    --vdev net_af_xdp0,iface=ens786f1,busy_budget=256
+
+  It is also strongly recommended to set the following for optimal performance:
+
+  .. code-block:: console
+
+    echo 2 | sudo tee /sys/class/net/ens786f1/napi_defer_hard_irqs
+    echo 200000 | sudo tee /sys/class/net/ens786f1/gro_flush_timeout
+
+  The above defers interrupts for interface ens786f1 and instead schedules its
+  NAPI context from a watchdog timer instead of from softirqs. More information
+  on this feature can be found at [1].
+
+  [1] https://lwn.net/Articles/837010/
\ No newline at end of file
diff --git a/doc/guides/rel_notes/release_21_05.rst b/doc/guides/rel_notes/release_21_05.rst
index 23f7f0bff9..2d4794aa21 100644
--- a/doc/guides/rel_notes/release_21_05.rst
+++ b/doc/guides/rel_notes/release_21_05.rst
@@ -70,6 +70,10 @@ New Features
   * Added command to display Rx queue used descriptor count.
     ``show port (port_id) rxq (queue_id) desc used count``
 
+* **Updated the AF_XDP driver.**
+
+  * Added support for preferred busy polling.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/af_xdp/compat.h b/drivers/net/af_xdp/compat.h
index 7aa40d522e..545c8aa395 100644
--- a/drivers/net/af_xdp/compat.h
+++ b/drivers/net/af_xdp/compat.h
@@ -39,3 +39,17 @@ create_shared_socket(struct xsk_socket **xsk_ptr __rte_unused,
 	return -1;
 }
 #endif
+
+#ifdef XDP_USE_NEED_WAKEUP
+static int
+syscall_needed(struct xsk_ring_prod *q, uint32_t busy_budget)
+{
+	return xsk_ring_prod__needs_wakeup(q) | busy_budget;
+}
+#else
+static int
+syscall_needed(struct xsk_ring_prod *q __rte_unused, uint32_t busy_budget)
+{
+	return busy_budget;
+}
+#endif
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 9c0e935cd3..4953525484 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -41,6 +41,13 @@
 
 #include "compat.h"
 
+#ifndef SO_PREFER_BUSY_POLL
+#define SO_PREFER_BUSY_POLL 69
+#endif
+#ifndef SO_BUSY_POLL_BUDGET
+#define SO_BUSY_POLL_BUDGET 70
+#endif
+
 
 #ifndef SOL_XDP
 #define SOL_XDP 283
@@ -65,6 +72,8 @@ RTE_LOG_REGISTER(af_xdp_logtype, pmd.net.af_xdp, NOTICE);
 #define ETH_AF_XDP_DFLT_NUM_DESCS	XSK_RING_CONS__DEFAULT_NUM_DESCS
 #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
 #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
+#define ETH_AF_XDP_DFLT_BUSY_BUDGET	64
+#define ETH_AF_XDP_DFLT_BUSY_TIMEOUT	20
 
 #define ETH_AF_XDP_RX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
 #define ETH_AF_XDP_TX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
@@ -100,6 +109,7 @@ struct pkt_rx_queue {
 	struct pkt_tx_queue *pair;
 	struct pollfd fds[1];
 	int xsk_queue_idx;
+	uint32_t busy_budget;
 };
 
 struct tx_stats {
@@ -140,6 +150,7 @@ struct pmd_internals {
 #define ETH_AF_XDP_QUEUE_COUNT_ARG		"queue_count"
 #define ETH_AF_XDP_SHARED_UMEM_ARG		"shared_umem"
 #define ETH_AF_XDP_PROG_ARG			"xdp_prog"
+#define ETH_AF_XDP_BUDGET_ARG			"busy_budget"
 
 static const char * const valid_arguments[] = {
 	ETH_AF_XDP_IFACE_ARG,
@@ -147,6 +158,7 @@ static const char * const valid_arguments[] = {
 	ETH_AF_XDP_QUEUE_COUNT_ARG,
 	ETH_AF_XDP_SHARED_UMEM_ARG,
 	ETH_AF_XDP_PROG_ARG,
+	ETH_AF_XDP_BUDGET_ARG,
 	NULL
 };
 
@@ -261,11 +273,9 @@ af_xdp_rx_zc(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	nb_pkts = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);
 
 	if (nb_pkts == 0) {
-#if defined(XDP_USE_NEED_WAKEUP)
-		if (xsk_ring_prod__needs_wakeup(fq))
+		if (syscall_needed(&rxq->fq, rxq->busy_budget))
 			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
 				MSG_DONTWAIT, NULL, NULL);
-#endif
 
 		return 0;
 	}
@@ -446,9 +456,7 @@ kick_tx(struct pkt_tx_queue *txq, struct xsk_ring_cons *cq)
 
 	pull_umem_cq(umem, XSK_RING_CONS__DEFAULT_NUM_DESCS, cq);
 
-#if defined(XDP_USE_NEED_WAKEUP)
-	if (xsk_ring_prod__needs_wakeup(&txq->tx))
-#endif
+	if (syscall_needed(&txq->tx, txq->pair->busy_budget))
 		while (send(xsk_socket__fd(txq->pair->xsk), NULL,
 			    0, MSG_DONTWAIT) < 0) {
 			/* some thing unexpected */
@@ -795,6 +803,8 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 	dev_info->max_mtu = ETH_AF_XDP_FRAME_SIZE - XDP_PACKET_HEADROOM;
 #endif
 
+	dev_info->default_rxportconf.burst_size = ETH_AF_XDP_DFLT_BUSY_BUDGET;
+	dev_info->default_txportconf.burst_size = ETH_AF_XDP_DFLT_BUSY_BUDGET;
 	dev_info->default_rxportconf.nb_queues = 1;
 	dev_info->default_txportconf.nb_queues = 1;
 	dev_info->default_rxportconf.ring_size = ETH_AF_XDP_DFLT_NUM_DESCS;
@@ -1142,6 +1152,65 @@ load_custom_xdp_prog(const char *prog_path, int if_index)
 	return 0;
 }
 
+/* Detect support for busy polling through setsockopt(). */
+static int
+configure_preferred_busy_poll(struct pkt_rx_queue *rxq)
+{
+	int sock_opt = 1;
+	int fd = xsk_socket__fd(rxq->xsk);
+	int ret = 0;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
+			(void *)&sock_opt, sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(DEBUG, "Failed to set SO_PREFER_BUSY_POLL\n");
+		goto err_prefer;
+	}
+
+	sock_opt = ETH_AF_XDP_DFLT_BUSY_TIMEOUT;
+	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void *)&sock_opt,
+			sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(DEBUG, "Failed to set SO_BUSY_POLL\n");
+		goto err_timeout;
+	}
+
+	sock_opt = rxq->busy_budget;
+	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
+			(void *)&sock_opt, sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(DEBUG, "Failed to set SO_BUSY_POLL_BUDGET\n");
+	} else {
+		AF_XDP_LOG(INFO, "Busy polling budget set to: %u\n",
+					rxq->busy_budget);
+		return 0;
+	}
+
+	/* setsockopt failure - attempt to restore xsk to default state and
+	 * proceed without busy polling support.
+	 */
+	sock_opt = 0;
+	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void *)&sock_opt,
+			sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(ERR, "Failed to unset SO_BUSY_POLL\n");
+		return -1;
+	}
+
+err_timeout:
+	sock_opt = 0;
+	ret = setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
+			(void *)&sock_opt, sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(ERR, "Failed to unset SO_PREFER_BUSY_POLL\n");
+		return -1;
+	}
+
+err_prefer:
+	rxq->busy_budget = 0;
+	return 0;
+}
+
 static int
 xsk_configure(struct pmd_internals *internals, struct pkt_rx_queue *rxq,
 	      int ring_size)
@@ -1200,6 +1269,15 @@ xsk_configure(struct pmd_internals *internals, struct pkt_rx_queue *rxq,
 		goto err;
 	}
 #endif
+
+	if (rxq->busy_budget) {
+		ret = configure_preferred_busy_poll(rxq);
+		if (ret) {
+			AF_XDP_LOG(ERR, "Failed configure busy polling.\n");
+			goto err;
+		}
+	}
+
 	ret = reserve_fill_queue(rxq->umem, reserve_size, fq_bufs, &rxq->fq);
 	if (ret) {
 		xsk_socket__delete(rxq->xsk);
@@ -1257,6 +1335,9 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
 		goto err;
 	}
 
+	if (!rxq->busy_budget)
+		AF_XDP_LOG(DEBUG, "Preferred busy polling not enabled\n");
+
 	rxq->fds[0].fd = xsk_socket__fd(rxq->xsk);
 	rxq->fds[0].events = POLLIN;
 
@@ -1465,7 +1546,8 @@ xdp_get_channels_info(const char *if_name, int *max_queues,
 
 static int
 parse_parameters(struct rte_kvargs *kvlist, char *if_name, int *start_queue,
-			int *queue_cnt, int *shared_umem, char *prog_path)
+			int *queue_cnt, int *shared_umem, char *prog_path,
+			int *busy_budget)
 {
 	int ret;
 
@@ -1496,6 +1578,11 @@ parse_parameters(struct rte_kvargs *kvlist, char *if_name, int *start_queue,
 	if (ret < 0)
 		goto free_kvlist;
 
+	ret = rte_kvargs_process(kvlist, ETH_AF_XDP_BUDGET_ARG,
+				&parse_integer_arg, busy_budget);
+	if (ret < 0)
+		goto free_kvlist;
+
 free_kvlist:
 	rte_kvargs_free(kvlist);
 	return ret;
@@ -1534,7 +1621,7 @@ get_iface_info(const char *if_name,
 static struct rte_eth_dev *
 init_internals(struct rte_vdev_device *dev, const char *if_name,
 		int start_queue_idx, int queue_cnt, int shared_umem,
-		const char *prog_path)
+		const char *prog_path, int busy_budget)
 {
 	const char *name = rte_vdev_device_name(dev);
 	const unsigned int numa_node = dev->device.numa_node;
@@ -1595,6 +1682,7 @@ init_internals(struct rte_vdev_device *dev, const char *if_name,
 		internals->rx_queues[i].pair = &internals->tx_queues[i];
 		internals->rx_queues[i].xsk_queue_idx = start_queue_idx + i;
 		internals->tx_queues[i].xsk_queue_idx = start_queue_idx + i;
+		internals->rx_queues[i].busy_budget = busy_budget;
 	}
 
 	ret = get_iface_info(if_name, &internals->eth_addr,
@@ -1638,6 +1726,7 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 	int xsk_queue_cnt = ETH_AF_XDP_DFLT_QUEUE_COUNT;
 	int shared_umem = 0;
 	char prog_path[PATH_MAX] = {'\0'};
+	int busy_budget = -1;
 	struct rte_eth_dev *eth_dev = NULL;
 	const char *name;
 
@@ -1667,7 +1756,8 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 		dev->device.numa_node = rte_socket_id();
 
 	if (parse_parameters(kvlist, if_name, &xsk_start_queue_idx,
-			     &xsk_queue_cnt, &shared_umem, prog_path) < 0) {
+			     &xsk_queue_cnt, &shared_umem, prog_path,
+			     &busy_budget) < 0) {
 		AF_XDP_LOG(ERR, "Invalid kvargs value\n");
 		return -EINVAL;
 	}
@@ -1677,8 +1767,12 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 		return -EINVAL;
 	}
 
+	busy_budget = busy_budget == -1 ? ETH_AF_XDP_DFLT_BUSY_BUDGET :
+					busy_budget;
+
 	eth_dev = init_internals(dev, if_name, xsk_start_queue_idx,
-					xsk_queue_cnt, shared_umem, prog_path);
+					xsk_queue_cnt, shared_umem, prog_path,
+					busy_budget);
 	if (eth_dev == NULL) {
 		AF_XDP_LOG(ERR, "Failed to init internals\n");
 		return -1;
@@ -1723,4 +1817,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_af_xdp,
 			      "start_queue=<int> "
 			      "queue_count=<int> "
 			      "shared_umem=<int> "
-			      "xdp_prog=<string> ");
+			      "xdp_prog=<string> "
+			      "busy_budget=<int>");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/3] net/af_xdp: preferred busy polling
  2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 3/3] net/af_xdp: preferred busy polling Ciara Loftus
@ 2021-03-09 16:33       ` Ferruh Yigit
  2021-03-10  7:55         ` Loftus, Ciara
  0 siblings, 1 reply; 27+ messages in thread
From: Ferruh Yigit @ 2021-03-09 16:33 UTC (permalink / raw)
  To: Ciara Loftus, dev

On 3/9/2021 10:19 AM, Ciara Loftus wrote:
> This commit introduces support for preferred busy polling
> to the AF_XDP PMD. This feature aims to improve single-core
> performance for AF_XDP sockets under heavy load.
> 
> A new vdev arg is introduced called 'busy_budget' whose default
> value is 64. busy_budget is the value supplied to the kernel
> with the SO_BUSY_POLL_BUDGET socket option and represents the
> busy-polling NAPI budget. To set the budget to a different value
> eg. 256:
> 
> --vdev=net_af_xdp0,iface=eth0,busy_budget=256
> 
> Preferred busy polling is enabled by default provided a kernel with
> version >= v5.11 is in use. To disable it, set the budget to zero.
> 
> The following settings are also strongly recommended to be used in
> conjunction with this feature:
> 
> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> 
> .. where eth0 is the interface being used by the PMD.
> 
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>

<...>

> --- a/doc/guides/rel_notes/release_21_05.rst
> +++ b/doc/guides/rel_notes/release_21_05.rst
> @@ -70,6 +70,10 @@ New Features
>     * Added command to display Rx queue used descriptor count.
>       ``show port (port_id) rxq (queue_id) desc used count``
>   
> +* **Updated the AF_XDP driver.**
> +
> +  * Added support for preferred busy polling.
> +
>   

Can you please move the update after above the testpmd updates?
For more details the expected order is in the section comment.

> +static int
> +configure_preferred_busy_poll(struct pkt_rx_queue *rxq)
> +{
> +	int sock_opt = 1;
> +	int fd = xsk_socket__fd(rxq->xsk);
> +	int ret = 0;
> +
> +	ret = setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
> +			(void *)&sock_opt, sizeof(sock_opt));
> +	if (ret < 0) {
> +		AF_XDP_LOG(DEBUG, "Failed to set SO_PREFER_BUSY_POLL\n");
> +		goto err_prefer;
> +	}
> +
> +	sock_opt = ETH_AF_XDP_DFLT_BUSY_TIMEOUT;
> +	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void *)&sock_opt,
> +			sizeof(sock_opt));
> +	if (ret < 0) {
> +		AF_XDP_LOG(DEBUG, "Failed to set SO_BUSY_POLL\n");

[1]

> +		goto err_timeout;
> +	}
> +
> +	sock_opt = rxq->busy_budget;
> +	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
> +			(void *)&sock_opt, sizeof(sock_opt));
> +	if (ret < 0) {
> +		AF_XDP_LOG(DEBUG, "Failed to set SO_BUSY_POLL_BUDGET\n");

In above [1] and here, shouldn't the function return error, even the rollback is 
successful.
I am thinking a case an invalid 'busy_budget' provided, like a very big number 
or negative value.

> +	} else {
> +		AF_XDP_LOG(INFO, "Busy polling budget set to: %u\n",
> +					rxq->busy_budget);
> +		return 0;
> +	}
> +
> +	/* setsockopt failure - attempt to restore xsk to default state and
> +	 * proceed without busy polling support.
> +	 */
> +	sock_opt = 0;
> +	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void *)&sock_opt,
> +			sizeof(sock_opt));
> +	if (ret < 0) {
> +		AF_XDP_LOG(ERR, "Failed to unset SO_BUSY_POLL\n");
> +		return -1;
> +	}
> +
> +err_timeout:
> +	sock_opt = 0;
> +	ret = setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
> +			(void *)&sock_opt, sizeof(sock_opt));
> +	if (ret < 0) {
> +		AF_XDP_LOG(ERR, "Failed to unset SO_PREFER_BUSY_POLL\n");
> +		return -1;
> +	}
> +
> +err_prefer:
> +	rxq->busy_budget = 0;
> +	return 0;
> +}
> +

<...>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/3] net/af_xdp: allow bigger batch sizes
  2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 1/3] net/af_xdp: allow bigger batch sizes Ciara Loftus
@ 2021-03-09 16:33       ` Ferruh Yigit
  2021-03-10  7:21         ` Loftus, Ciara
  0 siblings, 1 reply; 27+ messages in thread
From: Ferruh Yigit @ 2021-03-09 16:33 UTC (permalink / raw)
  To: Ciara Loftus, dev

On 3/9/2021 10:19 AM, Ciara Loftus wrote:
> Prior to this commit, the maximum batch sizes for zero-copy and copy-mode
> rx and copy-mode tx were set to 32. Apart from zero-copy tx, the user
> could never rx/tx any more than 32 packets at a time and without inspecting
> the code the user wouldn't be aware of this.
> 
> This commit removes these upper limits placed on the user and instead
> sets an internal batch size equal to the default ring size (2048). Batches
> larger than this are still processed, however they are split into smaller
> batches similar to how it's done in other drivers. This is necessary
> because some arrays used during rx/tx need to be sized at compile-time.
> 
> Allowing a larger batch size allows for fewer batches and thus larger bulk
> operations, fewer ring accesses and fewer syscalls which should yield
> improved performance.
> 
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> ---
>   drivers/net/af_xdp/rte_eth_af_xdp.c | 67 ++++++++++++++++++++++++-----
>   1 file changed, 57 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
> index 3957227bf0..be524e4784 100644
> --- a/drivers/net/af_xdp/rte_eth_af_xdp.c
> +++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
> @@ -66,8 +66,8 @@ RTE_LOG_REGISTER(af_xdp_logtype, pmd.net.af_xdp, NOTICE);
>   #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
>   #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
>   
> -#define ETH_AF_XDP_RX_BATCH_SIZE	32
> -#define ETH_AF_XDP_TX_BATCH_SIZE	32
> +#define ETH_AF_XDP_RX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
> +#define ETH_AF_XDP_TX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
> 

Just to double check, can there be a library version that these macros not 
defined, should it be checked?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/3] net/af_xdp: allow bigger batch sizes
  2021-03-09 16:33       ` Ferruh Yigit
@ 2021-03-10  7:21         ` Loftus, Ciara
  0 siblings, 0 replies; 27+ messages in thread
From: Loftus, Ciara @ 2021-03-10  7:21 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

> 
> On 3/9/2021 10:19 AM, Ciara Loftus wrote:
> > Prior to this commit, the maximum batch sizes for zero-copy and copy-
> mode
> > rx and copy-mode tx were set to 32. Apart from zero-copy tx, the user
> > could never rx/tx any more than 32 packets at a time and without
> inspecting
> > the code the user wouldn't be aware of this.
> >
> > This commit removes these upper limits placed on the user and instead
> > sets an internal batch size equal to the default ring size (2048). Batches
> > larger than this are still processed, however they are split into smaller
> > batches similar to how it's done in other drivers. This is necessary
> > because some arrays used during rx/tx need to be sized at compile-time.
> >
> > Allowing a larger batch size allows for fewer batches and thus larger bulk
> > operations, fewer ring accesses and fewer syscalls which should yield
> > improved performance.
> >
> > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> > ---
> >   drivers/net/af_xdp/rte_eth_af_xdp.c | 67
> ++++++++++++++++++++++++-----
> >   1 file changed, 57 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c
> b/drivers/net/af_xdp/rte_eth_af_xdp.c
> > index 3957227bf0..be524e4784 100644
> > --- a/drivers/net/af_xdp/rte_eth_af_xdp.c
> > +++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
> > @@ -66,8 +66,8 @@ RTE_LOG_REGISTER(af_xdp_logtype,
> pmd.net.af_xdp, NOTICE);
> >   #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
> >   #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
> >
> > -#define ETH_AF_XDP_RX_BATCH_SIZE	32
> > -#define ETH_AF_XDP_TX_BATCH_SIZE	32
> > +#define ETH_AF_XDP_RX_BATCH_SIZE
> 	XSK_RING_CONS__DEFAULT_NUM_DESCS
> > +#define ETH_AF_XDP_TX_BATCH_SIZE
> 	XSK_RING_CONS__DEFAULT_NUM_DESCS
> >
> 
> Just to double check, can there be a library version that these macros not
> defined, should it be checked?

There can't be a library version with AF_XDP support without these macros, as they've been around since the very beginning:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1cad078842396
We only build the PMD if xsk.h is available, and since these macros have been in the file since it has existed, we're safe.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling
  2021-03-09 10:19   ` [dpdk-dev] [PATCH v2 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
                       ` (2 preceding siblings ...)
  2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 3/3] net/af_xdp: preferred busy polling Ciara Loftus
@ 2021-03-10  7:48     ` Ciara Loftus
  2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 1/3] net/af_xdp: allow bigger batch sizes Ciara Loftus
                         ` (3 more replies)
  3 siblings, 4 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-03-10  7:48 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

Single-core performance of AF_XDP at high loads can be poor because
a heavily loaded NAPI context will never enter or allow for busy-polling.

1C testpmd rxonly (both IRQs and PMD on core 0):
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=1 -- \
--forward-mode=rxonly
0.088Mpps

In order to achieve decent performance at high loads, it is currently
recommended ensure the IRQs for the netdev queue and the core running
the PMD are different.

2C testpmd rxonly (IRQs on core 0, PMD on core 1):
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=0 -- \
--forward-mode=rxonly
19.26Mpps

However using an extra core is of course not ideal. The SO_PREFER_BUSY_POLL
socket option was introduced in kernel v5.11 to help improve 1C performance.
See [1].

This series sets this socket option on xsks created with DPDK (ie. instances of
the AF_XDP PMD) unless explicitly disabled or not supported by the kernel. It
was chosen to be enabled by default in order to bring the AF_XDP PMD in line
with most other PMDs which execute on a single core.

The following system and netdev settings are recommended in conjunction with
busy polling:
echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout

Re-running the 1C test with busy polling support and the above settings:
./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=1 -- \
--forward-mode=rxonly
10.45Mpps

A new vdev arg is introduced called 'busy_budget' whose default value is 64.
busy_budget is the value supplied to the kernel with the SO_BUSY_POLL_BUDGET
socket option and represents the busy-polling NAPI budget ie. the number of
packets the kernel will attempt to process in the netdev's NAPI context.

To set the busy budget to 256:
./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=256
14.06Mpps

If you still wish to run using 2 cores (one for PMD once for IRQs) it is
recommended to disable busy polling to achieve optimal 2C performance:
./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=0
19.09Mpps

v2->v3:
* Moved release notes update to correct location
* Changed busy_budget from uint32_t to int since this is the type expected
by setsockopt
* Validate busy_budget arg is <= UINT16_MAX during parse

v1->v2:
* Set batch size to default size of ring (2048)
* Split batches > 2048 into multiples of 2048 or less and process all
packets in the same manner that is done for other drivers eg. ixgbe:
http://code.dpdk.org/dpdk/v21.02/source/drivers/net/ixgbe/ixgbe_rxtx.c#L318
* Update commit log with reasoning behing batching changes
* Update release notes with note on busy polling support
* Fix return type for sycall_needed function when the wakeup flag is not
present
* Apprpriate log leveling
* Set default_*xportconf burst sizes to the default busy budget size (64)
* Detect support for busy polling via setsockopt instead of using the presence
of the flag

RFC->v1:
* Fixed behaviour of busy_budget=0
* Ensure we bail out any of the new setsockopts fail

[1] https://lwn.net/Articles/837010/


Ciara Loftus (3):
  net/af_xdp: allow bigger batch sizes
  net/af_xdp: Use recvfrom() instead of poll()
  net/af_xdp: preferred busy polling

 doc/guides/nics/af_xdp.rst             |  38 ++++-
 doc/guides/rel_notes/release_21_05.rst |   4 +
 drivers/net/af_xdp/compat.h            |  14 ++
 drivers/net/af_xdp/rte_eth_af_xdp.c    | 208 ++++++++++++++++++++++---
 4 files changed, 240 insertions(+), 24 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 1/3] net/af_xdp: allow bigger batch sizes
  2021-03-10  7:48     ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
@ 2021-03-10  7:48       ` Ciara Loftus
  2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-03-10  7:48 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

Prior to this commit, the maximum batch sizes for zero-copy and copy-mode
rx and copy-mode tx were set to 32. Apart from zero-copy tx, the user
could never rx/tx any more than 32 packets at a time and without inspecting
the code the user wouldn't be aware of this.

This commit removes these upper limits placed on the user and instead
sets an internal batch size equal to the default ring size (2048). Batches
larger than this are still processed, however they are split into smaller
batches similar to how it's done in other drivers. This is necessary
because some arrays used during rx/tx need to be sized at compile-time.

Allowing a larger batch size allows for fewer batches and thus larger bulk
operations, fewer ring accesses and fewer syscalls which should yield
improved performance.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 67 ++++++++++++++++++++++++-----
 1 file changed, 57 insertions(+), 10 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 3957227bf0..be524e4784 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -66,8 +66,8 @@ RTE_LOG_REGISTER(af_xdp_logtype, pmd.net.af_xdp, NOTICE);
 #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
 #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
 
-#define ETH_AF_XDP_RX_BATCH_SIZE	32
-#define ETH_AF_XDP_TX_BATCH_SIZE	32
+#define ETH_AF_XDP_RX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
+#define ETH_AF_XDP_TX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
 
 
 struct xsk_umem_info {
@@ -329,8 +329,7 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	struct rte_mbuf *mbufs[ETH_AF_XDP_RX_BATCH_SIZE];
 
 	if (xsk_prod_nb_free(fq, free_thresh) >= free_thresh)
-		(void)reserve_fill_queue(umem, ETH_AF_XDP_RX_BATCH_SIZE,
-					 NULL, fq);
+		(void)reserve_fill_queue(umem, nb_pkts, NULL, fq);
 
 	nb_pkts = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);
 	if (nb_pkts == 0) {
@@ -379,10 +378,8 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 
 static uint16_t
-eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
-	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_RX_BATCH_SIZE);
-
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
 	return af_xdp_rx_zc(queue, bufs, nb_pkts);
 #else
@@ -390,6 +387,32 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 }
 
+static uint16_t
+eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	uint16_t nb_rx;
+
+	if (likely(nb_pkts <= ETH_AF_XDP_RX_BATCH_SIZE))
+		return af_xdp_rx(queue, bufs, nb_pkts);
+
+	/* Split larger batch into smaller batches of size
+	 * ETH_AF_XDP_RX_BATCH_SIZE or less.
+	 */
+	nb_rx = 0;
+	while (nb_pkts) {
+		uint16_t ret, n;
+
+		n = (uint16_t)RTE_MIN(nb_pkts, ETH_AF_XDP_RX_BATCH_SIZE);
+		ret = af_xdp_rx(queue, &bufs[nb_rx], n);
+		nb_rx = (uint16_t)(nb_rx + ret);
+		nb_pkts = (uint16_t)(nb_pkts - ret);
+		if (ret < n)
+			break;
+	}
+
+	return nb_rx;
+}
+
 static void
 pull_umem_cq(struct xsk_umem_info *umem, int size, struct xsk_ring_cons *cq)
 {
@@ -535,8 +558,6 @@ af_xdp_tx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	uint32_t idx_tx;
 	struct xsk_ring_cons *cq = &txq->pair->cq;
 
-	nb_pkts = RTE_MIN(nb_pkts, ETH_AF_XDP_TX_BATCH_SIZE);
-
 	pull_umem_cq(umem, nb_pkts, cq);
 
 	nb_pkts = rte_ring_dequeue_bulk(umem->buf_ring, addrs,
@@ -575,6 +596,32 @@ af_xdp_tx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 
 	return nb_pkts;
 }
+
+static uint16_t
+af_xdp_tx_cp_batch(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	uint16_t nb_tx;
+
+	if (likely(nb_pkts <= ETH_AF_XDP_TX_BATCH_SIZE))
+		return af_xdp_tx_cp(queue, bufs, nb_pkts);
+
+	nb_tx = 0;
+	while (nb_pkts) {
+		uint16_t ret, n;
+
+		/* Split larger batch into smaller batches of size
+		 * ETH_AF_XDP_TX_BATCH_SIZE or less.
+		 */
+		n = (uint16_t)RTE_MIN(nb_pkts, ETH_AF_XDP_TX_BATCH_SIZE);
+		ret = af_xdp_tx_cp(queue, &bufs[nb_tx], n);
+		nb_tx = (uint16_t)(nb_tx + ret);
+		nb_pkts = (uint16_t)(nb_pkts - ret);
+		if (ret < n)
+			break;
+	}
+
+	return nb_tx;
+}
 #endif
 
 static uint16_t
@@ -583,7 +630,7 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #if defined(XDP_UMEM_UNALIGNED_CHUNK_FLAG)
 	return af_xdp_tx_zc(queue, bufs, nb_pkts);
 #else
-	return af_xdp_tx_cp(queue, bufs, nb_pkts);
+	return af_xdp_tx_cp_batch(queue, bufs, nb_pkts);
 #endif
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 2/3] net/af_xdp: Use recvfrom() instead of poll()
  2021-03-10  7:48     ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 1/3] net/af_xdp: allow bigger batch sizes Ciara Loftus
@ 2021-03-10  7:48       ` Ciara Loftus
  2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 3/3] net/af_xdp: preferred busy polling Ciara Loftus
  2021-03-10 17:50       ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ferruh Yigit
  3 siblings, 0 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-03-10  7:48 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

poll() is more expensive and requires more tuning
when used with the upcoming busy polling functionality.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index be524e4784..9c0e935cd3 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -263,7 +263,8 @@ af_xdp_rx_zc(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	if (nb_pkts == 0) {
 #if defined(XDP_USE_NEED_WAKEUP)
 		if (xsk_ring_prod__needs_wakeup(fq))
-			(void)poll(rxq->fds, 1, 1000);
+			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
+				MSG_DONTWAIT, NULL, NULL);
 #endif
 
 		return 0;
@@ -335,7 +336,8 @@ af_xdp_rx_cp(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	if (nb_pkts == 0) {
 #if defined(XDP_USE_NEED_WAKEUP)
 		if (xsk_ring_prod__needs_wakeup(fq))
-			(void)poll(rxq->fds, 1, 1000);
+			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
+				MSG_DONTWAIT, NULL, NULL);
 #endif
 		return 0;
 	}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 3/3] net/af_xdp: preferred busy polling
  2021-03-10  7:48     ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
  2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 1/3] net/af_xdp: allow bigger batch sizes Ciara Loftus
  2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
@ 2021-03-10  7:48       ` Ciara Loftus
  2021-03-10 17:50       ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ferruh Yigit
  3 siblings, 0 replies; 27+ messages in thread
From: Ciara Loftus @ 2021-03-10  7:48 UTC (permalink / raw)
  To: dev; +Cc: Ciara Loftus

This commit introduces support for preferred busy polling
to the AF_XDP PMD. This feature aims to improve single-core
performance for AF_XDP sockets under heavy load.

A new vdev arg is introduced called 'busy_budget' whose default
value is 64. busy_budget is the value supplied to the kernel
with the SO_BUSY_POLL_BUDGET socket option and represents the
busy-polling NAPI budget. To set the budget to a different value
eg. 256:

--vdev=net_af_xdp0,iface=eth0,busy_budget=256

Preferred busy polling is enabled by default provided a kernel with
version >= v5.11 is in use. To disable it, set the budget to zero.

The following settings are also strongly recommended to be used in
conjunction with this feature:

echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout

.. where eth0 is the interface being used by the PMD.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
---
 doc/guides/nics/af_xdp.rst             |  38 ++++++-
 doc/guides/rel_notes/release_21_05.rst |   4 +
 drivers/net/af_xdp/compat.h            |  14 +++
 drivers/net/af_xdp/rte_eth_af_xdp.c    | 135 +++++++++++++++++++++++--
 4 files changed, 179 insertions(+), 12 deletions(-)

diff --git a/doc/guides/nics/af_xdp.rst b/doc/guides/nics/af_xdp.rst
index 5ed24374f8..8bf40b5f0f 100644
--- a/doc/guides/nics/af_xdp.rst
+++ b/doc/guides/nics/af_xdp.rst
@@ -35,6 +35,7 @@ The following options can be provided to set up an af_xdp port in DPDK.
 *   ``shared_umem`` - PMD will attempt to share UMEM with others (optional,
     default 0);
 *   ``xdp_prog`` - path to custom xdp program (optional, default none);
+*   ``busy_budget`` - busy polling budget (optional, default 64);
 
 Prerequisites
 -------------
@@ -51,6 +52,7 @@ This is a Linux-specific PMD, thus the following prerequisites apply:
 *  For shared_umem, it requires kernel version v5.10 or later and libbpf version
    v0.2.0 or later.
 *  For 32-bit OS, a kernel with version 5.4 or later is required.
+*  For busy polling, kernel version v5.11 or later is required.
 
 Set up an af_xdp interface
 -----------------------------
@@ -107,4 +109,38 @@ Limitations
   .. code-block:: console
 
     --vdev net_af_xdp0,iface=ens786f1,shared_umem=1 \
-    --vdev net_af_xdp1,iface=ens786f2,shared_umem=1 \
\ No newline at end of file
+    --vdev net_af_xdp1,iface=ens786f2,shared_umem=1 \
+
+- **Preferred Busy Polling**
+
+  The SO_PREFER_BUSY_POLL socket option was introduced in kernel v5.11. It can
+  deliver a performance improvement for sockets with heavy traffic loads and
+  can significantly improve single-core performance in this context.
+
+  The feature is enabled by default in the AF_XDP PMD. To disable it, set the
+  'busy_budget' vdevarg to zero:
+
+  .. code-block:: console
+
+    --vdev net_af_xdp0,iface=ens786f1,busy_budget=0
+
+  The default 'busy_budget' is 64 and it represents the number of packets the
+  kernel will attempt to process in the netdev's NAPI context. You can change
+  the value for example to 256 like so:
+
+  .. code-block:: console
+
+    --vdev net_af_xdp0,iface=ens786f1,busy_budget=256
+
+  It is also strongly recommended to set the following for optimal performance:
+
+  .. code-block:: console
+
+    echo 2 | sudo tee /sys/class/net/ens786f1/napi_defer_hard_irqs
+    echo 200000 | sudo tee /sys/class/net/ens786f1/gro_flush_timeout
+
+  The above defers interrupts for interface ens786f1 and instead schedules its
+  NAPI context from a watchdog timer instead of from softirqs. More information
+  on this feature can be found at [1].
+
+  [1] https://lwn.net/Articles/837010/
\ No newline at end of file
diff --git a/doc/guides/rel_notes/release_21_05.rst b/doc/guides/rel_notes/release_21_05.rst
index 23f7f0bff9..368e41fcb3 100644
--- a/doc/guides/rel_notes/release_21_05.rst
+++ b/doc/guides/rel_notes/release_21_05.rst
@@ -65,6 +65,10 @@ New Features
 
   * Added support for txgbevf PMD.
 
+* **Updated the AF_XDP driver.**
+
+  * Added support for preferred busy polling.
+
 * **Updated testpmd.**
 
   * Added command to display Rx queue used descriptor count.
diff --git a/drivers/net/af_xdp/compat.h b/drivers/net/af_xdp/compat.h
index 7aa40d522e..545c8aa395 100644
--- a/drivers/net/af_xdp/compat.h
+++ b/drivers/net/af_xdp/compat.h
@@ -39,3 +39,17 @@ create_shared_socket(struct xsk_socket **xsk_ptr __rte_unused,
 	return -1;
 }
 #endif
+
+#ifdef XDP_USE_NEED_WAKEUP
+static int
+syscall_needed(struct xsk_ring_prod *q, uint32_t busy_budget)
+{
+	return xsk_ring_prod__needs_wakeup(q) | busy_budget;
+}
+#else
+static int
+syscall_needed(struct xsk_ring_prod *q __rte_unused, uint32_t busy_budget)
+{
+	return busy_budget;
+}
+#endif
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 9c0e935cd3..a64fef1cf5 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -41,6 +41,13 @@
 
 #include "compat.h"
 
+#ifndef SO_PREFER_BUSY_POLL
+#define SO_PREFER_BUSY_POLL 69
+#endif
+#ifndef SO_BUSY_POLL_BUDGET
+#define SO_BUSY_POLL_BUDGET 70
+#endif
+
 
 #ifndef SOL_XDP
 #define SOL_XDP 283
@@ -65,6 +72,8 @@ RTE_LOG_REGISTER(af_xdp_logtype, pmd.net.af_xdp, NOTICE);
 #define ETH_AF_XDP_DFLT_NUM_DESCS	XSK_RING_CONS__DEFAULT_NUM_DESCS
 #define ETH_AF_XDP_DFLT_START_QUEUE_IDX	0
 #define ETH_AF_XDP_DFLT_QUEUE_COUNT	1
+#define ETH_AF_XDP_DFLT_BUSY_BUDGET	64
+#define ETH_AF_XDP_DFLT_BUSY_TIMEOUT	20
 
 #define ETH_AF_XDP_RX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
 #define ETH_AF_XDP_TX_BATCH_SIZE	XSK_RING_CONS__DEFAULT_NUM_DESCS
@@ -100,6 +109,7 @@ struct pkt_rx_queue {
 	struct pkt_tx_queue *pair;
 	struct pollfd fds[1];
 	int xsk_queue_idx;
+	int busy_budget;
 };
 
 struct tx_stats {
@@ -140,6 +150,7 @@ struct pmd_internals {
 #define ETH_AF_XDP_QUEUE_COUNT_ARG		"queue_count"
 #define ETH_AF_XDP_SHARED_UMEM_ARG		"shared_umem"
 #define ETH_AF_XDP_PROG_ARG			"xdp_prog"
+#define ETH_AF_XDP_BUDGET_ARG			"busy_budget"
 
 static const char * const valid_arguments[] = {
 	ETH_AF_XDP_IFACE_ARG,
@@ -147,6 +158,7 @@ static const char * const valid_arguments[] = {
 	ETH_AF_XDP_QUEUE_COUNT_ARG,
 	ETH_AF_XDP_SHARED_UMEM_ARG,
 	ETH_AF_XDP_PROG_ARG,
+	ETH_AF_XDP_BUDGET_ARG,
 	NULL
 };
 
@@ -261,11 +273,9 @@ af_xdp_rx_zc(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	nb_pkts = xsk_ring_cons__peek(rx, nb_pkts, &idx_rx);
 
 	if (nb_pkts == 0) {
-#if defined(XDP_USE_NEED_WAKEUP)
-		if (xsk_ring_prod__needs_wakeup(fq))
+		if (syscall_needed(&rxq->fq, rxq->busy_budget))
 			recvfrom(xsk_socket__fd(rxq->xsk), NULL, 0,
 				MSG_DONTWAIT, NULL, NULL);
-#endif
 
 		return 0;
 	}
@@ -446,9 +456,7 @@ kick_tx(struct pkt_tx_queue *txq, struct xsk_ring_cons *cq)
 
 	pull_umem_cq(umem, XSK_RING_CONS__DEFAULT_NUM_DESCS, cq);
 
-#if defined(XDP_USE_NEED_WAKEUP)
-	if (xsk_ring_prod__needs_wakeup(&txq->tx))
-#endif
+	if (syscall_needed(&txq->tx, txq->pair->busy_budget))
 		while (send(xsk_socket__fd(txq->pair->xsk), NULL,
 			    0, MSG_DONTWAIT) < 0) {
 			/* some thing unexpected */
@@ -795,6 +803,8 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 	dev_info->max_mtu = ETH_AF_XDP_FRAME_SIZE - XDP_PACKET_HEADROOM;
 #endif
 
+	dev_info->default_rxportconf.burst_size = ETH_AF_XDP_DFLT_BUSY_BUDGET;
+	dev_info->default_txportconf.burst_size = ETH_AF_XDP_DFLT_BUSY_BUDGET;
 	dev_info->default_rxportconf.nb_queues = 1;
 	dev_info->default_txportconf.nb_queues = 1;
 	dev_info->default_rxportconf.ring_size = ETH_AF_XDP_DFLT_NUM_DESCS;
@@ -1142,6 +1152,65 @@ load_custom_xdp_prog(const char *prog_path, int if_index)
 	return 0;
 }
 
+/* Detect support for busy polling through setsockopt(). */
+static int
+configure_preferred_busy_poll(struct pkt_rx_queue *rxq)
+{
+	int sock_opt = 1;
+	int fd = xsk_socket__fd(rxq->xsk);
+	int ret = 0;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
+			(void *)&sock_opt, sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(DEBUG, "Failed to set SO_PREFER_BUSY_POLL\n");
+		goto err_prefer;
+	}
+
+	sock_opt = ETH_AF_XDP_DFLT_BUSY_TIMEOUT;
+	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void *)&sock_opt,
+			sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(DEBUG, "Failed to set SO_BUSY_POLL\n");
+		goto err_timeout;
+	}
+
+	sock_opt = rxq->busy_budget;
+	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
+			(void *)&sock_opt, sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(DEBUG, "Failed to set SO_BUSY_POLL_BUDGET\n");
+	} else {
+		AF_XDP_LOG(INFO, "Busy polling budget set to: %u\n",
+					rxq->busy_budget);
+		return 0;
+	}
+
+	/* setsockopt failure - attempt to restore xsk to default state and
+	 * proceed without busy polling support.
+	 */
+	sock_opt = 0;
+	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void *)&sock_opt,
+			sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(ERR, "Failed to unset SO_BUSY_POLL\n");
+		return -1;
+	}
+
+err_timeout:
+	sock_opt = 0;
+	ret = setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
+			(void *)&sock_opt, sizeof(sock_opt));
+	if (ret < 0) {
+		AF_XDP_LOG(ERR, "Failed to unset SO_PREFER_BUSY_POLL\n");
+		return -1;
+	}
+
+err_prefer:
+	rxq->busy_budget = 0;
+	return 0;
+}
+
 static int
 xsk_configure(struct pmd_internals *internals, struct pkt_rx_queue *rxq,
 	      int ring_size)
@@ -1200,6 +1269,15 @@ xsk_configure(struct pmd_internals *internals, struct pkt_rx_queue *rxq,
 		goto err;
 	}
 #endif
+
+	if (rxq->busy_budget) {
+		ret = configure_preferred_busy_poll(rxq);
+		if (ret) {
+			AF_XDP_LOG(ERR, "Failed configure busy polling.\n");
+			goto err;
+		}
+	}
+
 	ret = reserve_fill_queue(rxq->umem, reserve_size, fq_bufs, &rxq->fq);
 	if (ret) {
 		xsk_socket__delete(rxq->xsk);
@@ -1257,6 +1335,9 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
 		goto err;
 	}
 
+	if (!rxq->busy_budget)
+		AF_XDP_LOG(DEBUG, "Preferred busy polling not enabled\n");
+
 	rxq->fds[0].fd = xsk_socket__fd(rxq->xsk);
 	rxq->fds[0].events = POLLIN;
 
@@ -1363,6 +1444,24 @@ static const struct eth_dev_ops ops = {
 	.stats_reset = eth_stats_reset,
 };
 
+/** parse busy_budget argument */
+static int
+parse_budget_arg(const char *key __rte_unused,
+		  const char *value, void *extra_args)
+{
+	int *i = (int *)extra_args;
+	char *end;
+
+	*i = strtol(value, &end, 10);
+	if (*i < 0 || *i > UINT16_MAX) {
+		AF_XDP_LOG(ERR, "Invalid busy_budget %i, must be >= 0 and <= %u\n",
+				*i, UINT16_MAX);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 /** parse integer from integer argument */
 static int
 parse_integer_arg(const char *key __rte_unused,
@@ -1465,7 +1564,8 @@ xdp_get_channels_info(const char *if_name, int *max_queues,
 
 static int
 parse_parameters(struct rte_kvargs *kvlist, char *if_name, int *start_queue,
-			int *queue_cnt, int *shared_umem, char *prog_path)
+			int *queue_cnt, int *shared_umem, char *prog_path,
+			int *busy_budget)
 {
 	int ret;
 
@@ -1496,6 +1596,11 @@ parse_parameters(struct rte_kvargs *kvlist, char *if_name, int *start_queue,
 	if (ret < 0)
 		goto free_kvlist;
 
+	ret = rte_kvargs_process(kvlist, ETH_AF_XDP_BUDGET_ARG,
+				&parse_budget_arg, busy_budget);
+	if (ret < 0)
+		goto free_kvlist;
+
 free_kvlist:
 	rte_kvargs_free(kvlist);
 	return ret;
@@ -1534,7 +1639,7 @@ get_iface_info(const char *if_name,
 static struct rte_eth_dev *
 init_internals(struct rte_vdev_device *dev, const char *if_name,
 		int start_queue_idx, int queue_cnt, int shared_umem,
-		const char *prog_path)
+		const char *prog_path, int busy_budget)
 {
 	const char *name = rte_vdev_device_name(dev);
 	const unsigned int numa_node = dev->device.numa_node;
@@ -1595,6 +1700,7 @@ init_internals(struct rte_vdev_device *dev, const char *if_name,
 		internals->rx_queues[i].pair = &internals->tx_queues[i];
 		internals->rx_queues[i].xsk_queue_idx = start_queue_idx + i;
 		internals->tx_queues[i].xsk_queue_idx = start_queue_idx + i;
+		internals->rx_queues[i].busy_budget = busy_budget;
 	}
 
 	ret = get_iface_info(if_name, &internals->eth_addr,
@@ -1638,6 +1744,7 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 	int xsk_queue_cnt = ETH_AF_XDP_DFLT_QUEUE_COUNT;
 	int shared_umem = 0;
 	char prog_path[PATH_MAX] = {'\0'};
+	int busy_budget = -1;
 	struct rte_eth_dev *eth_dev = NULL;
 	const char *name;
 
@@ -1667,7 +1774,8 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 		dev->device.numa_node = rte_socket_id();
 
 	if (parse_parameters(kvlist, if_name, &xsk_start_queue_idx,
-			     &xsk_queue_cnt, &shared_umem, prog_path) < 0) {
+			     &xsk_queue_cnt, &shared_umem, prog_path,
+			     &busy_budget) < 0) {
 		AF_XDP_LOG(ERR, "Invalid kvargs value\n");
 		return -EINVAL;
 	}
@@ -1677,8 +1785,12 @@ rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
 		return -EINVAL;
 	}
 
+	busy_budget = busy_budget == -1 ? ETH_AF_XDP_DFLT_BUSY_BUDGET :
+					busy_budget;
+
 	eth_dev = init_internals(dev, if_name, xsk_start_queue_idx,
-					xsk_queue_cnt, shared_umem, prog_path);
+					xsk_queue_cnt, shared_umem, prog_path,
+					busy_budget);
 	if (eth_dev == NULL) {
 		AF_XDP_LOG(ERR, "Failed to init internals\n");
 		return -1;
@@ -1723,4 +1835,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_af_xdp,
 			      "start_queue=<int> "
 			      "queue_count=<int> "
 			      "shared_umem=<int> "
-			      "xdp_prog=<string> ");
+			      "xdp_prog=<string> "
+			      "busy_budget=<int>");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/3] net/af_xdp: preferred busy polling
  2021-03-09 16:33       ` Ferruh Yigit
@ 2021-03-10  7:55         ` Loftus, Ciara
  0 siblings, 0 replies; 27+ messages in thread
From: Loftus, Ciara @ 2021-03-10  7:55 UTC (permalink / raw)
  To: Yigit, Ferruh, dev

> On 3/9/2021 10:19 AM, Ciara Loftus wrote:
> > This commit introduces support for preferred busy polling
> > to the AF_XDP PMD. This feature aims to improve single-core
> > performance for AF_XDP sockets under heavy load.
> >
> > A new vdev arg is introduced called 'busy_budget' whose default
> > value is 64. busy_budget is the value supplied to the kernel
> > with the SO_BUSY_POLL_BUDGET socket option and represents the
> > busy-polling NAPI budget. To set the budget to a different value
> > eg. 256:
> >
> > --vdev=net_af_xdp0,iface=eth0,busy_budget=256
> >
> > Preferred busy polling is enabled by default provided a kernel with
> > version >= v5.11 is in use. To disable it, set the budget to zero.
> >
> > The following settings are also strongly recommended to be used in
> > conjunction with this feature:
> >
> > echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> > echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> >
> > .. where eth0 is the interface being used by the PMD.
> >
> > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> 
> <...>
> 
> > --- a/doc/guides/rel_notes/release_21_05.rst
> > +++ b/doc/guides/rel_notes/release_21_05.rst
> > @@ -70,6 +70,10 @@ New Features
> >     * Added command to display Rx queue used descriptor count.
> >       ``show port (port_id) rxq (queue_id) desc used count``
> >
> > +* **Updated the AF_XDP driver.**
> > +
> > +  * Added support for preferred busy polling.
> > +
> >
> 
> Can you please move the update after above the testpmd updates?
> For more details the expected order is in the section comment.
> 
> > +static int
> > +configure_preferred_busy_poll(struct pkt_rx_queue *rxq)
> > +{
> > +	int sock_opt = 1;
> > +	int fd = xsk_socket__fd(rxq->xsk);
> > +	int ret = 0;
> > +
> > +	ret = setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
> > +			(void *)&sock_opt, sizeof(sock_opt));
> > +	if (ret < 0) {
> > +		AF_XDP_LOG(DEBUG, "Failed to set
> SO_PREFER_BUSY_POLL\n");
> > +		goto err_prefer;
> > +	}
> > +
> > +	sock_opt = ETH_AF_XDP_DFLT_BUSY_TIMEOUT;
> > +	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void
> *)&sock_opt,
> > +			sizeof(sock_opt));
> > +	if (ret < 0) {
> > +		AF_XDP_LOG(DEBUG, "Failed to set SO_BUSY_POLL\n");
> 
> [1]
> 
> > +		goto err_timeout;
> > +	}
> > +
> > +	sock_opt = rxq->busy_budget;
> > +	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
> > +			(void *)&sock_opt, sizeof(sock_opt));
> > +	if (ret < 0) {
> > +		AF_XDP_LOG(DEBUG, "Failed to set
> SO_BUSY_POLL_BUDGET\n");
> 
> In above [1] and here, shouldn't the function return error, even the rollback
> is
> successful.
> I am thinking a case an invalid 'busy_budget' provided, like a very big number
> or negative value.

How about introducing a check when parsing the argument at init and failing then, instead of here?
In that case if we fail here it should not be due to invalid value. It would be due to insufficient permissions.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/sock.c?h=v5.11#n1150
In that case I think issuing a log, rolling back and continuing with setup would be best, instead of returning an error and aborting and forcing the user to explicitly disable busy polling via busy_budget=0 in order to get the PMD to initialize.

> 
> > +	} else {
> > +		AF_XDP_LOG(INFO, "Busy polling budget set to: %u\n",
> > +					rxq->busy_budget);
> > +		return 0;
> > +	}
> > +
> > +	/* setsockopt failure - attempt to restore xsk to default state and
> > +	 * proceed without busy polling support.
> > +	 */
> > +	sock_opt = 0;
> > +	ret = setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, (void
> *)&sock_opt,
> > +			sizeof(sock_opt));
> > +	if (ret < 0) {
> > +		AF_XDP_LOG(ERR, "Failed to unset SO_BUSY_POLL\n");
> > +		return -1;
> > +	}
> > +
> > +err_timeout:
> > +	sock_opt = 0;
> > +	ret = setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
> > +			(void *)&sock_opt, sizeof(sock_opt));
> > +	if (ret < 0) {
> > +		AF_XDP_LOG(ERR, "Failed to unset
> SO_PREFER_BUSY_POLL\n");
> > +		return -1;
> > +	}
> > +
> > +err_prefer:
> > +	rxq->busy_budget = 0;
> > +	return 0;
> > +}
> > +
> 
> <...>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling
  2021-03-10  7:48     ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
                         ` (2 preceding siblings ...)
  2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 3/3] net/af_xdp: preferred busy polling Ciara Loftus
@ 2021-03-10 17:50       ` Ferruh Yigit
  3 siblings, 0 replies; 27+ messages in thread
From: Ferruh Yigit @ 2021-03-10 17:50 UTC (permalink / raw)
  To: Ciara Loftus, dev

On 3/10/2021 7:48 AM, Ciara Loftus wrote:
> Single-core performance of AF_XDP at high loads can be poor because
> a heavily loaded NAPI context will never enter or allow for busy-polling.
> 
> 1C testpmd rxonly (both IRQs and PMD on core 0):
> ./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=1 -- \
> --forward-mode=rxonly
> 0.088Mpps
> 
> In order to achieve decent performance at high loads, it is currently
> recommended ensure the IRQs for the netdev queue and the core running
> the PMD are different.
> 
> 2C testpmd rxonly (IRQs on core 0, PMD on core 1):
> ./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=0 -- \
> --forward-mode=rxonly
> 19.26Mpps
> 
> However using an extra core is of course not ideal. The SO_PREFER_BUSY_POLL
> socket option was introduced in kernel v5.11 to help improve 1C performance.
> See [1].
> 
> This series sets this socket option on xsks created with DPDK (ie. instances of
> the AF_XDP PMD) unless explicitly disabled or not supported by the kernel. It
> was chosen to be enabled by default in order to bring the AF_XDP PMD in line
> with most other PMDs which execute on a single core.
> 
> The following system and netdev settings are recommended in conjunction with
> busy polling:
> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 200000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
> 
> Re-running the 1C test with busy polling support and the above settings:
> ./dpdk-testpmd -l 0-1 --vdev=net_af_xdp0,iface=eth0 --main-lcore=1 -- \
> --forward-mode=rxonly
> 10.45Mpps
> 
> A new vdev arg is introduced called 'busy_budget' whose default value is 64.
> busy_budget is the value supplied to the kernel with the SO_BUSY_POLL_BUDGET
> socket option and represents the busy-polling NAPI budget ie. the number of
> packets the kernel will attempt to process in the netdev's NAPI context.
> 
> To set the busy budget to 256:
> ./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=256
> 14.06Mpps
> 
> If you still wish to run using 2 cores (one for PMD once for IRQs) it is
> recommended to disable busy polling to achieve optimal 2C performance:
> ./dpdk-testpmd --vdev=net_af_xdp0,iface=eth0,busy_budget=0
> 19.09Mpps
> 
> v2->v3:
> * Moved release notes update to correct location
> * Changed busy_budget from uint32_t to int since this is the type expected
> by setsockopt
> * Validate busy_budget arg is <= UINT16_MAX during parse
> 
> v1->v2:
> * Set batch size to default size of ring (2048)
> * Split batches > 2048 into multiples of 2048 or less and process all
> packets in the same manner that is done for other drivers eg. ixgbe:
> http://code.dpdk.org/dpdk/v21.02/source/drivers/net/ixgbe/ixgbe_rxtx.c#L318
> * Update commit log with reasoning behing batching changes
> * Update release notes with note on busy polling support
> * Fix return type for sycall_needed function when the wakeup flag is not
> present
> * Apprpriate log leveling
> * Set default_*xportconf burst sizes to the default busy budget size (64)
> * Detect support for busy polling via setsockopt instead of using the presence
> of the flag
> 
> RFC->v1:
> * Fixed behaviour of busy_budget=0
> * Ensure we bail out any of the new setsockopts fail
> 
> [1] https://lwn.net/Articles/837010/
> 
> 
> Ciara Loftus (3):
>    net/af_xdp: allow bigger batch sizes
>    net/af_xdp: Use recvfrom() instead of poll()
>    net/af_xdp: preferred busy polling
> 

for series,
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>

Series applied to dpdk-next-net/main, thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-03-10 17:51 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-18  9:23 [dpdk-dev] [PATCH RFC 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 1/3] net/af_xdp: Increase max batch size to 512 Ciara Loftus
2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
2021-02-18  9:23 ` [dpdk-dev] [PATCH RFC 3/3] net/af_xdp: preferred busy polling Ciara Loftus
2021-02-24 11:18 ` [dpdk-dev] [PATCH 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
2021-02-24 11:18   ` [dpdk-dev] [PATCH 1/3] net/af_xdp: Increase max batch size to 512 Ciara Loftus
2021-03-01 16:31     ` Ferruh Yigit
2021-03-03 15:07       ` Loftus, Ciara
2021-03-03 15:38         ` Ferruh Yigit
2021-02-24 11:18   ` [dpdk-dev] [PATCH 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
2021-02-24 11:18   ` [dpdk-dev] [PATCH 3/3] net/af_xdp: preferred busy polling Ciara Loftus
2021-03-01 16:32     ` Ferruh Yigit
2021-03-04 12:26       ` Loftus, Ciara
2021-03-08 15:54         ` Loftus, Ciara
2021-03-09 10:19   ` [dpdk-dev] [PATCH v2 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 1/3] net/af_xdp: allow bigger batch sizes Ciara Loftus
2021-03-09 16:33       ` Ferruh Yigit
2021-03-10  7:21         ` Loftus, Ciara
2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
2021-03-09 10:19     ` [dpdk-dev] [PATCH v2 3/3] net/af_xdp: preferred busy polling Ciara Loftus
2021-03-09 16:33       ` Ferruh Yigit
2021-03-10  7:55         ` Loftus, Ciara
2021-03-10  7:48     ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ciara Loftus
2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 1/3] net/af_xdp: allow bigger batch sizes Ciara Loftus
2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 2/3] net/af_xdp: Use recvfrom() instead of poll() Ciara Loftus
2021-03-10  7:48       ` [dpdk-dev] [PATCH v3 3/3] net/af_xdp: preferred busy polling Ciara Loftus
2021-03-10 17:50       ` [dpdk-dev] [PATCH v3 0/3] AF_XDP Preferred Busy Polling Ferruh Yigit

DPDK patches and discussions

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://inbox.dpdk.org/dev/0 dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dev dev/ https://inbox.dpdk.org/dev \
		dev@dpdk.org
	public-inbox-index dev

Example config snippet for mirrors.
Newsgroup available over NNTP:
	nntp://inbox.dpdk.org/inbox.dpdk.dev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git