DPDK patches and discussions
* [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
@ 2021-12-24 16:46 Feifei Wang
  2021-12-24 16:46 ` [RFC PATCH v1 1/4] net/i40e: enable direct re-arm mode Feifei Wang
                   ` (7 more replies)
  0 siblings, 8 replies; 67+ messages in thread
From: Feifei Wang @ 2021-12-24 16:46 UTC (permalink / raw)
  Cc: dev, nd, Feifei Wang

Currently, the transmit side frees buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers, resulting in 32*8=256B of stores to the
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of loads from the lcore cache.

This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache, i.e. the buffers freed on the transmit side are placed
directly into the receive side's software ring. This avoids the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache.
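
As a simplified sketch (following the i40e field names used in patch 1/4;
the real code also refills the Rx descriptors and handles ring wrap-around),
the core of the idea is:

  /* Hand buffers that have completed transmission straight to the Rx
   * software ring, bypassing the mempool per-lcore cache.
   */
  uint16_t n = RTE_I40E_RXQ_REARM_THRESH;
  struct i40e_tx_entry *txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
  struct i40e_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
  uint16_t i;

  for (i = 0; i < n; i++)
          rxep[i].mbuf = txep[i].mbuf; /* no free to / alloc from the cache */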

However, this solution poses several constraints:

1)The receive queue needs to know which transmit queue it should take
the buffers from. The application logic decides which transmit port to
use to send out the packets. In many use cases the NIC might have a
single port ([1], [2], [3]), in which case a given transmit queue is
always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
is easy to configure.

If the NIC has 2 ports (there are several references), then we will have
a 1:2 (Rx queue : Tx queue) mapping, which is still easy to configure.
However, if this is generalized to 'N' ports, the configuration can
become lengthy. Moreover, the PMD would have to scan a list of transmit
queues to pull the buffers from.

2)The other factor that needs to be considered is 'run-to-completion' vs
'pipeline' models. In the run-to-completion model, the receive side and
the transmit side run on the same lcore serially. In the pipeline model,
the receive side and the transmit side might run on different lcores in
parallel. This requires locking, and is not supported at this point.

3)Tx and Rx buffers must be from the same mempool. We must also ensure
that the number of buffers freed on the Tx side equals the number
re-armed on the Rx side:
(txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
Otherwise 'tx_next_dd' cannot be updated correctly in direct re-arm mode.
'tx_next_dd' is used to compute the free location in the Tx sw-ring; it
points to the last descriptor of the batch that will be freed next time,
so it must advance by exactly the re-arm threshold on each free.
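
For reference, the Tx-side counter update used by the direct re-arm path in
patch 1/4 is sketched below (uint16_t casts omitted); it shows why the two
thresholds must match:

  /* after handing RTE_I40E_RXQ_REARM_THRESH buffers to the Rx sw-ring */
  txq->nb_tx_free += RTE_I40E_RXQ_REARM_THRESH; /* buffers freed from Tx */
  txq->tx_next_dd += RTE_I40E_RXQ_REARM_THRESH; /* next DD bit to check  */
  if (txq->tx_next_dd >= txq->nb_tx_desc)       /* wrap around the ring  */
          txq->tx_next_dd = RTE_I40E_RXQ_REARM_THRESH - 1;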

Current status in this RFC:
1)An API is added to allow for mapping a TX queue to an RX queue.
  Currently it supports 1:1 mapping.
2)The i40e driver is changed to do the direct re-arm of the receive
  side.
3)L3fwd application is hacked to do the mapping for the following command:
  one core two flows case:
  $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
  -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
  where:
  Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
  Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
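
  In code, this mapping boils down to the following two calls in L3fwd
  (taken from patch 4/4):

    /* arguments: rx_port, rx_queue, tx_port, tx_queue */
    rte_eth_direct_rxrearm_map(0, 0, 1, 0);
    rte_eth_direct_rxrearm_map(1, 0, 0, 0);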

Testing status:
1)Tested L3fwd with the above command:
The testing results for L3fwd are as follows:
-------------------------------------------------------------------
N1SDP:
Base performance (with this patch)   With direct re-arm mode enabled
      0%                                  +14.1%

Ampere Altra:
Base performance (with this patch)   With direct re-arm mode enabled
      0%                                  +17.1%
-------------------------------------------------------------------
This patch does not affect the performance of the normal mode. With
direct re-arm mode enabled, performance improves by 14% - 17% on N1SDP
and Ampere Altra.

Feedback requested:
1) Has anyone done any similar experiments, any lessons learnt?
2) Feedback on API

Next steps:
1) Update the code to support 1:N (Rx : Tx) mapping
2) Automate the configuration in L3fwd sample application

Reference:
[1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
[2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
[3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g

Feifei Wang (4):
  net/i40e: enable direct re-arm mode
  ethdev: add API for direct re-arm mode
  net/i40e: add direct re-arm mode internal API
  examples/l3fwd: give an example for direct rearm mode

 drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
 drivers/net/i40e/i40e_rxtx.h          |   4 +
 drivers/net/i40e/i40e_rxtx_vec_neon.c | 149 +++++++++++++++++++++++++-
 examples/l3fwd/main.c                 |   3 +
 lib/ethdev/ethdev_driver.h            |  15 +++
 lib/ethdev/rte_ethdev.c               |  14 +++
 lib/ethdev/rte_ethdev.h               |  31 ++++++
 lib/ethdev/version.map                |   3 +
 8 files changed, 251 insertions(+), 2 deletions(-)

-- 
2.25.1



* [RFC PATCH v1 1/4] net/i40e: enable direct re-arm mode
  2021-12-24 16:46 [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Feifei Wang
@ 2021-12-24 16:46 ` Feifei Wang
  2021-12-24 16:46 ` [RFC PATCH v1 2/4] ethdev: add API for " Feifei Wang
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2021-12-24 16:46 UTC (permalink / raw)
  To: Beilei Xing, Ruifeng Wang; +Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli

Enable direct re-arm mode for the i40e driver. This patch supports the
case where the mapped Rx and Tx queues are handled by the same single
lcore.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/i40e/i40e_rxtx.h          |   4 +
 drivers/net/i40e/i40e_rxtx_vec_neon.c | 149 +++++++++++++++++++++++++-
 2 files changed, 151 insertions(+), 2 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 5e6eecc501..1fdf4305f4 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -102,6 +102,8 @@ struct i40e_rx_queue {
 
 	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
 	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
+	uint16_t direct_rxrearm_port; /**< device Tx port ID for direct re-arm mode */
+	uint16_t direct_rxrearm_queue; /**< Tx queue index for direct re-arm mode */
 	uint64_t mbuf_initializer; /**< value to init mbufs */
 
 	uint16_t port_id; /**< device port ID */
@@ -121,6 +123,8 @@ struct i40e_rx_queue {
 	uint16_t rx_using_sse; /**<flag indicate the usage of vPMD for rx */
 	uint8_t dcb_tc;         /**< Traffic class of rx queue */
 	uint64_t offloads; /**< Rx offload flags of RTE_ETH_RX_OFFLOAD_* */
+	/** 0 if direct re-arm mode disabled, 1 when enabled */
+	bool direct_rxrearm_enable;
 	const struct rte_memzone *mz;
 };
 
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index b951ea2dc3..72bac3fb40 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -77,6 +77,147 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
 	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
 }
 
+static inline void
+i40e_rxq_rearm_direct_single(struct i40e_rx_queue *rxq)
+{
+	struct rte_eth_dev *dev;
+	struct i40e_tx_queue *txq;
+	volatile union i40e_rx_desc *rxdp;
+	struct i40e_tx_entry *txep;
+	struct i40e_rx_entry *rxep;
+	uint16_t tx_port_id, tx_queue_id;
+	uint16_t rx_id;
+	struct rte_mbuf *mb0, *mb1, *m;
+	uint64x2_t dma_addr0, dma_addr1;
+	uint64x2_t zero = vdupq_n_u64(0);
+	uint64_t paddr;
+	uint16_t i, n;
+	uint16_t nb_rearm = 0;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	tx_port_id = rxq->direct_rxrearm_port;
+	tx_queue_id = rxq->direct_rxrearm_queue;
+	dev = &rte_eth_devices[tx_port_id];
+	txq = dev->data->tx_queues[tx_queue_id];
+
+	/* tx_rs_thresh must be equal to
+	 * RTE_I40E_RXQ_REARM_THRESH in
+	 * direct re-arm mode, because
+	 * tx_next_dd is advanced by the
+	 * number of buffers freed on
+	 * each call
+	 */
+	n = RTE_I40E_RXQ_REARM_THRESH;
+
+	if (txq->nb_tx_free < txq->tx_free_thresh) {
+		/* check DD bits on threshold descriptor */
+		if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+				rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
+			goto mempool_bulk;
+		}
+
+		/* first buffer to free from S/W ring is at index
+		 * tx_next_dd - (tx_rs_thresh-1)
+		 */
+		txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+
+		if (txq->offloads & DEV_TX_OFFLOAD_MBUF_FAST_FREE) {
+			/* directly put mbufs from Tx to Rx,
+			 * and initialize the mbufs in vector,
+			 * process 2 mbufs in one loop
+			 */
+			for (i = 0; i < n; i += 2, rxep += 2, txep += 2) {
+				rxep[0].mbuf = txep[0].mbuf;
+				rxep[1].mbuf = txep[1].mbuf;
+
+				/* Initialize rxdp descs */
+				mb0 = txep[0].mbuf;
+				mb1 = txep[1].mbuf;
+
+				paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
+				dma_addr0 = vdupq_n_u64(paddr);
+				/* flush desc with pa dma_addr */
+				vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
+
+				paddr = mb1->buf_iova + RTE_PKTMBUF_HEADROOM;
+				dma_addr1 = vdupq_n_u64(paddr);
+				/* flush desc with pa dma_addr */
+				vst1q_u64((uint64_t *)&rxdp++->read, dma_addr1);
+			}
+		} else {
+			for (i = 0; i < n; i++) {
+				m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+				if (m != NULL) {
+					rxep[i].mbuf = m;
+
+					/* Initialize rxdp descs */
+					paddr = m->buf_iova + RTE_PKTMBUF_HEADROOM;
+					dma_addr0 = vdupq_n_u64(paddr);
+					/* flush desc with pa dma_addr */
+					vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
+					nb_rearm++;
+				}
+			}
+			n = nb_rearm;
+		}
+
+		/* update counters for Tx */
+		txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + RTE_I40E_RXQ_REARM_THRESH);
+		txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + RTE_I40E_RXQ_REARM_THRESH);
+		if (txq->tx_next_dd >= txq->nb_tx_desc)
+			txq->tx_next_dd = (uint16_t)(RTE_I40E_RXQ_REARM_THRESH - 1);
+	} else {
+mempool_bulk:
+		/* if TX did not free bufs into Rx sw-ring,
+		 * get new bufs from mempool
+		 */
+		if (unlikely(rte_mempool_get_bulk(rxq->mp, (void *)rxep, n) < 0)) {
+			if (rxq->rxrearm_nb + n >= rxq->nb_rx_desc) {
+				for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
+					rxep[i].mbuf = &rxq->fake_mbuf;
+					vst1q_u64((uint64_t *)&rxdp[i].read, zero);
+				}
+			}
+			rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed += n;
+			return;
+		}
+
+		/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+		for (i = 0; i < n; i += 2, rxep += 2) {
+			mb0 = rxep[0].mbuf;
+			mb1 = rxep[1].mbuf;
+
+			paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
+			dma_addr0 = vdupq_n_u64(paddr);
+			/* flush desc with pa dma_addr */
+			vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
+
+			paddr = mb1->buf_iova + RTE_PKTMBUF_HEADROOM;
+			dma_addr1 = vdupq_n_u64(paddr);
+			/* flush desc with pa dma_addr */
+			vst1q_u64((uint64_t *)&rxdp++->read, dma_addr1);
+		}
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += n;
+	rx_id = rxq->rxrearm_start - 1;
+
+	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+		rxq->rxrearm_start = 0;
+		rx_id = rxq->nb_rx_desc - 1;
+	}
+
+	rxq->rxrearm_nb -= n;
+
+	rte_io_wmb();
+	/* Update the tail pointer on the NIC */
+	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
 static inline void
 desc_to_olflags_v(struct i40e_rx_queue *rxq, uint64x2_t descs[4],
 		  struct rte_mbuf **rx_pkts)
@@ -244,8 +385,12 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 	/* See if we need to rearm the RX queue - gives the prefetch a bit
 	 * of time to act
 	 */
-	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
-		i40e_rxq_rearm(rxq);
+	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
+		if (rxq->direct_rxrearm_enable)
+			i40e_rxq_rearm_direct_single(rxq);
+		else
+			i40e_rxq_rearm(rxq);
+	}
 
 	/* Before we start moving massive data around, check to see if
 	 * there is actually a packet available
-- 
2.25.1



* [RFC PATCH v1 2/4] ethdev: add API for direct re-arm mode
  2021-12-24 16:46 [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Feifei Wang
  2021-12-24 16:46 ` [RFC PATCH v1 1/4] net/i40e: enable direct re-arm mode Feifei Wang
@ 2021-12-24 16:46 ` Feifei Wang
  2021-12-24 19:38   ` Stephen Hemminger
  2021-12-24 16:46 ` [RFC PATCH v1 3/4] net/i40e: add direct re-arm mode internal API Feifei Wang
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2021-12-24 16:46 UTC (permalink / raw)
  To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella
  Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

Add an API for enabling direct re-arm mode and for mapping Rx and Tx
queues. Currently, the API supports 1:1 (Tx queue : Rx queue) mapping.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
 lib/ethdev/rte_ethdev.c    | 14 ++++++++++++++
 lib/ethdev/rte_ethdev.h    | 31 +++++++++++++++++++++++++++++++
 lib/ethdev/version.map     |  3 +++
 4 files changed, 63 insertions(+)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index d95605a355..87bb287a3f 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -476,6 +476,16 @@ typedef int (*eth_rx_enable_intr_t)(struct rte_eth_dev *dev,
 typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
 				    uint16_t rx_queue_id);
 
+/** @internal Enable direct rearm of a receive queue of an Ethernet device. */
+typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
+						uint16_t queue_id);
+
+/** @internal Map an Rx queue to a Tx queue for direct re-arm mode. */
+typedef int (*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
+					uint16_t rx_queue_id,
+					uint16_t tx_port_id,
+					uint16_t tx_queue_id);
+
 /** @internal Release memory resources allocated by given Rx/Tx queue. */
 typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
 				    uint16_t queue_id);
@@ -1069,6 +1079,11 @@ struct eth_dev_ops {
 	/** Disable Rx queue interrupt */
 	eth_rx_disable_intr_t      rx_queue_intr_disable;
 
+	/** Enable Rx queue direct rearm mode */
+	eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
+	/** Map Rx/Tx queue for direct rearm mode */
+	eth_rx_direct_rearm_map_t  rx_queue_direct_rearm_map;
+
 	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device Tx queue */
 	eth_queue_release_t        tx_queue_release; /**< Release Tx queue */
 	eth_tx_done_cleanup_t      tx_done_cleanup;/**< Free Tx ring mbufs */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a1d475a292..fd13e1af41 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2485,6 +2485,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t port_id, uint16_t tx_queue_id,
 	return eth_err(port_id, ret);
 }
 
+int
+rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
+		uint16_t tx_port_id, uint16_t tx_queue_id)
+{
+	struct rte_eth_dev *dev;
+
+	dev = &rte_eth_devices[rx_port_id];
+	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
+	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
+			tx_port_id, tx_queue_id);
+
+	return 0;
+}
+
 int
 rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
 {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index fa299c8ad7..6a94dc4af4 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5073,6 +5073,37 @@ __rte_experimental
 int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
 				       struct rte_eth_hairpin_cap *cap);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable direct re-arm mode. In this mode the RX queue will be re-armed using
+ * buffers that have completed transmission on the transmit side.
+ *
+ * @note
+ *   It is assumed that the buffers that have completed transmission belong
+ *   to the mempool used at the receive side, and have refcnt = 1.
+ *
+ * @param rx_port_id
+ *   Port identifying the receive side.
+ * @param rx_queue_id
+ *   The index of the receive queue identifying the receive side.
+ *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param tx_port_id
+ *   Port identifying the transmit side.
+ * @param tx_queue_id
+ *   The index of the transmit queue identifying the transmit side.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ *
+ * @return
+ *   - (0) if successful.
+ */
+__rte_experimental
+int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
+			       uint16_t tx_port_id, uint16_t tx_queue_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this structure may change without prior notice.
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index c2fb0669a4..6540f08698 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -256,6 +256,9 @@ EXPERIMENTAL {
 	rte_flow_flex_item_create;
 	rte_flow_flex_item_release;
 	rte_flow_pick_transfer_proxy;
+
+	# added in 22.02
+	rte_eth_direct_rxrearm_map;
 };
 
 INTERNAL {
-- 
2.25.1



* [RFC PATCH v1 3/4] net/i40e: add direct re-arm mode internal API
  2021-12-24 16:46 [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Feifei Wang
  2021-12-24 16:46 ` [RFC PATCH v1 1/4] net/i40e: enable direct re-arm mode Feifei Wang
  2021-12-24 16:46 ` [RFC PATCH v1 2/4] ethdev: add API for " Feifei Wang
@ 2021-12-24 16:46 ` Feifei Wang
  2021-12-24 16:46 ` [RFC PATCH v1 4/4] examples/l3fwd: give an example for direct rearm mode Feifei Wang
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2021-12-24 16:46 UTC (permalink / raw)
  To: Beilei Xing; +Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

For direct re-arm mode, add two internal APIs for i40e.

One is to enable direct re-arm mode on an Rx queue.

The other is to map a Tx queue to an Rx queue so that the Rx queue takes
buffers from that specific Tx queue.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/i40e/i40e_ethdev.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index c0bfff43ee..33f89c5d9a 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -369,6 +369,13 @@ static int i40e_dev_rx_queue_intr_enable(struct rte_eth_dev *dev,
 static int i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev,
 					  uint16_t queue_id);
 
+static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
+						uint16_t queue_id);
+static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
+						uint16_t rx_queue_id,
+						uint16_t tx_port_id,
+						uint16_t tx_queue_id);
+
 static int i40e_get_regs(struct rte_eth_dev *dev,
 			 struct rte_dev_reg_info *regs);
 
@@ -476,6 +483,8 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.rx_queue_setup               = i40e_dev_rx_queue_setup,
 	.rx_queue_intr_enable         = i40e_dev_rx_queue_intr_enable,
 	.rx_queue_intr_disable        = i40e_dev_rx_queue_intr_disable,
+	.rx_queue_direct_rearm_enable = i40e_dev_rx_queue_direct_rearm_enable,
+	.rx_queue_direct_rearm_map    = i40e_dev_rx_queue_direct_rearm_map,
 	.rx_queue_release             = i40e_dev_rx_queue_release,
 	.tx_queue_setup               = i40e_dev_tx_queue_setup,
 	.tx_queue_release             = i40e_dev_tx_queue_release,
@@ -11115,6 +11124,31 @@ i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id)
 	return 0;
 }
 
+static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
+			uint16_t queue_id)
+{
+	struct i40e_rx_queue *rxq;
+
+	rxq = dev->data->rx_queues[queue_id];
+	rxq->direct_rxrearm_enable = 1;
+
+	return 0;
+}
+
+static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
+				uint16_t rx_queue_id, uint16_t tx_port_id,
+				uint16_t tx_queue_id)
+{
+	struct i40e_rx_queue *rxq;
+
+	rxq = dev->data->rx_queues[rx_queue_id];
+
+	rxq->direct_rxrearm_port = tx_port_id;
+	rxq->direct_rxrearm_queue = tx_queue_id;
+
+	return 0;
+}
+
 /**
  * This function is used to check if the register is valid.
  * Below is the valid registers list for X722 only:
-- 
2.25.1



* [RFC PATCH v1 4/4] examples/l3fwd: give an example for direct rearm mode
  2021-12-24 16:46 [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Feifei Wang
                   ` (2 preceding siblings ...)
  2021-12-24 16:46 ` [RFC PATCH v1 3/4] net/i40e: add direct re-arm mode internal API Feifei Wang
@ 2021-12-24 16:46 ` Feifei Wang
  2021-12-26 10:25 ` [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Morten Brørup
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2021-12-24 16:46 UTC (permalink / raw)
  Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

This patch gives an example of how a user can call the API to enable
direct re-arm mode.

Command (Two flows):
./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
-- -p 0x3 -P --config='(0,0,1),(1,0,1)'

This is a single-core case. By using the API, Rx queue 0 of port 0 can
directly re-arm buffers from Tx queue 0 of port 1, and Rx queue 0 of
port 1 can directly re-arm buffers from Tx queue 0 of port 0.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 examples/l3fwd/main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index eb68ffc5aa..e7801b9f04 100644
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -1439,6 +1439,9 @@ main(int argc, char **argv)
 		}
 	}
 
+	rte_eth_direct_rxrearm_map(0, 0, 1, 0);
+	rte_eth_direct_rxrearm_map(1, 0, 0, 0);
+
 	printf("\n");
 
 	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
-- 
2.25.1



* Re: [RFC PATCH v1 2/4] ethdev: add API for direct re-arm mode
  2021-12-24 16:46 ` [RFC PATCH v1 2/4] ethdev: add API for " Feifei Wang
@ 2021-12-24 19:38   ` Stephen Hemminger
  2021-12-26  9:49     ` RE: " Feifei Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Stephen Hemminger @ 2021-12-24 19:38 UTC (permalink / raw)
  To: Feifei Wang
  Cc: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella,
	dev, nd, Honnappa Nagarahalli, Ruifeng Wang

On Sat, 25 Dec 2021 00:46:10 +0800
Feifei Wang <feifei.wang2@arm.com> wrote:

> +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> +		uint16_t tx_port_id, uint16_t tx_queue_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	dev = &rte_eth_devices[rx_port_id];
> +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
> +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> +			tx_port_id, tx_queue_id);
> +

Indirect calls are expensive, maybe better to combine these?


* RE: [RFC PATCH v1 2/4] ethdev: add API for direct re-arm mode
  2021-12-24 19:38   ` Stephen Hemminger
@ 2021-12-26  9:49     ` Feifei Wang
  2021-12-26 10:31       ` Morten Brørup
  0 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2021-12-26  9:49 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: thomas, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, dev, nd,
	Honnappa Nagarahalli, Ruifeng Wang, nd


> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Saturday, December 25, 2021 3:38 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: thomas@monjalon.net; Ferruh Yigit <ferruh.yigit@intel.com>; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> <mdr@ashroe.eu>; dev@dpdk.org; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: Re: [RFC PATCH v1 2/4] ethdev: add API for direct re-arm mode
> 
> On Sat, 25 Dec 2021 00:46:10 +0800
> Feifei Wang <feifei.wang2@arm.com> wrote:
> 
> > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> > +		uint16_t tx_port_id, uint16_t tx_queue_id) {
> > +	struct rte_eth_dev *dev;
> > +
> > +	dev = &rte_eth_devices[rx_port_id];
> > +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> rx_queue_id);
> > +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > +			tx_port_id, tx_queue_id);
> > +
> 
> Indirect calls are expensive, maybe better to combine these?
Thanks for your comment. I'm a little confused about this.

Do you mean that the cost comes from 'enable direct re-arm mode' and
'map queue' being two separate driver callbacks, and that we should
combine them into a single API?
If so, I think you are right, and one API is enough to both enable direct
re-arm mode and map the queues.
 


* RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2021-12-24 16:46 [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Feifei Wang
                   ` (3 preceding siblings ...)
  2021-12-24 16:46 ` [RFC PATCH v1 4/4] examples/l3fwd: give an example for direct rearm mode Feifei Wang
@ 2021-12-26 10:25 ` Morten Brørup
  2021-12-28  6:55   ` RE: " Feifei Wang
  2022-01-27  4:06   ` Honnappa Nagarahalli
  2023-03-23 10:43 ` [PATCH v4 0/3] Recycle buffers from Tx to Rx Feifei Wang
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 67+ messages in thread
From: Morten Brørup @ 2021-12-26 10:25 UTC (permalink / raw)
  To: Feifei Wang; +Cc: dev, nd

> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Friday, 24 December 2021 17.46
> 
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into it's software ring. This will avoid the
> 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache.
> 
> However, this solution poses several constraint:
> 
> 1)The receive queue needs to know which transmit queue it should take
> the buffers from. The application logic decides which transmit port to
> use to send out the packets. In many use cases the NIC might have a
> single port ([1], [2], [3]), in which case a given transmit queue is
> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> is easy to configure.
> 
> If the NIC has 2 ports (there are several references), then we will
> have
> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> However, if this is generalized to 'N' ports, the configuration can be
> long. More over the PMD would have to scan a list of transmit queues to
> pull the buffers from.

I disagree with the description of this constraint.

As I understand it, it doesn't matter now many ports or queues are in a NIC or system.

The constraint is more narrow:

This patch requires that all packets ingressing on some port/queue must egress on the specific port/queue that it has been configured to re-arm its buffers from. I.e. an application cannot route packets between multiple ports with this patch.

> 
> 2)The other factor that needs to be considered is 'run-to-completion'
> vs
> 'pipeline' models. In the run-to-completion model, the receive side and
> the transmit side are running on the same lcore serially. In the
> pipeline
> model. The receive side and transmit side might be running on different
> lcores in parallel. This requires locking. This is not supported at
> this
> point.
> 
> 3)Tx and Rx buffers must be from the same mempool. And we also must
> ensure Tx buffer free number is equal to Rx buffer free number:
> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
> Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> is due to tx_next_dd is a variable to compute tx sw-ring free location.
> Its value will be one more round than the position where next time free
> starts.
> 

You are missing the fourth constraint:

4) The application must transmit all received packets immediately, i.e. QoS queueing and similar is prohibited.

> Current status in this RFC:
> 1)An API is added to allow for mapping a TX queue to a RX queue.
>   Currently it supports 1:1 mapping.
> 2)The i40e driver is changed to do the direct re-arm of the receive
>   side.
> 3)L3fwd application is hacked to do the mapping for the following
> command:
>   one core two flows case:
>   $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
>   -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
>   where:
>   Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
>   Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
> 
> Testing status:
> 1)Tested L3fwd with the above command:
> The testing results for L3fwd are as follows:
> -------------------------------------------------------------------
> N1SDP:
> Base performance(with this patch)   with direct re-arm mode enabled
>       0%                                  +14.1%
> 
> Ampere Altra:
> Base performance(with this patch)   with direct re-arm mode enabled
>       0%                                  +17.1%
> -------------------------------------------------------------------
> This patch can not affect performance of normal mode, and if enable
> direct-rearm mode, performance can be improved by 14% - 17% in n1sdp
> and ampera-altra.
> 
> Feedback requested:
> 1) Has anyone done any similar experiments, any lessons learnt?
> 2) Feedback on API
> 
> Next steps:
> 1) Update the code for supporting 1:N(Rx : TX) mapping
> 2) Automate the configuration in L3fwd sample application
> 
> Reference:
> [1] https://store.nvidia.com/en-
> us/networking/store/product/MCX623105AN-
> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-
> ethernet-network-adapter-e810cqda1/specifications.html
> [3] https://www.broadcom.com/products/ethernet-connectivity/network-
> adapters/100gb-nic-ocp/n1100g
> 
> Feifei Wang (4):
>   net/i40e: enable direct re-arm mode
>   ethdev: add API for direct re-arm mode
>   net/i40e: add direct re-arm mode internal API
>   examples/l3fwd: give an example for direct rearm mode
> 
>  drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
>  drivers/net/i40e/i40e_rxtx.h          |   4 +
>  drivers/net/i40e/i40e_rxtx_vec_neon.c | 149 +++++++++++++++++++++++++-
>  examples/l3fwd/main.c                 |   3 +
>  lib/ethdev/ethdev_driver.h            |  15 +++
>  lib/ethdev/rte_ethdev.c               |  14 +++
>  lib/ethdev/rte_ethdev.h               |  31 ++++++
>  lib/ethdev/version.map                |   3 +
>  8 files changed, 251 insertions(+), 2 deletions(-)
> 
> --
> 2.25.1
> 

The patch provides a significant performance improvement, but I am wondering if any real world applications exist that would use this. Only a "router on a stick" (i.e. a single-port router) comes to my mind, and that is probably sufficient to call it useful in the real world. Do you have any other examples to support the usefulness of this patch?

Anyway, the patch doesn't do any harm if unused, and the only performance cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev driver. So I don't oppose to it.




* RE: [RFC PATCH v1 2/4] ethdev: add API for direct re-arm mode
  2021-12-26  9:49     ` RE: " Feifei Wang
@ 2021-12-26 10:31       ` Morten Brørup
  0 siblings, 0 replies; 67+ messages in thread
From: Morten Brørup @ 2021-12-26 10:31 UTC (permalink / raw)
  To: Feifei Wang, Stephen Hemminger
  Cc: thomas, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, dev, nd,
	Honnappa Nagarahalli, Ruifeng Wang, nd

> From: Feifei Wang [mailto:Feifei.Wang2@arm.com]
> Sent: Sunday, 26 December 2021 10.50
> 
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Sent: Saturday, December 25, 2021 3:38 AM
> >
> > On Sat, 25 Dec 2021 00:46:10 +0800
> > Feifei Wang <feifei.wang2@arm.com> wrote:
> >
> > > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> > > +		uint16_t tx_port_id, uint16_t tx_queue_id) {
> > > +	struct rte_eth_dev *dev;
> > > +
> > > +	dev = &rte_eth_devices[rx_port_id];
> > > +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> > rx_queue_id);
> > > +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > > +			tx_port_id, tx_queue_id);
> > > +
> >
> > Indirect calls are expensive, maybe better to combine these?
> Thanks for your comment. I'm a little confused about this.
> 
> Whether 'expensive' is due to 'enable direct_rearm mode' and 'map
> queue' use
> different api, and we should combine them into the same API.
> If this, I think you are right, and just one api is enough to enable
> direct rearm mode and
> map queue.
> 

It's used in the control plane, so prefer readability over performance.

Also, the Ethdev API has separate _enable/_disable/_configure functions for many features like this. I would prefer the API to this feature to be similar, i.e. separate functions for enable and configure.
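
A purely illustrative sketch of such a split (these names are hypothetical and not part of this patch set):

  /* Enable direct re-arm mode on an Rx queue. */
  int rte_eth_direct_rxrearm_enable(uint16_t rx_port_id, uint16_t rx_queue_id);
  /* Map an enabled Rx queue to the Tx queue it re-arms buffers from. */
  int rte_eth_direct_rxrearm_queue_map(uint16_t rx_port_id, uint16_t rx_queue_id,
                                       uint16_t tx_port_id, uint16_t tx_queue_id);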

In other words: Disregard Stephen's comment about combining.



* RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2021-12-26 10:25 ` [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Morten Brørup
@ 2021-12-28  6:55   ` Feifei Wang
  2022-01-18 15:51     ` Ferruh Yigit
  2022-01-27  4:06   ` Honnappa Nagarahalli
  1 sibling, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2021-12-28  6:55 UTC (permalink / raw)
  To: Morten Brørup; +Cc: dev, nd, nd

Thanks for your comments.

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Sunday, December 26, 2021 6:25 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>
> Subject: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Friday, 24 December 2021 17.46
> >
> > Currently, the transmit side frees the buffers into the lcore cache
> > and the receive side allocates buffers from the lcore cache. The
> > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > stores to lcore cache. The receive side allocates 32 buffers and
> > stores them in the receive side software ring, resulting in 32*8=256B
> > of stores and 256B of load from the lcore cache.
> >
> > This patch proposes a mechanism to avoid freeing to/allocating from
> > the lcore cache. i.e. the receive side will free the buffers from
> > transmit side directly into it's software ring. This will avoid the
> > 256B of loads and stores introduced by the lcore cache. It also frees
> > up the cache lines used by the lcore cache.
> >
> > However, this solution poses several constraint:
> >
> > 1)The receive queue needs to know which transmit queue it should take
> > the buffers from. The application logic decides which transmit port to
> > use to send out the packets. In many use cases the NIC might have a
> > single port ([1], [2], [3]), in which case a given transmit queue is
> > always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> > is easy to configure.
> >
> > If the NIC has 2 ports (there are several references), then we will
> > have
> > 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> > However, if this is generalized to 'N' ports, the configuration can be
> > long. More over the PMD would have to scan a list of transmit queues
> > to pull the buffers from.
> 
> I disagree with the description of this constraint.
> 
> As I understand it, it doesn't matter now many ports or queues are in a NIC
> or system.
> 
> The constraint is more narrow:
> 
> This patch requires that all packets ingressing on some port/queue must
> egress on the specific port/queue that it has been configured to re-arm its
> buffers from. I.e. an application cannot route packets between multiple
> ports with this patch.

First, I agree that direct re-arm mode is suitable for the case where the
user knows the direction of the flow in advance and maps the Rx/Tx queues
to each other. It is not suitable for the normal case of random packet
routing.

Second, in our two proposed cases (one-port NIC and two-port NIC) the
direction of the flow is determined. Furthermore, for a two-port NIC there
may be two flow directions: from port 0 to port 1, or from port 0 to port 0.
Thus we need to have a 1:2 (Rx queue : Tx queue) mapping.

Finally, maybe we can change our description as follows:
"The first constraint is that the user should know the direction of the flow
in advance, and based on this, the user needs to map the Rx and Tx queues
according to the flow direction:
For example, if the NIC just has one port
 ......
Or if the NIC has two ports
......."
 
> 
> >
> > 2)The other factor that needs to be considered is 'run-to-completion'
> > vs
> > 'pipeline' models. In the run-to-completion model, the receive side
> > and the transmit side are running on the same lcore serially. In the
> > pipeline model. The receive side and transmit side might be running on
> > different lcores in parallel. This requires locking. This is not
> > supported at this point.
> >
> > 3)Tx and Rx buffers must be from the same mempool. And we also must
> > ensure Tx buffer free number is equal to Rx buffer free number:
> > (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH) Thus, 'tx_next_dd'
> > can be updated correctly in direct-rearm mode. This is due to
> > tx_next_dd is a variable to compute tx sw-ring free location.
> > Its value will be one more round than the position where next time
> > free starts.
> >
> 
> You are missing the fourth constraint:
> 
> 4) The application must transmit all received packets immediately, i.e. QoS
> queueing and similar is prohibited.
> 

You are right and this is indeed one of the limitations.

> > Current status in this RFC:
> > 1)An API is added to allow for mapping a TX queue to a RX queue.
> >   Currently it supports 1:1 mapping.
> > 2)The i40e driver is changed to do the direct re-arm of the receive
> >   side.
> > 3)L3fwd application is hacked to do the mapping for the following
> > command:
> >   one core two flows case:
> >   $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
> >   -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
> >   where:
> >   Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
> >   Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
> >
> > Testing status:
> > 1)Tested L3fwd with the above command:
> > The testing results for L3fwd are as follows:
> > -------------------------------------------------------------------
> > N1SDP:
> > Base performance(with this patch)   with direct re-arm mode enabled
> >       0%                                  +14.1%
> >
> > Ampere Altra:
> > Base performance(with this patch)   with direct re-arm mode enabled
> >       0%                                  +17.1%
> > -------------------------------------------------------------------
> > This patch can not affect performance of normal mode, and if enable
> > direct-rearm mode, performance can be improved by 14% - 17% in n1sdp
> > and ampera-altra.
> >
> > Feedback requested:
> > 1) Has anyone done any similar experiments, any lessons learnt?
> > 2) Feedback on API
> >
> > Next steps:
> > 1) Update the code for supporting 1:N(Rx : TX) mapping
> > 2) Automate the configuration in L3fwd sample application
> >
> > Reference:
> > [1] https://store.nvidia.com/en-
> > us/networking/store/product/MCX623105AN-
> >
> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECrypt
> oDisabled
> > / [2]
> > https://www.intel.com/content/www/us/en/products/sku/192561/intel-
> > ethernet-network-adapter-e810cqda1/specifications.html
> > [3] https://www.broadcom.com/products/ethernet-
> connectivity/network-
> > adapters/100gb-nic-ocp/n1100g
> >
> > Feifei Wang (4):
> >   net/i40e: enable direct re-arm mode
> >   ethdev: add API for direct re-arm mode
> >   net/i40e: add direct re-arm mode internal API
> >   examples/l3fwd: give an example for direct rearm mode
> >
> >  drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
> >  drivers/net/i40e/i40e_rxtx.h          |   4 +
> >  drivers/net/i40e/i40e_rxtx_vec_neon.c | 149
> +++++++++++++++++++++++++-
> >  examples/l3fwd/main.c                 |   3 +
> >  lib/ethdev/ethdev_driver.h            |  15 +++
> >  lib/ethdev/rte_ethdev.c               |  14 +++
> >  lib/ethdev/rte_ethdev.h               |  31 ++++++
> >  lib/ethdev/version.map                |   3 +
> >  8 files changed, 251 insertions(+), 2 deletions(-)
> >
> > --
> > 2.25.1
> >
> 
> The patch provides a significant performance improvement, but I am
> wondering if any real world applications exist that would use this. Only a
> "router on a stick" (i.e. a single-port router) comes to my mind, and that is
> probably sufficient to call it useful in the real world. Do you have any other
> examples to support the usefulness of this patch?
> 
One case I have is network security. For a network firewall, all packets need
to ingress on a specified port and egress on a specified port to do packet
filtering. In this case, we can know the flow direction in advance.

> Anyway, the patch doesn't do any harm if unused, and the only performance
> cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev driver. So I
> don't oppose to it.
> 



* Re: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2021-12-28  6:55   ` RE: " Feifei Wang
@ 2022-01-18 15:51     ` Ferruh Yigit
  2022-01-18 16:53       ` Thomas Monjalon
  2023-02-28  6:43       ` RE: " Feifei Wang
  0 siblings, 2 replies; 67+ messages in thread
From: Ferruh Yigit @ 2022-01-18 15:51 UTC (permalink / raw)
  To: Feifei Wang, Morten Brørup
  Cc: dev, nd, Thomas Monjalon, Andrew Rybchenko, Qi Zhang, Beilei Xing

On 12/28/2021 6:55 AM, Feifei Wang wrote:
> Thanks for your comments.
> 
>> -----Original Message-----
>> From: Morten Brørup <mb@smartsharesystems.com>
>> Sent: Sunday, December 26, 2021 6:25 PM
>> To: Feifei Wang <Feifei.Wang2@arm.com>
>> Cc: dev@dpdk.org; nd <nd@arm.com>
>> Subject: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
>>
>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
>>> Sent: Friday, 24 December 2021 17.46
>>>
>>> Currently, the transmit side frees the buffers into the lcore cache
>>> and the receive side allocates buffers from the lcore cache. The
>>> transmit side typically frees 32 buffers resulting in 32*8=256B of
>>> stores to lcore cache. The receive side allocates 32 buffers and
>>> stores them in the receive side software ring, resulting in 32*8=256B
>>> of stores and 256B of load from the lcore cache.
>>>
>>> This patch proposes a mechanism to avoid freeing to/allocating from
>>> the lcore cache. i.e. the receive side will free the buffers from
>>> transmit side directly into it's software ring. This will avoid the
>>> 256B of loads and stores introduced by the lcore cache. It also frees
>>> up the cache lines used by the lcore cache.
>>>
>>> However, this solution poses several constraint:
>>>
>>> 1)The receive queue needs to know which transmit queue it should take
>>> the buffers from. The application logic decides which transmit port to
>>> use to send out the packets. In many use cases the NIC might have a
>>> single port ([1], [2], [3]), in which case a given transmit queue is
>>> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
>>> is easy to configure.
>>>
>>> If the NIC has 2 ports (there are several references), then we will
>>> have
>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
>>> However, if this is generalized to 'N' ports, the configuration can be
>>> long. More over the PMD would have to scan a list of transmit queues
>>> to pull the buffers from.
>>
>> I disagree with the description of this constraint.
>>
>> As I understand it, it doesn't matter now many ports or queues are in a NIC
>> or system.
>>
>> The constraint is more narrow:
>>
>> This patch requires that all packets ingressing on some port/queue must
>> egress on the specific port/queue that it has been configured to re-arm its
>> buffers from. I.e. an application cannot route packets between multiple
>> ports with this patch.
> 
> First, I agree with that direct-rearm mode is suitable for the case that
> user should know the direction of the flow in advance and map rx/tx with
> each other. It is not suitable for the normal packet random route case.
> 
> Second, our proposed two cases: one port NIC and two port NIC means the
> direction of flow is determined. Furthermore, for two port NIC, there maybe two
> flow directions: from port 0 to port 1, or from port 0 to port 0. Thus we need to have
> 1:2 (Rx queue :  Tx queue) mapping.
> 
> At last, maybe we can change our description as follows:
> "The first constraint is that user should know the direction of the flow in advance,
> and based on this, user needs to map the Rx and Tx queues according to the flow direction:
> For example, if the NIC just has one port
>   ......
> Or if the NIC have two ports
> ......."
>   
>>
>>>
>>> 2)The other factor that needs to be considered is 'run-to-completion'
>>> vs
>>> 'pipeline' models. In the run-to-completion model, the receive side
>>> and the transmit side are running on the same lcore serially. In the
>>> pipeline model. The receive side and transmit side might be running on
>>> different lcores in parallel. This requires locking. This is not
>>> supported at this point.
>>>
>>> 3)Tx and Rx buffers must be from the same mempool. And we also must
>>> ensure Tx buffer free number is equal to Rx buffer free number:
>>> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH) Thus, 'tx_next_dd'
>>> can be updated correctly in direct-rearm mode. This is due to
>>> tx_next_dd is a variable to compute tx sw-ring free location.
>>> Its value will be one more round than the position where next time
>>> free starts.
>>>
>>
>> You are missing the fourth constraint:
>>
>> 4) The application must transmit all received packets immediately, i.e. QoS
>> queueing and similar is prohibited.
>>
> 
> You are right and this is indeed one of the limitations.
> 
>>> Current status in this RFC:
>>> 1)An API is added to allow for mapping a TX queue to a RX queue.
>>>    Currently it supports 1:1 mapping.
>>> 2)The i40e driver is changed to do the direct re-arm of the receive
>>>    side.
>>> 3)L3fwd application is hacked to do the mapping for the following
>>> command:
>>>    one core two flows case:
>>>    $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
>>>    -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
>>>    where:
>>>    Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
>>>    Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
>>>
>>> Testing status:
>>> 1)Tested L3fwd with the above command:
>>> The testing results for L3fwd are as follows:
>>> -------------------------------------------------------------------
>>> N1SDP:
>>> Base performance(with this patch)   with direct re-arm mode enabled
>>>        0%                                  +14.1%
>>>
>>> Ampere Altra:
>>> Base performance(with this patch)   with direct re-arm mode enabled
>>>        0%                                  +17.1%
>>> -------------------------------------------------------------------
>>> This patch can not affect performance of normal mode, and if enable
>>> direct-rearm mode, performance can be improved by 14% - 17% in n1sdp
>>> and ampera-altra.
>>>
>>> Feedback requested:
>>> 1) Has anyone done any similar experiments, any lessons learnt?
>>> 2) Feedback on API
>>>
>>> Next steps:
>>> 1) Update the code for supporting 1:N(Rx : TX) mapping
>>> 2) Automate the configuration in L3fwd sample application
>>>
>>> Reference:
>>> [1] https://store.nvidia.com/en-
>>> us/networking/store/product/MCX623105AN-
>>>
>> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECrypt
>> oDisabled
>>> / [2]
>>> https://www.intel.com/content/www/us/en/products/sku/192561/intel-
>>> ethernet-network-adapter-e810cqda1/specifications.html
>>> [3] https://www.broadcom.com/products/ethernet-
>> connectivity/network-
>>> adapters/100gb-nic-ocp/n1100g
>>>
>>> Feifei Wang (4):
>>>    net/i40e: enable direct re-arm mode
>>>    ethdev: add API for direct re-arm mode
>>>    net/i40e: add direct re-arm mode internal API
>>>    examples/l3fwd: give an example for direct rearm mode
>>>
>>>   drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
>>>   drivers/net/i40e/i40e_rxtx.h          |   4 +
>>>   drivers/net/i40e/i40e_rxtx_vec_neon.c | 149
>> +++++++++++++++++++++++++-
>>>   examples/l3fwd/main.c                 |   3 +
>>>   lib/ethdev/ethdev_driver.h            |  15 +++
>>>   lib/ethdev/rte_ethdev.c               |  14 +++
>>>   lib/ethdev/rte_ethdev.h               |  31 ++++++
>>>   lib/ethdev/version.map                |   3 +
>>>   8 files changed, 251 insertions(+), 2 deletions(-)
>>>
>>> --
>>> 2.25.1
>>>
>>
>> The patch provides a significant performance improvement, but I am
>> wondering if any real world applications exist that would use this. Only a
>> "router on a stick" (i.e. a single-port router) comes to my mind, and that is
>> probably sufficient to call it useful in the real world. Do you have any other
>> examples to support the usefulness of this patch?
>>
> One case I have is about network security. For network firewall, all packets need
> to ingress on the specified port and egress on the specified port to do packet filtering.
> In this case, we can know flow direction in advance.
> 

I also have some concerns on how useful this API will be in real life,
and whether the use case is worth the complexity it brings.
And it looks like too much low-level detail for the application.

cc'ed a few more folks for comment.

>> Anyway, the patch doesn't do any harm if unused, and the only performance
>> cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev driver. So I
>> don't oppose to it.
>>
> 



* Re: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2022-01-18 15:51     ` Ferruh Yigit
@ 2022-01-18 16:53       ` Thomas Monjalon
  2022-01-18 17:27         ` Morten Brørup
  2022-01-27  5:16         ` Honnappa Nagarahalli
  2023-02-28  6:43       ` RE: " Feifei Wang
  1 sibling, 2 replies; 67+ messages in thread
From: Thomas Monjalon @ 2022-01-18 16:53 UTC (permalink / raw)
  To: Feifei Wang, Morten Brørup, Ferruh Yigit
  Cc: dev, nd, Andrew Rybchenko, Qi Zhang, Beilei Xing

[quick summary: ethdev API to bypass mempool]

18/01/2022 16:51, Ferruh Yigit:
> On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > Morten Brørup <mb@smartsharesystems.com>:
> >> The patch provides a significant performance improvement, but I am
> >> wondering if any real world applications exist that would use this. Only a
> >> "router on a stick" (i.e. a single-port router) comes to my mind, and that is
> >> probably sufficient to call it useful in the real world. Do you have any other
> >> examples to support the usefulness of this patch?
> >>
> > One case I have is about network security. For network firewall, all packets need
> > to ingress on the specified port and egress on the specified port to do packet filtering.
> > In this case, we can know flow direction in advance.
> 
> I also have some concerns on how useful this API will be in real life,
> and does the use case worth the complexity it brings.
> And it looks too much low level detail for the application.

That's difficult to judge.
The use case is limited and the API has some severe limitations.
The benefit is measured with l3fwd, which is not exactly a real app.
Do we want an API which improves performance in limited scenarios
at the cost of breaking some general design assumptions?

Can we achieve the same level of performance with a mempool trick?




* RE: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2022-01-18 16:53       ` Thomas Monjalon
@ 2022-01-18 17:27         ` Morten Brørup
  2022-01-27  5:24           ` Honnappa Nagarahalli
  2022-01-27  5:16         ` Honnappa Nagarahalli
  1 sibling, 1 reply; 67+ messages in thread
From: Morten Brørup @ 2022-01-18 17:27 UTC (permalink / raw)
  To: Thomas Monjalon, Feifei Wang, Ferruh Yigit
  Cc: dev, nd, Andrew Rybchenko, Qi Zhang, Beilei Xing

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Tuesday, 18 January 2022 17.54
> 
> [quick summary: ethdev API to bypass mempool]
> 
> 18/01/2022 16:51, Ferruh Yigit:
> > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > Morten Brørup <mb@smartsharesystems.com>:
> > >> The patch provides a significant performance improvement, but I am
> > >> wondering if any real world applications exist that would use
> this. Only a
> > >> "router on a stick" (i.e. a single-port router) comes to my mind,
> and that is
> > >> probably sufficient to call it useful in the real world. Do you
> have any other
> > >> examples to support the usefulness of this patch?
> > >>
> > > One case I have is about network security. For network firewall,
> all packets need
> > > to ingress on the specified port and egress on the specified port
> to do packet filtering.
> > > In this case, we can know flow direction in advance.
> >
> > I also have some concerns on how useful this API will be in real
> life,
> > and does the use case worth the complexity it brings.
> > And it looks too much low level detail for the application.
> 
> That's difficult to judge.
> The use case is limited and the API has some severe limitations.
> The benefit is measured with l3fwd, which is not exactly a real app.
> Do we want an API which improves performance in limited scenarios
> at the cost of breaking some general design assumptions?
> 
> Can we achieve the same level of performance with a mempool trick?

Perhaps the mbuf library could offer bulk functions for alloc/free of raw mbufs - essentially a shortcut directly to the mempool library.
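
A minimal sketch of what such a helper could look like (the bulk function name is hypothetical; today only the single-mbuf rte_mbuf_raw_alloc() exists):

  #include <rte_mbuf.h>
  #include <rte_mempool.h>

  /* Hypothetical bulk counterpart to rte_mbuf_raw_alloc(): take raw mbufs
   * straight from the mempool (and its per-lcore cache), skipping the
   * per-mbuf reset work done by rte_pktmbuf_alloc_bulk().
   */
  static inline int
  rte_mbuf_raw_alloc_bulk(struct rte_mempool *mp, struct rte_mbuf **mbufs,
                          unsigned int count)
  {
          return rte_mempool_get_bulk(mp, (void **)mbufs, count);
  }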

There might be a few more details to micro-optimize in the mempool library, if approached with this use case in mind. E.g. the rte_mempool_default_cache() could do with a few unlikely() in its comparisons.

Also, for this use case, the mempool library adds tracing overhead, which this API bypasses. And considering how short the code path through the mempool cache is, the tracing overhead is relatively large. I.e.: memcpy(NIC->NIC) vs. trace() memcpy(NIC->cache) trace() memcpy(cache->NIC).

A key optimization point could be the number of mbufs being moved to/from the mempool cache. If that number was fixed at compile time, a faster memcpy() could be used. However, it seems that different PMDs use bursts of either 4, 8, or in this case 32 mbufs. If only they could agree on such a simple detail.
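
Illustrative only: with a burst size fixed at compile time, the copy into the mempool cache could use a constant-sized rte_memcpy(), which compilers unroll and vectorize much better than a variable-length copy:

  #define FIXED_BURST 32  /* hypothetical agreed-upon burst size */
  /* copy the burst of mbuf pointers into the per-lcore cache */
  rte_memcpy(&cache->objs[cache->len], obj_table,
             FIXED_BURST * sizeof(void *));
  cache->len += FIXED_BURST;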

Overall, I strongly agree that it is preferable to optimize the core libraries, rather than bypass them. Bypassing will eventually lead to "spaghetti code".



* RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2021-12-26 10:25 ` [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Morten Brørup
  2021-12-28  6:55   ` RE: " Feifei Wang
@ 2022-01-27  4:06   ` Honnappa Nagarahalli
  2022-01-27 17:13     ` Morten Brørup
  2022-01-28 11:29     ` Morten Brørup
  1 sibling, 2 replies; 67+ messages in thread
From: Honnappa Nagarahalli @ 2022-01-27  4:06 UTC (permalink / raw)
  To: Morten Brørup, Feifei Wang; +Cc: dev, nd, Honnappa Nagarahalli, nd

Thanks Morten, appreciate your comments. Few responses inline.

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Sunday, December 26, 2021 4:25 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>
> Subject: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Friday, 24 December 2021 17.46
> >
<snip>

> >
> > However, this solution poses several constraint:
> >
> > 1)The receive queue needs to know which transmit queue it should take
> > the buffers from. The application logic decides which transmit port to
> > use to send out the packets. In many use cases the NIC might have a
> > single port ([1], [2], [3]), in which case a given transmit queue is
> > always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> > is easy to configure.
> >
> > If the NIC has 2 ports (there are several references), then we will
> > have
> > 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> > However, if this is generalized to 'N' ports, the configuration can be
> > long. More over the PMD would have to scan a list of transmit queues
> > to pull the buffers from.
> 
> I disagree with the description of this constraint.
> 
> As I understand it, it doesn't matter now many ports or queues are in a NIC or
> system.
> 
> The constraint is more narrow:
> 
> This patch requires that all packets ingressing on some port/queue must
> egress on the specific port/queue that it has been configured to ream its
> buffers from. I.e. an application cannot route packets between multiple ports
> with this patch.
Agree, this patch as is has this constraint. It is not a constraint that would apply to NICs with a single port. The above text describes some of the issues associated with generalizing the solution for N ports. If N is small, the configuration is small and scanning should not be bad.

> 
> >

<snip>

> >
> 
> You are missing the fourth constraint:
> 
> 4) The application must transmit all received packets immediately, i.e. QoS
> queueing and similar is prohibited.
I do not understand this, can you please elaborate? Even if there is QoS queuing, there would be a steady stream of packets being transmitted. These transmitted packets will fill the buffers on the RX side.

> 
<snip>

> >
> 
> The patch provides a significant performance improvement, but I am
> wondering if any real world applications exist that would use this. Only a
> "router on a stick" (i.e. a single-port router) comes to my mind, and that is
> probably sufficient to call it useful in the real world. Do you have any other
> examples to support the usefulness of this patch?
SmartNIC is a clear and dominant use case; such NICs typically have a single port for data plane traffic (dual ports are mostly for redundancy).
This patch avoids a good amount of store operations. The smaller CPUs found in SmartNICs have smaller store buffers which can become bottlenecks. Avoiding the lcore cache also saves valuable HW cache space.

> 
> Anyway, the patch doesn't do any harm if unused, and the only performance
> cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev driver. So I
> don't oppose to it.
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: 回复: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2022-01-18 16:53       ` Thomas Monjalon
  2022-01-18 17:27         ` Morten Brørup
@ 2022-01-27  5:16         ` Honnappa Nagarahalli
  1 sibling, 0 replies; 67+ messages in thread
From: Honnappa Nagarahalli @ 2022-01-27  5:16 UTC (permalink / raw)
  To: thomas, Feifei Wang, Morten Brørup, Ferruh Yigit
  Cc: dev, nd, Andrew Rybchenko, Qi Zhang, Beilei Xing,
	Honnappa Nagarahalli, nd

<snip>

> 
> [quick summary: ethdev API to bypass mempool]
> 
> 18/01/2022 16:51, Ferruh Yigit:
> > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > Morten Brørup <mb@smartsharesystems.com>:
> > >> The patch provides a significant performance improvement, but I am
> > >> wondering if any real world applications exist that would use this.
> > >> Only a "router on a stick" (i.e. a single-port router) comes to my
> > >> mind, and that is probably sufficient to call it useful in the real
> > >> world. Do you have any other examples to support the usefulness of this
> patch?
> > >>
> > > One case I have is about network security. For network firewall, all
> > > packets need to ingress on the specified port and egress on the specified
> port to do packet filtering.
> > > In this case, we can know flow direction in advance.
> >
> > I also have some concerns on how useful this API will be in real life,
> > and does the use case worth the complexity it brings.
> > And it looks too much low level detail for the application.
I think the application writer already needs to know many low level details to be able to extract performance out of PMDs. For ex: fast free, 

> 
> That's difficult to judge.
> The use case is limited and the API has some severe limitations.
The use case applies to SmartNICs, which is a major use case. In terms of limitations, it depends on how one sees it. For ex: the lcore cache is not applicable to the pipeline mode, but it is still accepted as it is helpful for something else.

> The benefit is measured with l3fwd, which is not exactly a real app.
It is funny how we treat l3fwd. When it shows performance improvement, we treat it as 'not a real application'. When it shows (even a small) performance drop, the patches are not accepted. We need to make up our mind 😊

> Do we want an API which improves performance in limited scenarios at the
> cost of breaking some general design assumptions?
It is not breaking any existing design assumptions. It is a very well-suited optimization for the SmartNIC use case. For this use case, it does not make sense for the same thread to copy data to a temp location (the lcore cache), read it back immediately, and store it in another location. It is a waste of CPU cycles and memory bandwidth.

> 
> Can we achieve the same level of performance with a mempool trick?
We cannot, as this patch basically avoids the memory loads and stores (which reduces backend stalls) caused by the temporary storage in the lcore cache.

> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: 回复: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2022-01-18 17:27         ` Morten Brørup
@ 2022-01-27  5:24           ` Honnappa Nagarahalli
  2022-01-27 16:45             ` Ananyev, Konstantin
  0 siblings, 1 reply; 67+ messages in thread
From: Honnappa Nagarahalli @ 2022-01-27  5:24 UTC (permalink / raw)
  To: Morten Brørup, thomas, Feifei Wang, Ferruh Yigit
  Cc: dev, nd, Andrew Rybchenko, Qi Zhang, Beilei Xing,
	Honnappa Nagarahalli, nd

<snip>

> 
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Tuesday, 18 January 2022 17.54
> >
> > [quick summary: ethdev API to bypass mempool]
> >
> > 18/01/2022 16:51, Ferruh Yigit:
> > > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > > Morten Brørup <mb@smartsharesystems.com>:
> > > >> The patch provides a significant performance improvement, but I
> > > >> am wondering if any real world applications exist that would use
> > this. Only a
> > > >> "router on a stick" (i.e. a single-port router) comes to my mind,
> > and that is
> > > >> probably sufficient to call it useful in the real world. Do you
> > have any other
> > > >> examples to support the usefulness of this patch?
> > > >>
> > > > One case I have is about network security. For network firewall,
> > all packets need
> > > > to ingress on the specified port and egress on the specified port
> > to do packet filtering.
> > > > In this case, we can know flow direction in advance.
> > >
> > > I also have some concerns on how useful this API will be in real
> > life,
> > > and does the use case worth the complexity it brings.
> > > And it looks too much low level detail for the application.
> >
> > That's difficult to judge.
> > The use case is limited and the API has some severe limitations.
> > The benefit is measured with l3fwd, which is not exactly a real app.
> > Do we want an API which improves performance in limited scenarios at
> > the cost of breaking some general design assumptions?
> >
> > Can we achieve the same level of performance with a mempool trick?
> 
> Perhaps the mbuf library could offer bulk functions for alloc/free of raw
> mbufs - essentially a shortcut directly to the mempool library.
> 
> There might be a few more details to micro-optimize in the mempool library,
> if approached with this use case in mind. E.g. the
> rte_mempool_default_cache() could do with a few unlikely() in its
> comparisons.
> 
> Also, for this use case, the mempool library adds tracing overhead, which this
> API bypasses. And considering how short the code path through the mempool
> cache is, the tracing overhead is relatively much. I.e.: memcpy(NIC->NIC) vs.
> trace() memcpy(NIC->cache) trace() memcpy(cache->NIC).
> 
> A key optimization point could be the number of mbufs being moved to/from
> the mempool cache. If that number was fixed at compile time, a faster
> memcpy() could be used. However, it seems that different PMDs use bursts of
> either 4, 8, or in this case 32 mbufs. If only they could agree on such a simple
> detail.
This patch removes the stores and loads, which saves on backend cycles. I do not think other optimizations can do the same.

> 
> Overall, I strongly agree that it is preferable to optimize the core libraries,
> rather than bypass them. Bypassing will eventually lead to "spaghetti code".
IMO, this is not "spaghetti code". There is no design rule in DPDK that says the RX side must allocate buffers from a mempool or the TX side must free buffers to a mempool. This patch does not break any modular boundaries; for example, it does not access the internal details of another library.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: 回复: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2022-01-27  5:24           ` Honnappa Nagarahalli
@ 2022-01-27 16:45             ` Ananyev, Konstantin
  2022-02-02 19:46               ` Honnappa Nagarahalli
  0 siblings, 1 reply; 67+ messages in thread
From: Ananyev, Konstantin @ 2022-01-27 16:45 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Morten Brørup, thomas, Feifei Wang,
	Yigit, Ferruh
  Cc: dev, nd, Andrew Rybchenko, Zhang, Qi Z, Xing,  Beilei, nd



> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Tuesday, 18 January 2022 17.54
> > >
> > > [quick summary: ethdev API to bypass mempool]
> > >
> > > 18/01/2022 16:51, Ferruh Yigit:
> > > > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > > > Morten Brørup <mb@smartsharesystems.com>:
> > > > >> The patch provides a significant performance improvement, but I
> > > > >> am wondering if any real world applications exist that would use
> > > this. Only a
> > > > >> "router on a stick" (i.e. a single-port router) comes to my mind,
> > > and that is
> > > > >> probably sufficient to call it useful in the real world. Do you
> > > have any other
> > > > >> examples to support the usefulness of this patch?
> > > > >>
> > > > > One case I have is about network security. For network firewall,
> > > all packets need
> > > > > to ingress on the specified port and egress on the specified port
> > > to do packet filtering.
> > > > > In this case, we can know flow direction in advance.
> > > >
> > > > I also have some concerns on how useful this API will be in real
> > > life,
> > > > and does the use case worth the complexity it brings.
> > > > And it looks too much low level detail for the application.
> > >
> > > That's difficult to judge.
> > > The use case is limited and the API has some severe limitations.
> > > The benefit is measured with l3fwd, which is not exactly a real app.
> > > Do we want an API which improves performance in limited scenarios at
> > > the cost of breaking some general design assumptions?
> > >
> > > Can we achieve the same level of performance with a mempool trick?
> >
> > Perhaps the mbuf library could offer bulk functions for alloc/free of raw
> > mbufs - essentially a shortcut directly to the mempool library.
> >
> > There might be a few more details to micro-optimize in the mempool library,
> > if approached with this use case in mind. E.g. the
> > rte_mempool_default_cache() could do with a few unlikely() in its
> > comparisons.
> >
> > Also, for this use case, the mempool library adds tracing overhead, which this
> > API bypasses. And considering how short the code path through the mempool
> > cache is, the tracing overhead is relatively much. I.e.: memcpy(NIC->NIC) vs.
> > trace() memcpy(NIC->cache) trace() memcpy(cache->NIC).
> >
> > A key optimization point could be the number of mbufs being moved to/from
> > the mempool cache. If that number was fixed at compile time, a faster
> > memcpy() could be used. However, it seems that different PMDs use bursts of
> > either 4, 8, or in this case 32 mbufs. If only they could agree on such a simple
> > detail.
> This patch removes the stores and loads which saves on the backend cycles. I do not think, other optimizations can do the same.

My thought here was that we can try to introduce a ZC API for the mempool cache,
similar to the one we have for the ring.
Then on the TX free path we wouldn't need to copy the mbufs to be freed to a temporary array on the stack.
Instead we can put them straight from the TX SW ring into the mempool cache.
That should save an extra store/load per mbuf and might help to achieve some performance gain
without bypassing the mempool.
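
A minimal sketch of such a zero-copy cache put, assuming hypothetical
function names and direct access to the mempool cache internals (a real
implementation would also need to handle flushing when the cache fills up):

#include <rte_mempool.h>

/* Reserve room for n objects directly inside the mempool cache and return a
 * pointer the TX free path can write the mbuf pointers into, avoiding the
 * temporary array on the stack. Hypothetical helper, illustration only. */
static inline void **
mempool_cache_zc_put_start(struct rte_mempool_cache *cache, unsigned int n)
{
	if (cache == NULL || cache->len + n > cache->flushthresh)
		return NULL;	/* caller falls back to the regular put path */
	return &cache->objs[cache->len];
}

static inline void
mempool_cache_zc_put_finish(struct rte_mempool_cache *cache, unsigned int n)
{
	cache->len += n;	/* objects were stored in place by the caller */
}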

> 
> >
> > Overall, I strongly agree that it is preferable to optimize the core libraries,
> > rather than bypass them. Bypassing will eventually lead to "spaghetti code".
> IMO, this is not "spaghetti code". There is no design rule in DPDK that says the RX side must allocate buffers from a mempool or TX side
> must free buffers to a mempool. This patch does not break any modular boundaries. For ex: access internal details of another library.

I also have a few concerns about that approach:
- the proposed implementation breaks the logical boundary between RX/TX code.
  Right now they co-exist independently, and the design of the TX path doesn't directly affect the RX path
  and vice-versa. With the proposed approach the RX path needs to be aware of TX queue details and
  the mbuf freeing strategy. So if we decide to change the TX code, we probably would not be able to do that
  without affecting the RX path.
  That probably can be fixed by formalizing things a bit more by introducing a new dev-ops API:
  eth_dev_tx_queue_free_mbufs(port id, queue id, mbufs_to_free[], ...)
  (see the sketch after this list)
  But that would probably eat up a significant portion of the gain you are seeing right now.

- very limited usage scenario - it will have a positive effect only when we have a fixed forwarding mapping:
  all (or nearly all) packets from the RX queue are forwarded into the same TX queue. 
  Even for l3fwd it doesn’t look like a generic scenario.

- we effectively link RX and TX queues - when this feature is enabled, the user can't stop a TX queue
  without stopping the RX queue first.
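
A rough sketch of how such a dev-ops hook could be formalized (the name is
taken from the suggestion above; the exact signature is an illustrative
assumption, not an existing ethdev op):

/* Driver callback: hand over up to nb_mbufs used Tx buffers from the given
 * Tx queue into the caller-supplied array; returns how many were provided.
 * The Rx re-arm code would then consume this array instead of touching the
 * Tx SW ring directly. */
typedef uint16_t (*eth_tx_queue_free_mbufs_t)(struct rte_eth_dev *dev,
		uint16_t tx_queue_id,
		struct rte_mbuf **mbufs_to_free,
		uint16_t nb_mbufs);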
  
  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2022-01-27  4:06   ` Honnappa Nagarahalli
@ 2022-01-27 17:13     ` Morten Brørup
  2022-01-28 11:29     ` Morten Brørup
  1 sibling, 0 replies; 67+ messages in thread
From: Morten Brørup @ 2022-01-27 17:13 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Ananyev, Konstantin
  Cc: dev, nd, thomas, Feifei Wang, Yigit, Ferruh, Andrew Rybchenko,
	Zhang, Qi Z, Xing,  Beilei

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, 27 January 2022 05.07
> 
> Thanks Morten, appreciate your comments. Few responses inline.
> 
> > -----Original Message-----
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Sunday, December 26, 2021 4:25 AM
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Friday, 24 December 2021 17.46
> > >
> <snip>
> 
> > >
> > > However, this solution poses several constraint:
> > >
> > > 1)The receive queue needs to know which transmit queue it should
> take
> > > the buffers from. The application logic decides which transmit port
> to
> > > use to send out the packets. In many use cases the NIC might have a
> > > single port ([1], [2], [3]), in which case a given transmit queue
> is
> > > always mapped to a single receive queue (1:1 Rx queue: Tx queue).
> This
> > > is easy to configure.
> > >
> > > If the NIC has 2 ports (there are several references), then we will
> > > have
> > > 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> > > However, if this is generalized to 'N' ports, the configuration can
> be
> > > long. More over the PMD would have to scan a list of transmit
> queues
> > > to pull the buffers from.
> >
> > I disagree with the description of this constraint.
> >
> > As I understand it, it doesn't matter now many ports or queues are in
> a NIC or
> > system.
> >
> > The constraint is more narrow:
> >
> > This patch requires that all packets ingressing on some port/queue
> must
> > egress on the specific port/queue that it has been configured to ream
> its
> > buffers from. I.e. an application cannot route packets between
> multiple ports
> > with this patch.
> Agree, this patch as is has this constraint. It is not a constraint
> that would apply for NICs with single port. The above text is
> describing some of the issues associated with generalizing the solution
> for N number of ports. If N is small, the configuration is small and
> scanning should not be bad.
> 

Perhaps we can live with the 1:1 limitation, if that is the primary use case.

Alternatively, the feature could fall back to using the mempool if unable to get/put buffers directly from/to a participating NIC. In this case, I envision a library serving as a shim layer between the NICs and the mempool. In other words: Take a step back from the implementation, and discuss the high level requirements and architecture of the proposed feature.

> >
> > >
> 
> <snip>
> 
> > >
> >
> > You are missing the fourth constraint:
> >
> > 4) The application must transmit all received packets immediately,
> i.e. QoS
> > queueing and similar is prohibited.
> I do not understand this, can you please elaborate?. Even if there is
> QoS queuing, there would be steady stream of packets being transmitted.
> These transmitted packets will fill the buffers on the RX side.

E.g. an appliance may receive packets on a 10 Gbps backbone port, and queue some of the packets up for a customer with a 20 Mbit/s subscription. When there is a large burst of packets towards that subscriber, they will queue up in the QoS queue dedicated to that subscriber. During that traffic burst, there is much more RX than TX. And after the traffic burst, there will be more TX than RX.

> 
> >
> <snip>
> 
> > >
> >
> > The patch provides a significant performance improvement, but I am
> > wondering if any real world applications exist that would use this.
> Only a
> > "router on a stick" (i.e. a single-port router) comes to my mind, and
> that is
> > probably sufficient to call it useful in the real world. Do you have
> any other
> > examples to support the usefulness of this patch?
> SmartNIC is a clear and dominant use case, typically they have a single
> port for data plane traffic (dual ports are mostly for redundancy)
> This patch avoids good amount of store operations. The smaller CPUs
> found in SmartNICs have smaller store buffers which can become
> bottlenecks. Avoiding the lcore cache saves valuable HW cache space.

OK. This is an important use case!

> 
> >
> > Anyway, the patch doesn't do any harm if unused, and the only
> performance
> > cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev
> driver. So I
> > don't oppose to it.
> >
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2022-01-27  4:06   ` Honnappa Nagarahalli
  2022-01-27 17:13     ` Morten Brørup
@ 2022-01-28 11:29     ` Morten Brørup
  1 sibling, 0 replies; 67+ messages in thread
From: Morten Brørup @ 2022-01-28 11:29 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Ananyev, Konstantin
  Cc: dev, nd, thomas, Feifei Wang, Yigit, Ferruh, Andrew Rybchenko,
	Zhang, Qi Z, Xing,  Beilei

> From: Morten Brørup
> Sent: Thursday, 27 January 2022 18.14
> 
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Thursday, 27 January 2022 05.07
> >
> > Thanks Morten, appreciate your comments. Few responses inline.
> >
> > > -----Original Message-----
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: Sunday, December 26, 2021 4:25 AM
> > >
> > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > Sent: Friday, 24 December 2021 17.46
> > > >
> > <snip>
> >
> > > >
> > > > However, this solution poses several constraint:
> > > >
> > > > 1)The receive queue needs to know which transmit queue it should
> > take
> > > > the buffers from. The application logic decides which transmit
> port
> > to
> > > > use to send out the packets. In many use cases the NIC might have
> a
> > > > single port ([1], [2], [3]), in which case a given transmit queue
> > is
> > > > always mapped to a single receive queue (1:1 Rx queue: Tx queue).
> > This
> > > > is easy to configure.
> > > >
> > > > If the NIC has 2 ports (there are several references), then we
> will
> > > > have
> > > > 1:2 (RX queue: TX queue) mapping which is still easy to
> configure.
> > > > However, if this is generalized to 'N' ports, the configuration
> can
> > be
> > > > long. More over the PMD would have to scan a list of transmit
> > queues
> > > > to pull the buffers from.
> > >
> > > I disagree with the description of this constraint.
> > >
> > > As I understand it, it doesn't matter now many ports or queues are
> in
> > a NIC or
> > > system.
> > >
> > > The constraint is more narrow:
> > >
> > > This patch requires that all packets ingressing on some port/queue
> > must
> > > egress on the specific port/queue that it has been configured to
> ream
> > its
> > > buffers from. I.e. an application cannot route packets between
> > multiple ports
> > > with this patch.
> > Agree, this patch as is has this constraint. It is not a constraint
> > that would apply for NICs with single port. The above text is
> > describing some of the issues associated with generalizing the
> solution
> > for N number of ports. If N is small, the configuration is small and
> > scanning should not be bad.

But I think N is the number of queues, not the number of ports.

> >
> 
> Perhaps we can live with the 1:1 limitation, if that is the primary use
> case.

Or some similar limitation for NICs with dual ports for redundancy.

> 
> Alternatively, the feature could fall back to using the mempool if
> unable to get/put buffers directly from/to a participating NIC. In this
> case, I envision a library serving as a shim layer between the NICs and
> the mempool. In other words: Take a step back from the implementation,
> and discuss the high level requirements and architecture of the
> proposed feature.

Please ignore my comment above. I had missed the fact that the direct re-arm feature only works inside a single NIC, and not across multiple NICs. And it is not going to work across multiple NICs, unless they are exactly the same type, because their internal descriptor structures may differ. Also, taking a deeper look at the i40e part of the patch, I notice that it already falls back to using the mempool.

> 
> > >
> > > >
> >
> > <snip>
> >
> > > >
> > >
> > > You are missing the fourth constraint:
> > >
> > > 4) The application must transmit all received packets immediately,
> > i.e. QoS
> > > queueing and similar is prohibited.
> > I do not understand this, can you please elaborate?. Even if there is
> > QoS queuing, there would be steady stream of packets being
> transmitted.
> > These transmitted packets will fill the buffers on the RX side.
> 
> E.g. an appliance may receive packets on a 10 Gbps backbone port, and
> queue some of the packets up for a customer with a 20 Mbit/s
> subscription. When there is a large burst of packets towards that
> subscriber, they will queue up in the QoS queue dedicated to that
> subscriber. During that traffic burst, there is much more RX than TX.
> And after the traffic burst, there will be more TX than RX.
> 
> >
> > >
> > <snip>
> >
> > > >
> > >
> > > The patch provides a significant performance improvement, but I am
> > > wondering if any real world applications exist that would use this.
> > Only a
> > > "router on a stick" (i.e. a single-port router) comes to my mind,
> and
> > that is
> > > probably sufficient to call it useful in the real world. Do you
> have
> > any other
> > > examples to support the usefulness of this patch?
> > SmartNIC is a clear and dominant use case, typically they have a
> single
> > port for data plane traffic (dual ports are mostly for redundancy)
> > This patch avoids good amount of store operations. The smaller CPUs
> > found in SmartNICs have smaller store buffers which can become
> > bottlenecks. Avoiding the lcore cache saves valuable HW cache space.
> 
> OK. This is an important use case!

Some NICs have many queues, so the number of RX/TX queue mappings is big. Aren't SmartNICs going to use many RX/TX queues?

> 
> >
> > >
> > > Anyway, the patch doesn't do any harm if unused, and the only
> > performance
> > > cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev
> > driver. So I
> > > don't oppose to it.

If a PMD maintainer agrees to maintaining such a feature, I don't oppose either.

The PMDs are full of cruft already, so why bother complaining about more, if the performance impact is negligible. :-)


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: 回复: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2022-01-27 16:45             ` Ananyev, Konstantin
@ 2022-02-02 19:46               ` Honnappa Nagarahalli
  0 siblings, 0 replies; 67+ messages in thread
From: Honnappa Nagarahalli @ 2022-02-02 19:46 UTC (permalink / raw)
  To: Ananyev, Konstantin, Morten Brørup, thomas, Feifei Wang,
	Yigit, Ferruh
  Cc: dev, nd, Andrew Rybchenko, Zhang, Qi Z, Xing,  Beilei,
	Honnappa Nagarahalli, nd

<snip>

> 
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > Sent: Tuesday, 18 January 2022 17.54
> > > >
> > > > [quick summary: ethdev API to bypass mempool]
> > > >
> > > > 18/01/2022 16:51, Ferruh Yigit:
> > > > > On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > > > > > Morten Brørup <mb@smartsharesystems.com>:
> > > > > >> The patch provides a significant performance improvement, but
> > > > > >> I am wondering if any real world applications exist that
> > > > > >> would use
> > > > this. Only a
> > > > > >> "router on a stick" (i.e. a single-port router) comes to my
> > > > > >> mind,
> > > > and that is
> > > > > >> probably sufficient to call it useful in the real world. Do
> > > > > >> you
> > > > have any other
> > > > > >> examples to support the usefulness of this patch?
> > > > > >>
> > > > > > One case I have is about network security. For network
> > > > > > firewall,
> > > > all packets need
> > > > > > to ingress on the specified port and egress on the specified
> > > > > > port
> > > > to do packet filtering.
> > > > > > In this case, we can know flow direction in advance.
> > > > >
> > > > > I also have some concerns on how useful this API will be in real
> > > > life,
> > > > > and does the use case worth the complexity it brings.
> > > > > And it looks too much low level detail for the application.
> > > >
> > > > That's difficult to judge.
> > > > The use case is limited and the API has some severe limitations.
> > > > The benefit is measured with l3fwd, which is not exactly a real app.
> > > > Do we want an API which improves performance in limited scenarios
> > > > at the cost of breaking some general design assumptions?
> > > >
> > > > Can we achieve the same level of performance with a mempool trick?
> > >
> > > Perhaps the mbuf library could offer bulk functions for alloc/free
> > > of raw mbufs - essentially a shortcut directly to the mempool library.
> > >
> > > There might be a few more details to micro-optimize in the mempool
> > > library, if approached with this use case in mind. E.g. the
> > > rte_mempool_default_cache() could do with a few unlikely() in its
> > > comparisons.
> > >
> > > Also, for this use case, the mempool library adds tracing overhead,
> > > which this API bypasses. And considering how short the code path
> > > through the mempool cache is, the tracing overhead is relatively much.
> I.e.: memcpy(NIC->NIC) vs.
> > > trace() memcpy(NIC->cache) trace() memcpy(cache->NIC).
> > >
> > > A key optimization point could be the number of mbufs being moved
> > > to/from the mempool cache. If that number was fixed at compile time,
> > > a faster
> > > memcpy() could be used. However, it seems that different PMDs use
> > > bursts of either 4, 8, or in this case 32 mbufs. If only they could
> > > agree on such a simple detail.
> > This patch removes the stores and loads which saves on the backend cycles.
> I do not think, other optimizations can do the same.
> 
> My thought here was that we can try to introduce for mempool-cache ZC API,
> similar to one we have for the ring.
> Then on TX free path we wouldn't need to copy mbufs to be freed to
> temporary array on the stack.
> Instead we can put them straight from TX SW ring to the mempool cache.
> That should save extra store/load for mbuf and might help to achieve some
> performance gain
> without by-passing mempool.
Agree, it will remove one set of loads and stores, but not all of them. I am not sure if it can solve the performance problems. We will give it a try.

> 
> >
> > >
> > > Overall, I strongly agree that it is preferable to optimize the core
> > > libraries, rather than bypass them. Bypassing will eventually lead to
> "spaghetti code".
> > IMO, this is not "spaghetti code". There is no design rule in DPDK
> > that says the RX side must allocate buffers from a mempool or TX side must
> free buffers to a mempool. This patch does not break any modular
> boundaries. For ex: access internal details of another library.
> 
> I also have few concerns about that approach:
> - proposed implementation breaks boundary logical boundary between RX/TX
> code.
>   Right now they co-exist independently, and design of TX path doesn't directly
> affect RX path
>   and visa-versa. With proposed approach RX path need to be aware about TX
> queue details and
>   mbuf freeing strategy. So if we'll decide to change TX code, we probably
> would be able to do that
>   without affecting RX path.
Agree that now both paths will be coupled in the areas you have mentioned. This is happening within the driver code. From the application perspective, they still remain separated. Also, I do not see that the TX free strategy has changed much.

>   That probably can be fixed by formalizing things a bit more by introducing
> new dev-ops API:
>   eth_dev_tx_queue_free_mbufs(port id, queue id, mbufs_to_free[], ...)
>   But that would probably eat-up significant portion of the gain you are seeing
> right now.
> 
> - very limited usage scenario - it will have a positive effect only when we have
Agree, it is limited to a few scenarios. But the scenario itself is a major one.

> a fixed forwarding mapping:
>   all (or nearly all) packets from the RX queue are forwarded into the same TX
> queue.
>   Even for l3fwd it doesn’t look like a generic scenario.
I think it is possible to have some logic (based on the port mask and the routes involved) to enable this feature. We will try to add that in the next version.

> 
> - we effectively link RX and TX queues - when this feature is enabled, user
> can't stop TX queue,
>   without stopping RX queue first.
Agree. How much of an issue is this? I would think when the application is shutting down, one would stop the RX side first. Are there any other scenarios we need to be aware of?

> 
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* 回复: 回复: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2022-01-18 15:51     ` Ferruh Yigit
  2022-01-18 16:53       ` Thomas Monjalon
@ 2023-02-28  6:43       ` Feifei Wang
  2023-02-28  6:52         ` Feifei Wang
  1 sibling, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2023-02-28  6:43 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: dev, nd, thomas, Andrew Rybchenko, Qi Zhang, Beilei Xing,
	Konstantin Ananyev, konstantin.ananyev, Ruifeng Wang,
	Honnappa Nagarahalli, Morten Brørup, nd

Hi, Ferruh

This email summarizes our latest improvement work for direct-rearm, and we
hope it can address some of the concerns about direct-rearm.

Best Regards
Feifei

> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Tuesday, January 18, 2022 11:52 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Morten Brørup
> <mb@smartsharesystems.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; thomas@monjalon.net; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Qi Zhang
> <qi.z.zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>
> Subject: Re: 回复: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive
> side
> 
> On 12/28/2021 6:55 AM, Feifei Wang wrote:
> > Thanks for your comments.
> >
> >> -----Original Message-----
> >> From: Morten Brørup <mb@smartsharesystems.com>
> >> Sent: Sunday, December 26, 2021 6:25 PM
> >> To: Feifei Wang <Feifei.Wang2@arm.com>
> >> Cc: dev@dpdk.org; nd <nd@arm.com>
> >> Subject: RE: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive
> >> side
> >>
> >>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> >>> Sent: Friday, 24 December 2021 17.46
> >>>
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers resulting in 32*8=256B of
> >>> stores to lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of load from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache. i.e. the receive side will free the buffers from
> >>> transmit side directly into it's software ring. This will avoid the
> >>> 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
> >>>
> >>> However, this solution poses several constraint:
> >>>
> >>> 1)The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have
> >>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>> However, if this is generalized to 'N' ports, the configuration can
> >>> be long. More over the PMD would have to scan a list of transmit
> >>> queues to pull the buffers from.
> >>
> >> I disagree with the description of this constraint.
> >>
> >> As I understand it, it doesn't matter now many ports or queues are in
> >> a NIC or system.
> >>
> >> The constraint is more narrow:
> >>
> >> This patch requires that all packets ingressing on some port/queue
> >> must egress on the specific port/queue that it has been configured to
> >> ream its buffers from. I.e. an application cannot route packets
> >> between multiple ports with this patch.
> >
> > First, I agree with that direct-rearm mode is suitable for the case
> > that user should know the direction of the flow in advance and map
> > rx/tx with each other. It is not suitable for the normal packet random route
> case.
> >
> > Second, our proposed two cases: one port NIC and two port NIC means
> > the direction of flow is determined. Furthermore, for two port NIC,
> > there maybe two flow directions: from port 0 to port 1, or from port 0
> > to port 0. Thus we need to have
> > 1:2 (Rx queue :  Tx queue) mapping.
> >
> > At last, maybe we can change our description as follows:
> > "The first constraint is that user should know the direction of the
> > flow in advance, and based on this, user needs to map the Rx and Tx
> queues according to the flow direction:
> > For example, if the NIC just has one port
> >   ......
> > Or if the NIC have two ports
> > ......."
> >
> >>
> >>>
> >>> 2)The other factor that needs to be considered is 'run-to-completion'
> >>> vs
> >>> 'pipeline' models. In the run-to-completion model, the receive side
> >>> and the transmit side are running on the same lcore serially. In the
> >>> pipeline model. The receive side and transmit side might be running
> >>> on different lcores in parallel. This requires locking. This is not
> >>> supported at this point.
> >>>
> >>> 3)Tx and Rx buffers must be from the same mempool. And we also must
> >>> ensure Tx buffer free number is equal to Rx buffer free number:
> >>> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH) Thus,
> 'tx_next_dd'
> >>> can be updated correctly in direct-rearm mode. This is due to
> >>> tx_next_dd is a variable to compute tx sw-ring free location.
> >>> Its value will be one more round than the position where next time
> >>> free starts.
> >>>
> >>
> >> You are missing the fourth constraint:
> >>
> >> 4) The application must transmit all received packets immediately,
> >> i.e. QoS queueing and similar is prohibited.
> >>
> >
> > You are right and this is indeed one of the limitations.
> >
> >>> Current status in this RFC:
> >>> 1)An API is added to allow for mapping a TX queue to a RX queue.
> >>>    Currently it supports 1:1 mapping.
> >>> 2)The i40e driver is changed to do the direct re-arm of the receive
> >>>    side.
> >>> 3)L3fwd application is hacked to do the mapping for the following
> >>> command:
> >>>    one core two flows case:
> >>>    $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
> >>>    -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
> >>>    where:
> >>>    Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
> >>>    Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
> >>>
> >>> Testing status:
> >>> 1)Tested L3fwd with the above command:
> >>> The testing results for L3fwd are as follows:
> >>> -------------------------------------------------------------------
> >>> N1SDP:
> >>> Base performance(with this patch)   with direct re-arm mode enabled
> >>>        0%                                  +14.1%
> >>>
> >>> Ampere Altra:
> >>> Base performance(with this patch)   with direct re-arm mode enabled
> >>>        0%                                  +17.1%
> >>> -------------------------------------------------------------------
> >>> This patch can not affect performance of normal mode, and if enable
> >>> direct-rearm mode, performance can be improved by 14% - 17% in n1sdp
> >>> and ampera-altra.
> >>>
> >>> Feedback requested:
> >>> 1) Has anyone done any similar experiments, any lessons learnt?
> >>> 2) Feedback on API
> >>>
> >>> Next steps:
> >>> 1) Update the code for supporting 1:N(Rx : TX) mapping
> >>> 2) Automate the configuration in L3fwd sample application
> >>>
> >>> Reference:
> >>> [1] https://store.nvidia.com/en-
> >>> us/networking/store/product/MCX623105AN-
> >>>
> >>
> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECrypt
> >> oDisabled
> >>> / [2]
> >>>
> https://www.intel.com/content/www/us/en/products/sku/192561/intel-
> >>> ethernet-network-adapter-e810cqda1/specifications.html
> >>> [3] https://www.broadcom.com/products/ethernet-
> >> connectivity/network-
> >>> adapters/100gb-nic-ocp/n1100g
> >>>
> >>> Feifei Wang (4):
> >>>    net/i40e: enable direct re-arm mode
> >>>    ethdev: add API for direct re-arm mode
> >>>    net/i40e: add direct re-arm mode internal API
> >>>    examples/l3fwd: give an example for direct rearm mode
> >>>
> >>>   drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
> >>>   drivers/net/i40e/i40e_rxtx.h          |   4 +
> >>>   drivers/net/i40e/i40e_rxtx_vec_neon.c | 149
> >> +++++++++++++++++++++++++-
> >>>   examples/l3fwd/main.c                 |   3 +
> >>>   lib/ethdev/ethdev_driver.h            |  15 +++
> >>>   lib/ethdev/rte_ethdev.c               |  14 +++
> >>>   lib/ethdev/rte_ethdev.h               |  31 ++++++
> >>>   lib/ethdev/version.map                |   3 +
> >>>   8 files changed, 251 insertions(+), 2 deletions(-)
> >>>
> >>> --
> >>> 2.25.1
> >>>
> >>
> >> The patch provides a significant performance improvement, but I am
> >> wondering if any real world applications exist that would use this.
> >> Only a "router on a stick" (i.e. a single-port router) comes to my
> >> mind, and that is probably sufficient to call it useful in the real
> >> world. Do you have any other examples to support the usefulness of this
> patch?
> >>
> > One case I have is about network security. For network firewall, all
> > packets need to ingress on the specified port and egress on the specified
> port to do packet filtering.
> > In this case, we can know flow direction in advance.
> >
> 
> I also have some concerns on how useful this API will be in real life, and does
> the use case worth the complexity it brings.
> And it looks too much low level detail for the application.

Concerns of direct rearm:
1. Earlier version of the design required the rxq/txq pairing to be done before
starting the data plane threads. This required the user to know the direction
of the packet flow in advance. This limited the use cases.

In the latest version, direct-rearm mode is packaged as a separate API.
This allows the users to change the rxq/txq pairing in real time in the data plane,
according to the application's analysis of the packet flow, for example:
------------------------------------------------------------------------------------------------------------
Step 1: upper application analyse the flow direction
Step 2: rxq_rearm_data = rte_eth_rx_get_rearm_data(rx_portid, rx_queueid)
Step 3: rte_eth_dev_direct_rearm(rx_portid, rx_queueid, tx_portid, tx_queueid, rxq_rearm_data);
Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
------------------------------------------------------------------------------------------------------------
The above allows the user to change the rxq/txq pairing at runtime, so the user does not need to
know the direction of the flow in advance. This can effectively expand the direct-rearm
use scenarios.

2. Earlier version of direct rearm was breaking the independence between the RX and TX path.
In the latest version, we use a structure to let Rx and Tx interact, for example:
-----------------------------------------------------------------------------------------------------------------------------------
struct rte_eth_rxq_rearm_data {
       struct rte_mbuf **buf_ring; /**< Buffer ring of Rx queue. */
       uint16_t *refill_head;            /**< Head of buffer ring refilling descriptors. */
       uint16_t *receive_tail;          /**< Tail of buffer ring receiving pkts. */
       uint16_t nb_buf;                    /**< configured number of buffer ring. */
}  rxq_rearm_data;

data path:
	/* Get direct-rearm info for a receive queue of an Ethernet device. */
	rxq_rearm_data = rte_eth_rx_get_rearm_data(rx_portid, rx_queueid);
	rte_eth_dev_direct_rearm(rx_portid, rx_queueid, tx_portid, tx_queueid, rxq_rearm_data) {

		/*  Using Tx used buffer to refill Rx buffer ring in direct rearm mode */
		nb_rearm = rte_eth_tx_fill_sw_ring(tx_portid, tx_queueid, rxq_rearm_data );

		/* Flush Rx descriptor in direct rearm mode */
		rte_eth_rx_flush_descs(rx_portid, rx_queuid, nb_rearm) ;
	}
	rte_eth_rx_burst(rx_portid,rx_queueid);
	rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------------------------------------------------------------------
Furthermore, this means direct-rearm usage is no longer limited to the same PMD:
it can support moving buffers between PMDs from different vendors, and can even put the buffers
anywhere into your Rx buffer ring as long as the address of the buffer ring can be provided.
In the latest version, we enable direct-rearm in the i40e and ixgbe PMDs, and also try to
use the i40e driver on Rx and the ixgbe driver on Tx, achieving a 7-9% performance improvement
with direct-rearm.

3. Difference between direct rearm, ZC API used in mempool  and general path
For general path: 
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
                Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For direct_rearm:
                Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus we can see that in one loop, compared to the general path, direct rearm removes 32+32=64 pkts of memcpy;
compared to the ZC API used in the mempool, direct rearm removes 32 pkts of memcpy in each loop.
So direct_rearm has its own benefits.

4. Performance test and real cases
For the performance test, in l3fwd we achieve a performance improvement of up to 15% on an Arm server.
For real cases, we have enabled direct-rearm in VPP and achieved a performance improvement.

> 
> cc'ed a few more folks for comment.
> 
> >> Anyway, the patch doesn't do any harm if unused, and the only
> >> performance cost is the "if (rxq->direct_rxrearm_enable)" branch in
> >> the Ethdev driver. So I don't oppose to it.
> >>
> >


^ permalink raw reply	[flat|nested] 67+ messages in thread

* 回复: 回复: [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side
  2023-02-28  6:43       ` 回复: " Feifei Wang
@ 2023-02-28  6:52         ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-02-28  6:52 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev, nd, nd

CC to the right e-mail address.

> > I also have some concerns on how useful this API will be in real life,
> > and does the use case worth the complexity it brings.
> > And it looks too much low level detail for the application.
> 
> Concerns of direct rearm:
> 1. Earlier version of the design required the rxq/txq pairing to be done
> before starting the data plane threads. This required the user to know the
> direction of the packet flow in advance. This limited the use cases.
> 
> In the latest version, direct-rearm mode is packaged as a separate API.
> This allows for the users to change rxq/txq pairing in real time in data plane,
> according to the analysis of the packet flow by the application, for example:
> ----------------------------------------------------------------------------------------------
> --------------
> Step 1: upper application analyse the flow direction Step 2: rxq_rearm_data =
> rte_eth_rx_get_rearm_data(rx_portid, rx_queueid) Step 3:
> rte_eth_dev_direct_rearm(rx_portid, rx_queueid, tx_portid, tx_queueid,
> rxq_rearm_data); Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
> Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
> ----------------------------------------------------------------------------------------------
> --------------
> Above can support user to change rxq/txq pairing  at runtime and user does
> not need to know the direction of flow in advance. This can effectively
> expand direct-rearm use scenarios.
> 
> 2. Earlier version of direct rearm was breaking the independence between
> the RX and TX path.
> In the latest version, we use a structure to let Rx and Tx interact, for example:
> ----------------------------------------------------------------------------------------------
> -------------------------------------
> struct rte_eth_rxq_rearm_data {
>        struct rte_mbuf **buf_ring; /**< Buffer ring of Rx queue. */
>        uint16_t *refill_head;            /**< Head of buffer ring refilling descriptors.
> */
>        uint16_t *receive_tail;          /**< Tail of buffer ring receiving pkts. */
>        uint16_t nb_buf;                    /**< configured number of buffer ring. */
> }  rxq_rearm_data;
> 
> data path:
> 	/* Get direct-rearm info for a receive queue of an Ethernet device.
> */
> 	rxq_rearm_data = rte_eth_rx_get_rearm_data(rx_portid,
> rx_queueid);
> 	rte_eth_dev_direct_rearm(rx_portid, rx_queueid, tx_portid,
> tx_queueid, rxq_rearm_data) {
> 
> 		/*  Using Tx used buffer to refill Rx buffer ring in direct rearm
> mode */
> 		nb_rearm = rte_eth_tx_fill_sw_ring(tx_portid, tx_queueid,
> rxq_rearm_data );
> 
> 		/* Flush Rx descriptor in direct rearm mode */
> 		rte_eth_rx_flush_descs(rx_portid, rx_queuid, nb_rearm) ;
> 	}
> 	rte_eth_rx_burst(rx_portid,rx_queueid);
> 	rte_eth_tx_burst(tx_portid,tx_queueid);
> ----------------------------------------------------------------------------------------------
> -------------------------------------
> Furthermore, this let direct-rearm usage no longer limited to the same pmd,
> it can support moving buffers between different vendor pmds, even can put
> the buffer anywhere into your Rx buffer ring as long as the address of the
> buffer ring can be provided.
> In the latest version, we enable direct-rearm in i40e pmd and ixgbe pmd, and
> also try to use i40e driver in Rx, ixgbe driver in Tx, and then achieve 7-9%
> performance improvement by direct-rearm.
> 
> 3. Difference between direct rearm, ZC API used in mempool  and general
> path For general path:
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts
> memcpy from temporary variable to mempool cache For ZC API used in
> mempool:
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
>                 Refer link:
> http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-
> kamalakshitha.aligeri@arm.com/
> For direct_rearm:
>                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring Thus we can
> see in the one loop, compared to general path direct rearm reduce 32+32=64
> pkts memcpy; Compared to ZC API used in mempool, we can see direct
> rearm reduce 32 pkts memcpy in each loop.
> So, direct_rearm has its own benefits.
> 
> 4. Performance test and real cases
> For performance test, in l3fwd, we achieve the performance improvement
> of up to 15% in Arm server.
> For real cases, we have enabled direct-rearm in vpp and achieved
> performance improvement.
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v4 0/3] Recycle buffers from Tx to Rx
  2021-12-24 16:46 [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Feifei Wang
                   ` (4 preceding siblings ...)
  2021-12-26 10:25 ` [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Morten Brørup
@ 2023-03-23 10:43 ` Feifei Wang
  2023-03-23 10:43   ` [PATCH v4 1/3] ethdev: add API for buffer recycle mode Feifei Wang
                     ` (2 more replies)
  2023-03-30  6:29 ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Feifei Wang
  2023-05-25  9:45 ` [PATCH v6 0/4] Recycle mbufs from Tx queue to Rx queue Feifei Wang
  7 siblings, 3 replies; 67+ messages in thread
From: Feifei Wang @ 2023-03-23 10:43 UTC (permalink / raw)
  Cc: dev, mb, konstantin.v.ananyev, nd, Feifei Wang

Note: this version is just for community review, not the final version.

Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.

This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache. i.e. the receive side will free the buffers from
transmit side directly into its software ring. This will avoid the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache. And we can call this mode as buffer
recycle mode.

In the latest version, buffer recycle mode is packaged as a separate API.
This allows the users to change the rxq/txq pairing in real time in the data plane,
according to the application's analysis of the packet flow, for example:
-----------------------------------------------------------------------
Step 1: upper application analyse the flow direction
Step 2: rxq_buf_recycle_info = rte_eth_rx_buf_recycle_info_get(rx_portid, rx_queueid)
Step 3: rte_eth_dev_buf_recycle(rx_portid, rx_queueid, tx_portid, tx_queueid, rxq_buf_recycle_info);
Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------
The above allows the user to change the rxq/txq pairing at runtime, so the user does not need to
know the direction of the flow in advance. This can effectively expand buffer recycle mode's
use scenarios.
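
A minimal forwarding-loop sketch of the steps above, using the API names from
this cover letter (the struct type name, return convention and helper
parameters are illustrative assumptions for this RFC, not a released DPDK API):
-----------------------------------------------------------------------
#include <stdbool.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
recycle_fwd_loop(uint16_t rx_portid, uint16_t rx_queueid,
		 uint16_t tx_portid, uint16_t tx_queueid,
		 volatile bool *force_quit)
{
	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info;
	struct rte_mbuf *pkts[32];
	uint16_t nb_rx, nb_tx;

	/* Step 2: fetch the pairing data at runtime, after the application
	 * has analysed the flow direction (Step 1). */
	rxq_buf_recycle_info = rte_eth_rx_buf_recycle_info_get(rx_portid, rx_queueid);

	while (!*force_quit) {
		/* Step 3: move used Tx buffers straight into the paired Rx ring. */
		rte_eth_dev_buf_recycle(rx_portid, rx_queueid,
					tx_portid, tx_queueid,
					rxq_buf_recycle_info);

		/* Steps 4 and 5: the usual receive and transmit bursts. */
		nb_rx = rte_eth_rx_burst(rx_portid, rx_queueid, pkts, 32);
		if (nb_rx == 0)
			continue;
		nb_tx = rte_eth_tx_burst(tx_portid, tx_queueid, pkts, nb_rx);
		while (nb_tx < nb_rx)
			rte_pktmbuf_free(pkts[nb_tx++]);
	}
}
-----------------------------------------------------------------------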

Furthermore, buffer recycle mode is no longer limited to the same PMD:
it can support moving buffers between PMDs from different vendors, and can even put the buffers
anywhere into your Rx buffer ring as long as the address of the buffer ring can be provided.
In the latest version, we enable buffer recycle in the i40e and ixgbe PMDs, and also try to
use the i40e driver on Rx and the ixgbe driver on Tx, achieving a 7-9% performance improvement
with buffer recycle mode.

Difference between buffer recycle, ZC API used in mempool and general path
For general path: 
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
                Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For buffer recycle:
                Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus we can see that in one loop, compared to the general path, buffer recycle removes 32+32=64 pkts of memcpy;
compared to the ZC API used in the mempool, buffer recycle removes 32 pkts of memcpy in each loop.
So buffer recycle has its own benefits.

Testing status:
(1) dpdk l3fwd test with multiple drivers:
    port 0: 82599 NIC   port 1: XL710 NIC
-------------------------------------------------------------
		Without fast free	With fast free
Thunderx2:      +7.53%	                +13.54%
-------------------------------------------------------------

(2) dpdk l3fwd test with same driver:
    port 0 && 1: XL710 NIC
-------------------------------------------------------------
*Direct rearm with exposing rx_sw_ring:
		Without fast free	With fast free
Ampere altra:   +12.61%		        +11.42%
n1sdp:		+8.30%			+3.85%
-------------------------------------------------------------

V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)

V3:
1. Separate Rx and Tx operations with two APIs in direct-rearm (Konstantin)
2. Delete L3fwd change for direct rearm (Jerin)
3. enable direct rearm in ixgbe driver in Arm

v4:
1. Rename direct-rearm as buffer recycle. Based on this, function names
and variable names are changed to make this mode more general for all
drivers. (Konstantin, Morten)

2. Add ring wrapping check (Konstantin)


Feifei Wang (3):
  ethdev: add API for buffer recycle mode
  net/i40e: implement recycle buffer mode
  net/ixgbe: implement recycle buffer mode

 drivers/net/i40e/i40e_ethdev.c            |   1 +
 drivers/net/i40e/i40e_ethdev.h            |   2 +
 drivers/net/i40e/i40e_rxtx.c              |  24 +++
 drivers/net/i40e/i40e_rxtx.h              |   4 +
 drivers/net/i40e/i40e_rxtx_vec_common.h   | 128 ++++++++++++
 drivers/net/ixgbe/ixgbe_ethdev.c          |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h          |   3 +
 drivers/net/ixgbe/ixgbe_rxtx.c            |  25 +++
 drivers/net/ixgbe/ixgbe_rxtx.h            |   4 +
 drivers/net/ixgbe/ixgbe_rxtx_vec_common.h | 127 ++++++++++++
 lib/ethdev/ethdev_driver.h                |  10 +
 lib/ethdev/ethdev_private.c               |   2 +
 lib/ethdev/rte_ethdev.c                   |  37 ++++
 lib/ethdev/rte_ethdev.h                   | 230 ++++++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h              |  11 ++
 lib/ethdev/version.map                    |   6 +
 16 files changed, 615 insertions(+)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v4 1/3] ethdev: add API for buffer recycle mode
  2023-03-23 10:43 ` [PATCH v4 0/3] Recycle buffers from Tx to Rx Feifei Wang
@ 2023-03-23 10:43   ` Feifei Wang
  2023-03-23 11:41     ` Morten Brørup
  2023-03-23 10:43   ` [PATCH v4 2/3] net/i40e: implement recycle buffer mode Feifei Wang
  2023-03-23 10:43   ` [PATCH v4 3/3] net/ixgbe: " Feifei Wang
  2 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2023-03-23 10:43 UTC (permalink / raw)
  To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, mb, konstantin.v.ananyev, nd, Feifei Wang,
	Honnappa Nagarahalli, Ruifeng Wang

There are 4 upper APIs for buffer recycle mode:
1. 'rte_eth_rx_queue_buf_recycle_info_get'
This retrieves buffer ring information about a given port's Rx
queue in buffer recycle mode. Thanks to this, buffer recycle is
no longer limited to using the same driver for Rx and Tx.

2. 'rte_eth_dev_buf_recycle'
Users can call this API to enable buffer recycle mode in the data path.
It calls 2 internal APIs, one for Rx and one for Tx.

3. 'rte_eth_tx_buf_stash'
Internal API for buffer recycle mode. This is to stash Tx used
buffers into Rx buffer ring.

4. 'rte_eth_rx_descriptors_refill'
Internal API for buffer recycle mode. This is to refill Rx
descriptors.

All the APIs above are implemented only at the upper (ethdev) level.
Each driver needs to define its own specific functions for them.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/ethdev/ethdev_driver.h   |  10 ++
 lib/ethdev/ethdev_private.c  |   2 +
 lib/ethdev/rte_ethdev.c      |  37 ++++++
 lib/ethdev/rte_ethdev.h      | 230 +++++++++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h |  11 ++
 lib/ethdev/version.map       |   6 +
 6 files changed, 296 insertions(+)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 2c9d615fb5..412f064975 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -59,6 +59,10 @@ struct rte_eth_dev {
 	eth_rx_descriptor_status_t rx_descriptor_status;
 	/** Check the status of a Tx descriptor */
 	eth_tx_descriptor_status_t tx_descriptor_status;
+	/** Stash Tx used buffers into RX ring in buffer recycle mode */
+	eth_tx_buf_stash_t tx_buf_stash;
+	/** Refill Rx descriptors in buffer recycle mode */
+	eth_rx_descriptors_refill_t rx_descriptors_refill;
 
 	/**
 	 * Device data that is shared between primary and secondary processes
@@ -504,6 +508,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
 typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
 	uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
 
+typedef void (*eth_rxq_buf_recycle_info_get_t)(struct rte_eth_dev *dev,
+	uint16_t rx_queue_id,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+
 typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
 	uint16_t queue_id, struct rte_eth_burst_mode *mode);
 
@@ -1247,6 +1255,8 @@ struct eth_dev_ops {
 	eth_rxq_info_get_t         rxq_info_get;
 	/** Retrieve Tx queue information */
 	eth_txq_info_get_t         txq_info_get;
+	/** Get Rx queue buffer recycle information */
+	eth_rxq_buf_recycle_info_get_t rxq_buf_recycle_info_get;
 	eth_burst_mode_get_t       rx_burst_mode_get; /**< Get Rx burst mode */
 	eth_burst_mode_get_t       tx_burst_mode_get; /**< Get Tx burst mode */
 	eth_fw_version_get_t       fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 14ec8c6ccf..f8d0ae9226 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
 	fpo->rx_queue_count = dev->rx_queue_count;
 	fpo->rx_descriptor_status = dev->rx_descriptor_status;
 	fpo->tx_descriptor_status = dev->tx_descriptor_status;
+	fpo->tx_buf_stash = dev->tx_buf_stash;
+	fpo->rx_descriptors_refill = dev->rx_descriptors_refill;
 
 	fpo->rxq.data = dev->data->rx_queues;
 	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 4d03255683..2e049f2176 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -5784,6 +5784,43 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 	return 0;
 }
 
+int
+rte_eth_rx_queue_buf_recycle_info_get(uint16_t port_id, uint16_t queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	if (rxq_buf_recycle_info == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Cannot get ethdev port %u Rx queue %u buf_recycle_info to NULL\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+
+	if (dev->data->rx_queues == NULL ||
+			dev->data->rx_queues[queue_id] == NULL) {
+		RTE_ETHDEV_LOG(ERR,
+			   "Rx queue %"PRIu16" of device with port_id=%"
+			   PRIu16" has not been setup\n",
+			   queue_id, port_id);
+		return -EINVAL;
+	}
+
+	if (*dev->dev_ops->rxq_buf_recycle_info_get == NULL)
+		return -ENOTSUP;
+
+	dev->dev_ops->rxq_buf_recycle_info_get(dev, queue_id, rxq_buf_recycle_info);
+
+	return 0;
+}
+
 int
 rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 			  struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 99fe9e238b..977075782e 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1820,6 +1820,29 @@ struct rte_eth_txq_info {
 	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
 } __rte_cache_min_aligned;
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice.
+ *
+ * Ethernet device Rx queue buffer ring information structure in buffer recycle mode.
+ * Used to retrieve Rx queue buffer ring information when the Tx queue stashes used buffers
+ * into the Rx buffer ring.
+ */
+struct rte_eth_rxq_buf_recycle_info {
+	struct rte_mbuf **buf_ring; /**< buffer ring of Rx queue. */
+	struct rte_mempool *mp;     /**< mempool of Rx queue. */
+	uint16_t *refill_head;      /**< head of buffer ring refilling descriptors. */
+	uint16_t *receive_tail;     /**< tail of buffer ring receiving pkts. */
+	uint16_t buf_ring_size;     /**< configured size of the buffer ring. */
+	/**
+	 *  Requested number of Rx refill buffers.
+	 *   For some PMD drivers, the number of Rx refill buffers should be aligned with
+	 *   the buffer ring size. This is to simplify ring wraparound.
+	 *   Value 0 means there is no such requirement.
+	 */
+	uint16_t refill_request;
+} __rte_cache_min_aligned;
+
 /* Generic Burst mode flag definition, values can be ORed. */
 
 /**
@@ -4809,6 +4832,32 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Retrieve buffer ring information about a given port's Rx queue in buffer recycle
+ * mode.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which buffer ring information
+ *   will be retrieved.
+ * @param rxq_buf_recycle_info
+ *   A pointer to a structure of type *rte_eth_rxq_buf_recycle_info* to be filled.
+ *
+ * @return
+ *   - 0: Success
+ *   - -ENODEV:  If *port_id* is invalid.
+ *   - -ENOTSUP: routine is not supported by the device PMD.
+ *   - -EINVAL:  The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_rx_queue_buf_recycle_info_get(uint16_t port_id,
+		uint16_t queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+
 /**
  * Retrieve information about the Rx packet burst mode.
  *
@@ -5987,6 +6036,71 @@ rte_eth_rx_queue_count(uint16_t port_id, uint16_t queue_id)
 	return (int)(*p->rx_queue_count)(qd);
 }
 
+/**
+ * @internal
+ * Rx routine for rte_eth_dev_buf_recycle().
+ * Refill Rx descriptors in buffer recycle mode.
+ *
+ * @note
+ * This API can only be called by rte_eth_dev_buf_recycle().
+ * Before calling this API, rte_eth_tx_buf_stash() should be
+ * called to stash Tx used buffers into Rx buffer ring.
+ *
+ * When this functionality is not implemented in the driver, the return
+ * buffer number is 0.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The index of the receive queue.
+ *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param nb
+ *   The number of Rx descriptors to be refilled.
+ * @return
+ *   The number of Rx descriptors actually refilled.
+ *   - ENODEV: bad port or queue (only if compiled with debug).
+ */
+static inline uint16_t rte_eth_rx_descriptors_refill(uint16_t port_id,
+		uint16_t queue_id, uint16_t nb)
+{
+	struct rte_eth_fp_ops *p;
+	void *qd;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+	if (port_id >= RTE_MAX_ETHPORTS ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		RTE_ETHDEV_LOG(ERR,
+			"Invalid port_id=%u or queue_id=%u\n",
+			port_id, queue_id);
+		rte_errno = ENODEV;
+		return 0;
+	}
+#endif
+
+	p = &rte_eth_fp_ops[port_id];
+	qd = p->rxq.data[queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+	if (!rte_eth_dev_is_valid_port(port_id)) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx port_id=%u\n", port_id);
+		rte_errno = ENODEV;
+		return 0;
+	}
+	if (qd == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+			queue_id, port_id);
+		rte_errno = ENODEV;
+		return 0;
+	}
+#endif
+
+	if (!p->rx_descriptors_refill)
+		return 0;
+
+	return p->rx_descriptors_refill(qd, nb);
+}
+
 /**@{@name Rx hardware descriptor states
  * @see rte_eth_rx_descriptor_status
  */
@@ -6483,6 +6597,122 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
 	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
 }
 
+/**
+ * @internal
+ * Tx routine for rte_eth_dev_buf_recycle().
+ * Stash Tx used buffers into Rx buffer ring in buffer recycle mode.
+ *
+ * @note
+ * This API can only be called by rte_eth_dev_buf_recycle().
+ * After calling this API, rte_eth_rx_descriptors_refill() should be
+ * called to refill Rx ring descriptors.
+ *
+ * When this functionality is not implemented in the driver, the return
+ * buffer number is 0.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The index of the transmit queue.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param rxq_buf_recycle_info
+ *   A pointer to a structure of Rx queue buffer ring information in buffer
+ *   recycle mode.
+ *
+ * @return
+ *   The number of buffers actually stashed into the Rx buffer ring.
+ *   - ENODEV: bad port or queue (only if compiled with debug).
+ */
+static inline uint16_t rte_eth_tx_buf_stash(uint16_t port_id, uint16_t queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct rte_eth_fp_ops *p;
+	void *qd;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+	if (port_id >= RTE_MAX_ETHPORTS ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		RTE_ETHDEV_LOG(ERR,
+			"Invalid port_id=%u or queue_id=%u\n",
+			port_id, queue_id);
+		rte_errno = ENODEV;
+		return 0;
+	}
+#endif
+
+	p = &rte_eth_fp_ops[port_id];
+	qd = p->txq.data[queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+	if (!rte_eth_dev_is_valid_port(port_id)) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Tx port_id=%u\n", port_id);
+		rte_errno = ENODEV;
+		return 0;
+	}
+	if (qd == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+			queue_id, port_id);
+		rte_errno = ENODEV;
+		return 0;
+	}
+#endif
+
+	if (p->tx_buf_stash == NULL)
+		return 0;
+
+	return p->tx_buf_stash(qd, rxq_buf_recycle_info);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Buffer recycle mode lets the Tx queue put used buffers directly into the Rx
+ * buffer ring. This avoids freeing buffers to the mempool and allocating
+ * buffers from the mempool.
+ *
+ * @param rx_port_id
+ *   Port identifying the receive side.
+ * @param rx_queue_id
+ *   The index of the receive queue identifying the receive side.
+ *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param tx_port_id
+ *   Port identifying the transmit side.
+ * @param tx_queue_id
+ *   The index of the transmit queue identifying the transmit side.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param rxq_buf_recycle_info
+ *   A pointer to a structure of type *rte_eth_rxq_buf_recycle_info* filled earlier.
+ * @return
+ *   - (0) on success or no recycling buffer.
+ *   - (-EINVAL) rxq_buf_recycle_info is NULL.
+ */
+__rte_experimental
+static inline int
+rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
+		uint16_t tx_port_id, uint16_t tx_queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	/* The number of recycling buffers. */
+	uint16_t nb_buf;
+
+	if (!rxq_buf_recycle_info)
+		return -EINVAL;
+
+	/* Stash Tx used buffers into Rx buffer ring */
+	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
+				rxq_buf_recycle_info);
+	/* If there are recycling buffers, refill Rx queue descriptors. */
+	if (nb_buf)
+		rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
+					nb_buf);
+
+	return 0;
+}
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index dcf8adab92..10f9d5cbe7 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
 /** @internal Check the status of a Tx descriptor */
 typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
 
+/** @internal Stash Tx used buffers into RX ring in buffer recycle mode */
+typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+
+/** @internal Refill Rx descriptors in buffer recycle mode */
+typedef uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
+
 /**
  * @internal
  * Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -90,6 +97,8 @@ struct rte_eth_fp_ops {
 	eth_rx_queue_count_t rx_queue_count;
 	/** Check the status of a Rx descriptor. */
 	eth_rx_descriptor_status_t rx_descriptor_status;
+	/** Refill Rx descriptors in buffer recycle mode */
+	eth_rx_descriptors_refill_t rx_descriptors_refill;
 	/** Rx queues data. */
 	struct rte_ethdev_qdata rxq;
 	uintptr_t reserved1[3];
@@ -106,6 +115,8 @@ struct rte_eth_fp_ops {
 	eth_tx_prep_t tx_pkt_prepare;
 	/** Check the status of a Tx descriptor. */
 	eth_tx_descriptor_status_t tx_descriptor_status;
+	/** Stash Tx used buffers into RX ring in buffer recycle mode */
+	eth_tx_buf_stash_t tx_buf_stash;
 	/** Tx queues data. */
 	struct rte_ethdev_qdata txq;
 	uintptr_t reserved2[3];
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 357d1a88c0..8a4b1dac80 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -299,6 +299,10 @@ EXPERIMENTAL {
 	rte_flow_action_handle_query_update;
 	rte_flow_async_action_handle_query_update;
 	rte_flow_async_create_by_index;
+
+	# added in 23.07
+	rte_eth_dev_buf_recycle;
+	rte_eth_rx_queue_buf_recycle_info_get;
 };
 
 INTERNAL {
@@ -328,4 +332,6 @@ INTERNAL {
 	rte_eth_representor_id_get;
 	rte_eth_switch_domain_alloc;
 	rte_eth_switch_domain_free;
+	rte_eth_tx_buf_stash;
+	rte_eth_rx_descriptors_refill;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v4 2/3] net/i40e: implement recycle buffer mode
  2023-03-23 10:43 ` [PATCH v4 0/3] Recycle buffers from Tx to Rx Feifei Wang
  2023-03-23 10:43   ` [PATCH v4 1/3] ethdev: add API for buffer recycle mode Feifei Wang
@ 2023-03-23 10:43   ` Feifei Wang
  2023-03-23 10:43   ` [PATCH v4 3/3] net/ixgbe: " Feifei Wang
  2 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-03-23 10:43 UTC (permalink / raw)
  To: Yuying Zhang, Beilei Xing
  Cc: dev, mb, konstantin.v.ananyev, nd, Feifei Wang,
	Honnappa Nagarahalli, Ruifeng Wang

Define the specific function implementations for the i40e driver.
Currently, recycle buffer mode supports the 128-bit
vector path, and can be enabled in both fast-free and
no-fast-free modes.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/i40e/i40e_ethdev.c          |   1 +
 drivers/net/i40e/i40e_ethdev.h          |   2 +
 drivers/net/i40e/i40e_rxtx.c            |  24 +++++
 drivers/net/i40e/i40e_rxtx.h            |   4 +
 drivers/net/i40e/i40e_rxtx_vec_common.h | 128 ++++++++++++++++++++++++
 5 files changed, 159 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index cb0070f94b..456fb256f5 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.flow_ops_get                 = i40e_dev_flow_ops_get,
 	.rxq_info_get                 = i40e_rxq_info_get,
 	.txq_info_get                 = i40e_txq_info_get,
+	.rxq_buf_recycle_info_get     = i40e_rxq_buf_recycle_info_get,
 	.rx_burst_mode_get            = i40e_rx_burst_mode_get,
 	.tx_burst_mode_get            = i40e_tx_burst_mode_get,
 	.timesync_enable              = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 9b806d130e..83c5ff5859 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	struct rte_eth_rxq_info *qinfo);
 void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
+void i40e_rxq_buf_recycle_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
 int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
 			   struct rte_eth_burst_mode *mode);
 int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..479964c6c4 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3197,6 +3197,28 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	qinfo->conf.offloads = txq->offloads;
 }
 
+void
+i40e_rxq_buf_recycle_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct i40e_rx_queue *rxq;
+
+	rxq = dev->data->rx_queues[queue_id];
+
+	rxq_buf_recycle_info->buf_ring = (void *)rxq->sw_ring;
+	rxq_buf_recycle_info->mp = rxq->mp;
+	rxq_buf_recycle_info->buf_ring_size = rxq->nb_rx_desc;
+	rxq_buf_recycle_info->refill_request = RTE_I40E_RXQ_REARM_THRESH;
+
+#if RTE_BYTE_ORDER == RTE_BIG_ENDIAN
+	rxq_buf_recycle_info->refill_head = &rxq->rxrearm_start + 0xF;
+	rxq_buf_recycle_info->receive_tail = &rxq->rx_tail + 0xF;
+#else
+	rxq_buf_recycle_info->refill_head = &rxq->rxrearm_start;
+	rxq_buf_recycle_info->receive_tail = &rxq->rx_tail;
+#endif
+}
+
 #ifdef RTE_ARCH_X86
 static inline bool
 get_avx_supported(bool request_avx512)
@@ -3273,6 +3295,7 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
 
 	if (ad->rx_vec_allowed  &&
 	    rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128) {
+		dev->rx_descriptors_refill = i40e_rx_descriptors_refill_vec;
 #ifdef RTE_ARCH_X86
 		if (dev->data->scattered_rx) {
 			if (ad->rx_use_avx512) {
@@ -3465,6 +3488,7 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
 	if (ad->tx_simple_allowed) {
 		if (ad->tx_vec_allowed &&
 		    rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128) {
+			dev->tx_buf_stash = i40e_tx_buf_stash_vec;
 #ifdef RTE_ARCH_X86
 			if (ad->tx_use_avx512) {
 #ifdef CC_AVX512_SUPPORT
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 5e6eecc501..0ad8f530f9 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -233,6 +233,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
 int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
 int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
 
+uint16_t i40e_tx_buf_stash_vec(void *tx_queue,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+uint16_t i40e_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb);
+
 uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			    uint16_t nb_pkts);
 uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..068ce694f2 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -156,6 +156,134 @@ tx_backlog_entry(struct i40e_tx_entry *txep,
 		txep[i].mbuf = tx_pkts[i];
 }
 
+uint16_t
+i40e_tx_buf_stash_vec(void *tx_queue,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct i40e_tx_queue *txq = tx_queue;
+	struct i40e_tx_entry *txep;
+	struct rte_mbuf **rxep;
+	struct rte_mbuf *m[RTE_I40E_TX_MAX_FREE_BUF_SZ];
+	int i, j, n;
+	uint16_t avail = 0;
+	uint16_t buf_ring_size = rxq_buf_recycle_info->buf_ring_size;
+	uint16_t mask = rxq_buf_recycle_info->buf_ring_size - 1;
+	uint16_t refill_request = rxq_buf_recycle_info->refill_request;
+	uint16_t refill_head = *rxq_buf_recycle_info->refill_head;
+	uint16_t receive_tail = *rxq_buf_recycle_info->receive_tail;
+
+	/* Get available recycling Rx buffers. */
+	avail = (buf_ring_size - (refill_head - receive_tail)) & mask;
+
+	/* Check Tx free thresh and Rx available space. */
+	if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+		return 0;
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+			rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	/* Buffer recycle does not support ring buffer wraparound.
+	 * There are two cases for this:
+	 *
+	 * case 1: The refill head of Rx buffer ring needs to be aligned with
+	 * buffer ring size. In this case, the number of Tx freeing buffers
+	 * should be equal to refill_request.
+	 *
+	 * case 2: The refill head of Rx ring buffer does not need to be aligned
+	 * with buffer ring size. In this case, the update of refill head can not
+	 * exceed the Rx buffer ring size.
+	 */
+	if (refill_request != n ||
+		(!refill_request && (refill_head + n > buf_ring_size)))
+		return 0;
+
+	/* First buffer to free from S/W ring is at index
+	 * tx_next_dd - (tx_rs_thresh-1).
+	 */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	rxep = rxq_buf_recycle_info->buf_ring;
+	rxep += refill_head;
+
+	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+		/* Directly put mbufs from Tx to Rx. */
+		for (i = 0; i < n; i++, rxep++, txep++)
+			*rxep = txep[0].mbuf;
+	} else {
+		for (i = 0, j = 0; i < n; i++) {
+			/* Bail out if txq contains buffers from an unexpected mempool. */
+			if (unlikely(rxq_buf_recycle_info->mp
+						!= txep[i].mbuf->pool))
+				return 0;
+
+			m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+			/* In case 1, each of Tx buffers should be the
+			 * last reference.
+			 */
+			if (unlikely(m[j] == NULL && refill_request))
+				return 0;
+			/* In case 2, the number of valid Tx free
+			 * buffers should be recorded.
+			 */
+			j++;
+		}
+		rte_memcpy(rxep, m, sizeof(void *) * j);
+	}
+
+	/* Update counters for Tx. */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return n;
+}
+
+uint16_t
+i40e_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	struct i40e_rx_entry *rxep;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t rx_id;
+	uint64_t paddr;
+	uint64_t dma_addr;
+	uint16_t i;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	for (i = 0; i < nb; i++) {
+		/* Initialize rxdp descs. */
+		paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr = rte_cpu_to_le_64(paddr);
+		/* flush desc with pa dma_addr */
+		rxdp[i].read.hdr_addr = 0;
+		rxdp[i].read.pkt_addr = dma_addr;
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += nb;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= nb;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			(rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	rte_io_wmb();
+	/* Update the tail pointer on the NIC */
+	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+
+	return nb;
+}
+
 static inline void
 _i40e_rx_queue_release_mbufs_vec(struct i40e_rx_queue *rxq)
 {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v4 3/3] net/ixgbe: implement recycle buffer mode
  2023-03-23 10:43 ` [PATCH v4 0/3] Recycle buffers from Tx to Rx Feifei Wang
  2023-03-23 10:43   ` [PATCH v4 1/3] ethdev: add API for buffer recycle mode Feifei Wang
  2023-03-23 10:43   ` [PATCH v4 2/3] net/i40e: implement recycle buffer mode Feifei Wang
@ 2023-03-23 10:43   ` Feifei Wang
  2 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-03-23 10:43 UTC (permalink / raw)
  To: Qiming Yang, Wenjun Wu
  Cc: dev, mb, konstantin.v.ananyev, nd, Feifei Wang,
	Honnappa Nagarahalli, Ruifeng Wang

Define the specific function implementations for the ixgbe driver.
Currently, recycle buffer mode supports the 128-bit
vector path, and can be enabled in both fast-free and
no-fast-free modes.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c          |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h          |   3 +
 drivers/net/ixgbe/ixgbe_rxtx.c            |  25 +++++
 drivers/net/ixgbe/ixgbe_rxtx.h            |   4 +
 drivers/net/ixgbe/ixgbe_rxtx_vec_common.h | 127 ++++++++++++++++++++++
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 88118bc305..3bada9abbd 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -543,6 +543,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.set_mc_addr_list     = ixgbe_dev_set_mc_addr_list,
 	.rxq_info_get         = ixgbe_rxq_info_get,
 	.txq_info_get         = ixgbe_txq_info_get,
+	.rxq_buf_recycle_info_get = ixgbe_rxq_buf_recycle_info_get,
 	.timesync_enable      = ixgbe_timesync_enable,
 	.timesync_disable     = ixgbe_timesync_disable,
 	.timesync_read_rx_timestamp = ixgbe_timesync_read_rx_timestamp,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 48290af512..ca6aa0da64 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -625,6 +625,9 @@ void ixgbe_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 void ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
 
+void ixgbe_rxq_buf_recycle_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+
 int ixgbevf_dev_rx_init(struct rte_eth_dev *dev);
 
 void ixgbevf_dev_tx_init(struct rte_eth_dev *dev);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index c9d6ca9efe..ad276cbf33 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2558,6 +2558,7 @@ ixgbe_set_tx_function(struct rte_eth_dev *dev, struct ixgbe_tx_queue *txq)
 				(rte_eal_process_type() != RTE_PROC_PRIMARY ||
 					ixgbe_txq_vec_setup(txq) == 0)) {
 			PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
+			dev->tx_buf_stash = ixgbe_tx_buf_stash_vec;
 			dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
 		} else
 		dev->tx_pkt_burst = ixgbe_xmit_pkts_simple;
@@ -4852,6 +4853,7 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
 			     RTE_IXGBE_DESCS_PER_LOOP,
 			     dev->data->port_id);
 
+		dev->rx_descriptors_refill = ixgbe_rx_descriptors_refill_vec;
 		dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
 	} else if (adapter->rx_bulk_alloc_allowed) {
 		PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
@@ -5623,6 +5625,29 @@ ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	qinfo->conf.tx_deferred_start = txq->tx_deferred_start;
 }
 
+void
+ixgbe_rxq_buf_recycle_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct ixgbe_rx_queue *rxq;
+	struct ixgbe_adapter *adapter = dev->data->dev_private;
+
+	rxq = dev->data->rx_queues[queue_id];
+
+	rxq_buf_recycle_info->buf_ring = (void *)rxq->sw_ring;
+	rxq_buf_recycle_info->mp = rxq->mb_pool;
+	rxq_buf_recycle_info->buf_ring_size = rxq->nb_rx_desc;
+	rxq_buf_recycle_info->receive_tail = &rxq->rx_tail;
+
+	if (adapter->rx_vec_allowed) {
+		rxq_buf_recycle_info->refill_request = RTE_IXGBE_RXQ_REARM_THRESH;
+		rxq_buf_recycle_info->refill_head = &rxq->rxrearm_start;
+	} else {
+		rxq_buf_recycle_info->refill_request = rxq->rx_free_thresh;
+		rxq_buf_recycle_info->refill_head = &rxq->rx_free_trigger;
+	}
+}
+
 /*
  * [VF] Initializes Receive Unit.
  */
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 668a5b9814..18f890f91a 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -295,6 +295,10 @@ int ixgbe_dev_tx_done_cleanup(void *tx_queue, uint32_t free_cnt);
 extern const uint32_t ptype_table[IXGBE_PACKET_TYPE_MAX];
 extern const uint32_t ptype_table_tn[IXGBE_PACKET_TYPE_TN_MAX];
 
+uint16_t ixgbe_tx_buf_stash_vec(void *tx_queue,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+uint16_t ixgbe_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb);
+
 uint16_t ixgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 				    uint16_t nb_pkts);
 int ixgbe_txq_vec_setup(struct ixgbe_tx_queue *txq);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h b/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h
index a4d9ec9b08..e66a4a2d5b 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h
@@ -139,6 +139,133 @@ tx_backlog_entry(struct ixgbe_tx_entry_v *txep,
 		txep[i].mbuf = tx_pkts[i];
 }
 
+uint16_t
+ixgbe_tx_buf_stash_vec(void *tx_queue,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct ixgbe_tx_queue *txq = tx_queue;
+	struct ixgbe_tx_entry *txep;
+	struct rte_mbuf **rxep;
+	struct rte_mbuf *m[RTE_IXGBE_TX_MAX_FREE_BUF_SZ];
+	int i, j, n;
+	uint32_t status;
+	uint16_t avail = 0;
+	uint16_t buf_ring_size = rxq_buf_recycle_info->buf_ring_size;
+	uint16_t mask = rxq_buf_recycle_info->buf_ring_size - 1;
+	uint16_t refill_request = rxq_buf_recycle_info->refill_request;
+	uint16_t refill_head = *rxq_buf_recycle_info->refill_head;
+	uint16_t receive_tail = *rxq_buf_recycle_info->receive_tail;
+
+	/* Get available recycling Rx buffers. */
+	avail = (buf_ring_size - (refill_head - receive_tail)) & mask;
+
+	/* Check Tx free thresh and Rx available space. */
+	if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+		return 0;
+
+	/* check DD bits on threshold descriptor */
+	status = txq->tx_ring[txq->tx_next_dd].wb.status;
+	if (!(status & IXGBE_ADVTXD_STAT_DD))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	/* Buffer recycle does not support ring buffer wraparound.
+	 * There are two cases for this:
+	 *
+	 * case 1: The refill head of Rx buffer ring needs to be aligned with
+	 * buffer ring size. In this case, the number of Tx freeing buffers
+	 * should be equal to refill_request.
+	 *
+	 * case 2: The refill head of Rx ring buffer does not need to be aligned
+	 * with buffer ring size. In this case, the update of refill head can not
+	 * exceed the Rx buffer ring size.
+	 */
+	if (refill_request != n ||
+		(!refill_request && (refill_head + n > buf_ring_size)))
+		return 0;
+
+	/* First buffer to free from S/W ring is at index
+	 * tx_next_dd - (tx_rs_thresh-1).
+	 */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	rxep = rxq_buf_recycle_info->buf_ring;
+	rxep += refill_head;
+
+	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+		/* Directly put mbufs from Tx to Rx. */
+		for (i = 0; i < n; i++, rxep++, txep++)
+			*rxep = txep[0].mbuf;
+	} else {
+		for (i = 0, j = 0; i < n; i++) {
+			/* Bail out if txq contains buffers from an unexpected mempool. */
+			if (unlikely(rxq_buf_recycle_info->mp
+						!= txep[i].mbuf->pool))
+				return 0;
+
+			m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+			/* In case 1, each of Tx buffers should be the
+			 * last reference.
+			 */
+			if (unlikely(m[j] == NULL && refill_request))
+				return 0;
+			/* In case 2, the number of valid Tx free
+			 * buffers should be recorded.
+			 */
+			j++;
+		}
+		rte_memcpy(rxep, m, sizeof(void *) * j);
+	}
+
+	/* Update counters for Tx. */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return n;
+}
+
+uint16_t
+ixgbe_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb)
+{
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	struct ixgbe_rx_entry *rxep;
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	uint16_t rx_id;
+	uint64_t paddr;
+	uint64_t dma_addr;
+	uint16_t i;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	for (i = 0; i < nb; i++) {
+		/* Initialize rxdp descs. */
+		paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr = rte_cpu_to_le_64(paddr);
+		/* flush desc with pa dma_addr */
+		rxdp[i].read.hdr_addr = 0;
+		rxdp[i].read.pkt_addr = dma_addr;
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += nb;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= nb;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			(rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
+
+	return nb;
+}
+
 static inline void
 _ixgbe_tx_queue_release_mbufs_vec(struct ixgbe_tx_queue *txq)
 {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v4 1/3] ethdev: add API for buffer recycle mode
  2023-03-23 10:43   ` [PATCH v4 1/3] ethdev: add API for buffer recycle mode Feifei Wang
@ 2023-03-23 11:41     ` Morten Brørup
  2023-03-29  2:16       ` Feifei Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Morten Brørup @ 2023-03-23 11:41 UTC (permalink / raw)
  To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang

> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Thursday, 23 March 2023 11.43
> 

[...]

> +static inline uint16_t rte_eth_rx_descriptors_refill(uint16_t port_id,
> +		uint16_t queue_id, uint16_t nb)
> +{
> +	struct rte_eth_fp_ops *p;
> +	void *qd;
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> +	if (port_id >= RTE_MAX_ETHPORTS ||
> +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Invalid port_id=%u or queue_id=%u\n",
> +			port_id, queue_id);
> +		rte_errno = ENODEV;
> +		return 0;
> +	}
> +#endif
> +
> +	p = &rte_eth_fp_ops[port_id];
> +	qd = p->rxq.data[queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> +	if (!rte_eth_dev_is_valid_port(port_id)) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Rx port_id=%u\n", port_id);
> +		rte_errno = ENODEV;
> +		return 0;
> +
> +	if (qd == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
> +			queue_id, port_id);
> +		rte_errno = ENODEV;
> +		return 0;
> +	}
> +#endif
> +
> +	if (!p->rx_descriptors_refill)

Compare to NULL instead: if (p->rx_descriptors_refill == NULL)

> +		return 0;
> +
> +	return p->rx_descriptors_refill(qd, nb);
> +}
> +
>  /**@{@name Rx hardware descriptor states
>   * @see rte_eth_rx_descriptor_status
>   */
> @@ -6483,6 +6597,122 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
>  	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
>  }
> 
> +/**
> + * @internal
> + * Tx routine for rte_eth_dev_buf_recycle().
> + * Stash Tx used buffers into Rx buffer ring in buffer recycle mode.
> + *
> + * @note
> + * This API can only be called by rte_eth_dev_buf_recycle().
> + * After calling this API, rte_eth_rx_descriptors_refill() should be
> + * called to refill Rx ring descriptors.
> + *
> + * When this functionality is not implemented in the driver, the return
> + * buffer number is 0.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The index of the transmit queue.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param rxq_buf_recycle_info
> + *   A pointer to a structure of Rx queue buffer ring information in buffer
> + *   recycle mode.
> + *
> + * @return
> + *   The number buffers correct to be filled in the Rx buffer ring.
> + *   - ENODEV: bad port or queue (only if compiled with debug).
> + */
> +static inline uint16_t rte_eth_tx_buf_stash(uint16_t port_id, uint16_t
> queue_id,
> +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> +{
> +	struct rte_eth_fp_ops *p;
> +	void *qd;
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> +	if (port_id >= RTE_MAX_ETHPORTS ||
> +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Invalid port_id=%u or queue_id=%u\n",
> +			port_id, queue_id);
> +		rte_errno = ENODEV;
> +		return 0;
> +	}
> +#endif
> +
> +	p = &rte_eth_fp_ops[port_id];
> +	qd = p->txq.data[queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> +	if (!rte_eth_dev_is_valid_port(port_id)) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Tx port_id=%u\n", port_id);
> +		rte_errno = ENODEV;
> +		return 0;
> +
> +	if (qd == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
> +			queue_id, port_id);
> +		rte_erno = ENODEV;
> +		return 0;
> +	}
> +#endif
> +
> +	if (p->tx_buf_stash == NULL)
> +		return 0;
> +
> +	return p->tx_buf_stash(qd, rxq_buf_recycle_info);
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Buffer recycle mode can let Tx queue directly put used buffers into Rx
> buffer
> + * ring. This avoids freeing buffers into mempool and allocating buffers from
> + * mempool.
> + *
> + * @param rx_port_id
> + *   Port identifying the receive side.
> + * @param rx_queue_id
> + *   The index of the receive queue identifying the receive side.
> + *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_port_id
> + *   Port identifying the transmit side.
> + * @param tx_queue_id
> + *   The index of the transmit queue identifying the transmit side.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param rxq_recycle_info
> + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> + * @return
> + *   - (0) on success or no recycling buffer.
> + *   - (-EINVAL) rxq_recycle_info is NULL.
> + */
> +__rte_experimental
> +static inline int
> +rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
> +		uint16_t tx_port_id, uint16_t tx_queue_id,
> +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> +{
> +	/* The number of recycling buffers. */
> +	uint16_t nb_buf;
> +
> +	if (!rxq_buf_recycle_info)
> +		return -EINVAL;

Compare to NULL instead: if (rxq_buf_recycle_info == NULL)

Alternatively: Consider RTE_ASSERT() and require rxq_buf_recycle_info to be valid.

> +
> +	/* Stash Tx used buffers into Rx buffer ring */
> +	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
> +				rxq_buf_recycle_info);
> +	/* If there are recycling buffers, refill Rx queue descriptors. */
> +	if (nb_buf)
> +		rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
> +					nb_buf);
> +
> +	return 0;
> +}
> +
>  /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice
> diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> index dcf8adab92..10f9d5cbe7 100644
> --- a/lib/ethdev/rte_ethdev_core.h
> +++ b/lib/ethdev/rte_ethdev_core.h
> @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq,
> uint16_t offset);
>  /** @internal Check the status of a Tx descriptor */
>  typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> 
> +/** @internal Stash Tx used buffers into RX ring in buffer recycle mode */
> +typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
> +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
> +
> +/** @internal Refill Rx descriptors in buffer recycle mode */
> +typedef uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
> +
>  /**
>   * @internal
>   * Structure used to hold opaque pointers to internal ethdev Rx/Tx
> @@ -90,6 +97,8 @@ struct rte_eth_fp_ops {
>  	eth_rx_queue_count_t rx_queue_count;
>  	/** Check the status of a Rx descriptor. */
>  	eth_rx_descriptor_status_t rx_descriptor_status;
> +	/** Refill Rx descriptors in buffer recycle mode */
> +	eth_rx_descriptors_refill_t rx_descriptors_refill;
>  	/** Rx queues data. */
>  	struct rte_ethdev_qdata rxq;
>  	uintptr_t reserved1[3];

In order to keep 64 B alignment, the reserved1 array must be reduced from 3 to 2.

> @@ -106,6 +115,8 @@ struct rte_eth_fp_ops {
>  	eth_tx_prep_t tx_pkt_prepare;
>  	/** Check the status of a Tx descriptor. */
>  	eth_tx_descriptor_status_t tx_descriptor_status;
> +	/** Stash Tx used buffers into RX ring in buffer recycle mode */
> +	eth_tx_buf_stash_t tx_buf_stash;
>  	/** Tx queues data. */
>  	struct rte_ethdev_qdata txq;
>  	uintptr_t reserved2[3];

In order to keep 64 B alignment, the reserved2 array must be reduced from 3 to 2.

> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 357d1a88c0..8a4b1dac80 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -299,6 +299,10 @@ EXPERIMENTAL {
>  	rte_flow_action_handle_query_update;
>  	rte_flow_async_action_handle_query_update;
>  	rte_flow_async_create_by_index;
> +
> +	# added in 23.07
> +	rte_eth_dev_buf_recycle;
> +	rte_eth_rx_queue_buf_recycle_info_get;
>  };
> 
>  INTERNAL {
> @@ -328,4 +332,6 @@ INTERNAL {
>  	rte_eth_representor_id_get;
>  	rte_eth_switch_domain_alloc;
>  	rte_eth_switch_domain_free;
> +	rte_eth_tx_buf_stash;
> +	rte_eth_rx_descriptors_refill;
>  };
> --
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v4 1/3] ethdev: add API for buffer recycle mode
  2023-03-23 11:41     ` Morten Brørup
@ 2023-03-29  2:16       ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-03-29  2:16 UTC (permalink / raw)
  To: Morten Brørup, thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd

Hi, Morten

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Thursday, March 23, 2023 7:42 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net; Ferruh
> Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: RE: [PATCH v4 1/3] ethdev: add API for buffer recycle mode
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Thursday, 23 March 2023 11.43
> >
> 
> [...]
> 
> > +static inline uint16_t rte_eth_rx_descriptors_refill(uint16_t port_id,
> > +		uint16_t queue_id, uint16_t nb)
> > +{
> > +	struct rte_eth_fp_ops *p;
> > +	void *qd;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			"Invalid port_id=%u or queue_id=%u\n",
> > +			port_id, queue_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> > +	}
> > +#endif
> > +
> > +	p = &rte_eth_fp_ops[port_id];
> > +	qd = p->rxq.data[queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx port_id=%u\n", port_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> > +
> > +	if (qd == NULL) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for
> port_id=%u\n",
> > +			queue_id, port_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> > +	}
> > +#endif
> > +
> > +	if (!p->rx_descriptors_refill)
> 
> Compare to NULL instead: if (p->rx_descriptors_refill == NULL)
> 
Ack.

> > +		return 0;
> > +
> > +	return p->rx_descriptors_refill(qd, nb); }
> > +
> >  /**@{@name Rx hardware descriptor states
> >   * @see rte_eth_rx_descriptor_status
> >   */
> > @@ -6483,6 +6597,122 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t
> queue_id,
> >  	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);  }
> >
> > +/**
> > + * @internal
> > + * Tx routine for rte_eth_dev_buf_recycle().
> > + * Stash Tx used buffers into Rx buffer ring in buffer recycle mode.
> > + *
> > + * @note
> > + * This API can only be called by rte_eth_dev_buf_recycle().
> > + * After calling this API, rte_eth_rx_descriptors_refill() should be
> > + * called to refill Rx ring descriptors.
> > + *
> > + * When this functionality is not implemented in the driver, the
> > +return
> > + * buffer number is 0.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The index of the transmit queue.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param rxq_buf_recycle_info
> > + *   A pointer to a structure of Rx queue buffer ring information in buffer
> > + *   recycle mode.
> > + *
> > + * @return
> > + *   The number buffers correct to be filled in the Rx buffer ring.
> > + *   - ENODEV: bad port or queue (only if compiled with debug).
> > + */
> > +static inline uint16_t rte_eth_tx_buf_stash(uint16_t port_id,
> > +uint16_t
> > queue_id,
> > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> {
> > +	struct rte_eth_fp_ops *p;
> > +	void *qd;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			"Invalid port_id=%u or queue_id=%u\n",
> > +			port_id, queue_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> > +	}
> > +#endif
> > +
> > +	p = &rte_eth_fp_ops[port_id];
> > +	qd = p->txq.data[queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx port_id=%u\n", port_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> > +
> > +	if (qd == NULL) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for
> port_id=%u\n",
> > +			queue_id, port_id);
> > +		rte_erno = ENODEV;
> > +		return 0;
> > +	}
> > +#endif
> > +
> > +	if (p->tx_buf_stash == NULL)
> > +		return 0;
> > +
> > +	return p->tx_buf_stash(qd, rxq_buf_recycle_info); }
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Buffer recycle mode can let Tx queue directly put used buffers
> > +into Rx
> > buffer
> > + * ring. This avoids freeing buffers into mempool and allocating
> > +buffers from
> > + * mempool.
> > + *
> > + * @param rx_port_id
> > + *   Port identifying the receive side.
> > + * @param rx_queue_id
> > + *   The index of the receive queue identifying the receive side.
> > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + *   Port identifying the transmit side.
> > + * @param tx_queue_id
> > + *   The index of the transmit queue identifying the transmit side.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param rxq_recycle_info
> > + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> > + * @return
> > + *   - (0) on success or no recycling buffer.
> > + *   - (-EINVAL) rxq_recycle_info is NULL.
> > + */
> > +__rte_experimental
> > +static inline int
> > +rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
> > +		uint16_t tx_port_id, uint16_t tx_queue_id,
> > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> {
> > +	/* The number of recycling buffers. */
> > +	uint16_t nb_buf;
> > +
> > +	if (!rxq_buf_recycle_info)
> > +		return -EINVAL;
> 
> Compare to NULL instead: if (rxq_buf_recycle_info == NULL)
> 
Ack.
> Alternatively: Consider RTE_ASSERT() and require rxq_buf_recycle_info to
> be valid.
> 
Good comment. This ensures users create a rxq_buf_recycle_info variable
before calling this info_get API.
> > +
> > +	/* Stash Tx used buffers into Rx buffer ring */
> > +	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
> > +				rxq_buf_recycle_info);
> > +	/* If there are recycling buffers, refill Rx queue descriptors. */
> > +	if (nb_buf)
> > +		rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
> > +					nb_buf);
> > +
> > +	return 0;
> > +}
> > +
> >  /**
> >   * @warning
> >   * @b EXPERIMENTAL: this API may change without prior notice diff
> > --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> > index dcf8adab92..10f9d5cbe7 100644
> > --- a/lib/ethdev/rte_ethdev_core.h
> > +++ b/lib/ethdev/rte_ethdev_core.h
> > @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void
> > *rxq, uint16_t offset);
> >  /** @internal Check the status of a Tx descriptor */  typedef int
> > (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> >
> > +/** @internal Stash Tx used buffers into RX ring in buffer recycle
> > +mode */ typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
> > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
> > +
> > +/** @internal Refill Rx descriptors in buffer recycle mode */ typedef
> > +uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
> > +
> >  /**
> >   * @internal
> >   * Structure used to hold opaque pointers to internal ethdev Rx/Tx @@
> > -90,6 +97,8 @@ struct rte_eth_fp_ops {
> >  	eth_rx_queue_count_t rx_queue_count;
> >  	/** Check the status of a Rx descriptor. */
> >  	eth_rx_descriptor_status_t rx_descriptor_status;
> > +	/** Refill Rx descriptors in buffer recycle mode */
> > +	eth_rx_descriptors_refill_t rx_descriptors_refill;
> >  	/** Rx queues data. */
> >  	struct rte_ethdev_qdata rxq;
> >  	uintptr_t reserved1[3];
> 
> In order to keep 64 B alignment, the reserved1 array must be reduced from 3
> to 2.
Agree.
> 
> > @@ -106,6 +115,8 @@ struct rte_eth_fp_ops {
> >  	eth_tx_prep_t tx_pkt_prepare;
> >  	/** Check the status of a Tx descriptor. */
> >  	eth_tx_descriptor_status_t tx_descriptor_status;
> > +	/** Stash Tx used buffers into RX ring in buffer recycle mode */
> > +	eth_tx_buf_stash_t tx_buf_stash;
> >  	/** Tx queues data. */
> >  	struct rte_ethdev_qdata txq;
> >  	uintptr_t reserved2[3];
> 
> In order to keep 64 B alignment, the reserved2 array must be reduced from 3
> to 2.
Agree.
> 
> > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > 357d1a88c0..8a4b1dac80 100644
> > --- a/lib/ethdev/version.map
> > +++ b/lib/ethdev/version.map
> > @@ -299,6 +299,10 @@ EXPERIMENTAL {
> >  	rte_flow_action_handle_query_update;
> >  	rte_flow_async_action_handle_query_update;
> >  	rte_flow_async_create_by_index;
> > +
> > +	# added in 23.07
> > +	rte_eth_dev_buf_recycle;
> > +	rte_eth_rx_queue_buf_recycle_info_get;
> >  };
> >
> >  INTERNAL {
> > @@ -328,4 +332,6 @@ INTERNAL {
> >  	rte_eth_representor_id_get;
> >  	rte_eth_switch_domain_alloc;
> >  	rte_eth_switch_domain_free;
> > +	rte_eth_tx_buf_stash;
> > +	rte_eth_rx_descriptors_refill;
> >  };
> > --
> > 2.25.1
> >


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v5 0/3] Recycle buffers from Tx to Rx
  2021-12-24 16:46 [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Feifei Wang
                   ` (5 preceding siblings ...)
  2023-03-23 10:43 ` [PATCH v4 0/3] Recycle buffers from Tx to Rx Feifei Wang
@ 2023-03-30  6:29 ` Feifei Wang
  2023-03-30  6:29   ` [PATCH v5 1/3] ethdev: add API for buffer recycle mode Feifei Wang
                     ` (4 more replies)
  2023-05-25  9:45 ` [PATCH v6 0/4] Recycle mbufs from Tx queue to Rx queue Feifei Wang
  7 siblings, 5 replies; 67+ messages in thread
From: Feifei Wang @ 2023-03-30  6:29 UTC (permalink / raw)
  Cc: dev, konstantin.v.ananyev, mb, nd, Feifei Wang

Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.

This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache. i.e. the receive side will free the buffers from
transmit side directly into its software ring. This will avoid the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache. We call this mode buffer recycle
mode.

In the latest version, buffer recycle mode is packaged as a separate API.
This allows users to change the rxq/txq pairing at runtime in the data plane,
according to the application's analysis of the packet flow, for example:
-----------------------------------------------------------------------
Step 1: upper application analyse the flow direction
Step 2: rxq_buf_recycle_info = rte_eth_rx_buf_recycle_info_get(rx_portid, rx_queueid)
Step 3: rte_eth_dev_buf_recycle(rx_portid, rx_queueid, tx_portid, tx_queueid, rxq_buf_recycle_info);
Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------
The above flow lets the user change the rxq/txq pairing at runtime, and the user
does not need to know the direction of the flow in advance. This effectively
expands the use scenarios of buffer recycle mode.

Furthermore, buffer recycle mode is no longer limited to a single PMD: it can
move buffers between PMDs from different vendors, and can even place the buffers
anywhere in your Rx buffer ring as long as the address of the buffer ring is
provided. In the latest version, we enable direct-rearm in the i40e and ixgbe
PMDs, and also test using the i40e driver on Rx with the ixgbe driver on Tx,
achieving a 7-9% performance improvement with buffer recycle mode.

Difference between buffer recycle, ZC API used in mempool and general path
For general path: 
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
                Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For buffer recycle:
                Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus, in each loop, buffer recycle saves the copy of 32+32=64 packet pointers
compared with the general path, and the copy of 32 packet pointers compared with
the ZC API used in mempool. So buffer recycle has its own benefits.

Testing status:
(1) dpdk l3fwd test with multiple drivers:
    port 0: 82599 NIC   port 1: XL710 NIC
-------------------------------------------------------------
		Without fast free	With fast free
Thunderx2:      +7.53%	                +13.54%
-------------------------------------------------------------

(2) dpdk l3fwd test with same driver:
    port 0 && 1: XL710 NIC
-------------------------------------------------------------
		Without fast free	With fast free
Ampere altra:   +12.61%		        +11.42%
n1sdp:		+8.30%			+3.85%
x86-sse:	+8.43%			+3.72%
-------------------------------------------------------------

(3) Performance comparison with ZC_mempool used
    port 0 && 1: XL710 NIC
    with fast free
-------------------------------------------------------------
		With recycle buffer	With zc_mempool
Ampere altra:	11.42%			3.54%
-------------------------------------------------------------

V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)

V3:
1. Separate the Rx and Tx operations into two APIs in direct-rearm (Konstantin)
2. Delete L3fwd change for direct rearm (Jerin)
3. Enable direct re-arm in the ixgbe driver on Arm

v4:
1. Rename direct-rearm as buffer recycle. Based on this, function and
variable names are changed to make this mode more general for all
drivers. (Konstantin, Morten)
2. Add ring wrapping check (Konstantin)

v5:
1. Some changes to the ethdev API (Morten)
2. Add support for the avx2, sse and altivec paths

Feifei Wang (3):
  ethdev: add API for buffer recycle mode
  net/i40e: implement recycle buffer mode
  net/ixgbe: implement recycle buffer mode

 drivers/net/i40e/i40e_ethdev.c   |   1 +
 drivers/net/i40e/i40e_ethdev.h   |   2 +
 drivers/net/i40e/i40e_rxtx.c     | 159 +++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h     |   4 +
 drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h |   3 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 153 ++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |   4 +
 lib/ethdev/ethdev_driver.h       |  10 ++
 lib/ethdev/ethdev_private.c      |   2 +
 lib/ethdev/rte_ethdev.c          |  33 +++++
 lib/ethdev/rte_ethdev.h          | 230 +++++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h     |  15 +-
 lib/ethdev/version.map           |   6 +
 14 files changed, 621 insertions(+), 2 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v5 1/3] ethdev: add API for buffer recycle mode
  2023-03-30  6:29 ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Feifei Wang
@ 2023-03-30  6:29   ` Feifei Wang
  2023-03-30  7:19     ` Morten Brørup
  2023-04-19 14:46     ` Ferruh Yigit
  2023-03-30  6:29   ` [PATCH v5 2/3] net/i40e: implement recycle buffer mode Feifei Wang
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 67+ messages in thread
From: Feifei Wang @ 2023-03-30  6:29 UTC (permalink / raw)
  To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, mb, nd, Feifei Wang,
	Honnappa Nagarahalli, Ruifeng Wang

There are 4 upper-level APIs for buffer recycle mode:
1. 'rte_eth_rx_queue_buf_recycle_info_get'
This retrieves buffer ring information about a given port's Rx
queue in buffer recycle mode. Thanks to this, buffer recycle is
no longer limited to the same driver for Rx and Tx.

2. 'rte_eth_dev_buf_recycle'
Users call this API to run buffer recycle mode in the data path.
It wraps 2 internal APIs, one for Rx and one for Tx.

3. 'rte_eth_tx_buf_stash'
Internal API for buffer recycle mode. It stashes used Tx
buffers into the Rx buffer ring.

4. 'rte_eth_rx_descriptors_refill'
Internal API for buffer recycle mode. It refills Rx
descriptors.

All of the above APIs are implemented only at the ethdev level;
each driver needs to provide its own callbacks for them, as sketched below.
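
For illustration only, a hedged sketch of how a PMD might register these
callbacks (the mydrv_* names are hypothetical; the real hooks for i40e
and ixgbe are added in patches 2/3 and 3/3):

#include <ethdev_driver.h>

static uint16_t mydrv_tx_buf_stash(void *txq,
		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
static uint16_t mydrv_rx_descriptors_refill(void *rxq, uint16_t nb);
static void mydrv_rxq_buf_recycle_info_get(struct rte_eth_dev *dev,
		uint16_t queue_id,
		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);

static const struct eth_dev_ops mydrv_eth_dev_ops = {
	/* ... other dev_ops ... */
	.rxq_buf_recycle_info_get = mydrv_rxq_buf_recycle_info_get,
};

static int
mydrv_dev_init(struct rte_eth_dev *dev)
{
	dev->dev_ops = &mydrv_eth_dev_ops;
	/* Fast-path callbacks, copied into rte_eth_fp_ops by
	 * eth_dev_fp_ops_setup() and called from rte_eth_dev_buf_recycle().
	 */
	dev->tx_buf_stash = mydrv_tx_buf_stash;
	dev->rx_descriptors_refill = mydrv_rx_descriptors_refill;
	return 0;
}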

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/ethdev/ethdev_driver.h   |  10 ++
 lib/ethdev/ethdev_private.c  |   2 +
 lib/ethdev/rte_ethdev.c      |  33 +++++
 lib/ethdev/rte_ethdev.h      | 230 +++++++++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h |  15 ++-
 lib/ethdev/version.map       |   6 +
 6 files changed, 294 insertions(+), 2 deletions(-)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 2c9d615fb5..412f064975 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -59,6 +59,10 @@ struct rte_eth_dev {
 	eth_rx_descriptor_status_t rx_descriptor_status;
 	/** Check the status of a Tx descriptor */
 	eth_tx_descriptor_status_t tx_descriptor_status;
+	/** Stash Tx used buffers into RX ring in buffer recycle mode */
+	eth_tx_buf_stash_t tx_buf_stash;
+	/** Refill Rx descriptors in buffer recycle mode */
+	eth_rx_descriptors_refill_t rx_descriptors_refill;
 
 	/**
 	 * Device data that is shared between primary and secondary processes
@@ -504,6 +508,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
 typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
 	uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
 
+typedef void (*eth_rxq_buf_recycle_info_get_t)(struct rte_eth_dev *dev,
+	uint16_t rx_queue_id,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+
 typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
 	uint16_t queue_id, struct rte_eth_burst_mode *mode);
 
@@ -1247,6 +1255,8 @@ struct eth_dev_ops {
 	eth_rxq_info_get_t         rxq_info_get;
 	/** Retrieve Tx queue information */
 	eth_txq_info_get_t         txq_info_get;
+	/** Get Rx queue buffer recycle information */
+	eth_rxq_buf_recycle_info_get_t rxq_buf_recycle_info_get;
 	eth_burst_mode_get_t       rx_burst_mode_get; /**< Get Rx burst mode */
 	eth_burst_mode_get_t       tx_burst_mode_get; /**< Get Tx burst mode */
 	eth_fw_version_get_t       fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 14ec8c6ccf..f8d0ae9226 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
 	fpo->rx_queue_count = dev->rx_queue_count;
 	fpo->rx_descriptor_status = dev->rx_descriptor_status;
 	fpo->tx_descriptor_status = dev->tx_descriptor_status;
+	fpo->tx_buf_stash = dev->tx_buf_stash;
+	fpo->rx_descriptors_refill = dev->rx_descriptors_refill;
 
 	fpo->rxq.data = dev->data->rx_queues;
 	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 4d03255683..36c3a17588 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -5784,6 +5784,39 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 	return 0;
 }
 
+int
+rte_eth_rx_queue_buf_recycle_info_get(uint16_t port_id, uint16_t queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	RTE_ASSERT(rxq_buf_recycle_info != NULL);
+
+	if (dev->data->rx_queues == NULL ||
+			dev->data->rx_queues[queue_id] == NULL) {
+		RTE_ETHDEV_LOG(ERR,
+			   "Rx queue %"PRIu16" of device with port_id=%"
+			   PRIu16" has not been setup\n",
+			   queue_id, port_id);
+		return -EINVAL;
+	}
+
+	if (*dev->dev_ops->rxq_buf_recycle_info_get == NULL)
+		return -ENOTSUP;
+
+	dev->dev_ops->rxq_buf_recycle_info_get(dev, queue_id, rxq_buf_recycle_info);
+
+	return 0;
+}
+
 int
 rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 			  struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 99fe9e238b..016c16615d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1820,6 +1820,29 @@ struct rte_eth_txq_info {
 	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
 } __rte_cache_min_aligned;
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice.
+ *
+ * Ethernet device Rx queue buffer ring information structure in buffer recycle mode.
+ * Used to retrieve Rx queue buffer ring information when the Tx queue stashes used
+ * buffers into the Rx buffer ring.
+ */
+struct rte_eth_rxq_buf_recycle_info {
+	struct rte_mbuf **buf_ring; /**< buffer ring of Rx queue. */
+	struct rte_mempool *mp;     /**< mempool of Rx queue. */
+	uint16_t *refill_head;      /**< head of buffer ring refilling descriptors. */
+	uint16_t *receive_tail;     /**< tail of buffer ring receiving pkts. */
+	uint16_t buf_ring_size;     /**< configured size of the buffer ring. */
+	/**
+	 * Requested number of Rx refill buffers.
+	 * For some PMDs, the number of Rx refill buffers should be aligned with
+	 * the buffer ring size. This is to simplify ring wraparound.
+	 * Value 0 means there is no such request.
+	 */
+	uint16_t refill_request;
+} __rte_cache_min_aligned;
+
 /* Generic Burst mode flag definition, values can be ORed. */
 
 /**
@@ -4809,6 +4832,32 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Retrieve buffer ring information about a given port's Rx queue in buffer recycle
+ * mode.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which buffer ring information
+ *   will be retrieved.
+ * @param rxq_buf_recycle_info
+ *   A pointer to a structure of type *rte_eth_rxq_buf_recycle_info* to be filled.
+ *
+ * @return
+ *   - 0: Success
+ *   - -ENODEV:  If *port_id* is invalid.
+ *   - -ENOTSUP: routine is not supported by the device PMD.
+ *   - -EINVAL:  The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_rx_queue_buf_recycle_info_get(uint16_t port_id,
+		uint16_t queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+
 /**
  * Retrieve information about the Rx packet burst mode.
  *
@@ -5987,6 +6036,71 @@ rte_eth_rx_queue_count(uint16_t port_id, uint16_t queue_id)
 	return (int)(*p->rx_queue_count)(qd);
 }
 
+/**
+ * @internal
+ * Rx routine for rte_eth_dev_buf_recycle().
+ * Refill Rx descriptors in buffer recycle mode.
+ *
+ * @note
+ * This API can only be called by rte_eth_dev_buf_recycle().
+ * Before calling this API, rte_eth_tx_buf_stash() should be
+ * called to stash Tx used buffers into Rx buffer ring.
+ *
+ * When this functionality is not implemented in the driver, the return
+ * buffer number is 0.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The index of the receive queue.
+ *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ *@param nb
+ *   The number of Rx descriptors to be refilled.
+ * @return
+ *   The number Rx descriptors correct to be refilled.
+ *   - ENODEV: bad port or queue (only if compiled with debug).
+ */
+static inline uint16_t rte_eth_rx_descriptors_refill(uint16_t port_id,
+		uint16_t queue_id, uint16_t nb)
+{
+	struct rte_eth_fp_ops *p;
+	void *qd;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+	if (port_id >= RTE_MAX_ETHPORTS ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		RTE_ETHDEV_LOG(ERR,
+			"Invalid port_id=%u or queue_id=%u\n",
+			port_id, queue_id);
+		rte_errno = ENODEV;
+		return 0;
+	}
+#endif
+
+	p = &rte_eth_fp_ops[port_id];
+	qd = p->rxq.data[queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+	if (!rte_eth_dev_is_valid_port(port_id)) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx port_id=%u\n", port_id);
+		rte_errno = ENODEV;
+		return 0;
+
+	if (qd == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+			queue_id, port_id);
+		rte_errno = ENODEV;
+		return 0;
+	}
+#endif
+
+	if (p->rx_descriptors_refill == NULL)
+		return 0;
+
+	return p->rx_descriptors_refill(qd, nb);
+}
+
 /**@{@name Rx hardware descriptor states
  * @see rte_eth_rx_descriptor_status
  */
@@ -6483,6 +6597,122 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
 	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
 }
 
+/**
+ * @internal
+ * Tx routine for rte_eth_dev_buf_recycle().
+ * Stash Tx used buffers into Rx buffer ring in buffer recycle mode.
+ *
+ * @note
+ * This API can only be called by rte_eth_dev_buf_recycle().
+ * After calling this API, rte_eth_rx_descriptors_refill() should be
+ * called to refill Rx ring descriptors.
+ *
+ * When this functionality is not implemented in the driver, the return
+ * buffer number is 0.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The index of the transmit queue.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param rxq_buf_recycle_info
+ *   A pointer to a structure of Rx queue buffer ring information in buffer
+ *   recycle mode.
+ *
+ * @return
+ *   The number buffers correct to be filled in the Rx buffer ring.
+ *   - ENODEV: bad port or queue (only if compiled with debug).
+ */
+static inline uint16_t rte_eth_tx_buf_stash(uint16_t port_id, uint16_t queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct rte_eth_fp_ops *p;
+	void *qd;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+	if (port_id >= RTE_MAX_ETHPORTS ||
+			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		RTE_ETHDEV_LOG(ERR,
+			"Invalid port_id=%u or queue_id=%u\n",
+			port_id, queue_id);
+		rte_errno = ENODEV;
+		return 0;
+	}
+#endif
+
+	p = &rte_eth_fp_ops[port_id];
+	qd = p->txq.data[queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+	if (!rte_eth_dev_is_valid_port(port_id)) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Tx port_id=%u\n", port_id);
+		rte_errno = ENODEV;
+		return 0;
+
+	if (qd == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+			queue_id, port_id);
+		rte_erno = ENODEV;
+		return 0;
+	}
+#endif
+
+	if (p->tx_buf_stash == NULL)
+		return 0;
+
+	return p->tx_buf_stash(qd, rxq_buf_recycle_info);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Buffer recycle mode can let Tx queue directly put used buffers into Rx buffer
+ * ring. This avoids freeing buffers into mempool and allocating buffers from
+ * mempool.
+ *
+ * @param rx_port_id
+ *   Port identifying the receive side.
+ * @param rx_queue_id
+ *   The index of the receive queue identifying the receive side.
+ *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param tx_port_id
+ *   Port identifying the transmit side.
+ * @param tx_queue_id
+ *   The index of the transmit queue identifying the transmit side.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param rxq_recycle_info
+ *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
+ * @return
+ *   - (0) on success or no recycling buffer.
+ *   - (-EINVAL) rxq_recycle_info is NULL.
+ */
+__rte_experimental
+static inline int
+rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
+		uint16_t tx_port_id, uint16_t tx_queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	/* The number of recycling buffers. */
+	uint16_t nb_buf;
+
+	if (!rxq_buf_recycle_info)
+		return -EINVAL;
+
+	/* Stash Tx used buffers into Rx buffer ring */
+	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
+				rxq_buf_recycle_info);
+	/* If there are recycling buffers, refill Rx queue descriptors. */
+	if (nb_buf)
+		rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
+					nb_buf);
+
+	return 0;
+}
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index dcf8adab92..a138fd4dbc 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
 /** @internal Check the status of a Tx descriptor */
 typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
 
+/** @internal Stash Tx used buffers into RX ring in buffer recycle mode */
+typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+
+/** @internal Refill Rx descriptors in buffer recycle mode */
+typedef uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
+
 /**
  * @internal
  * Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -90,9 +97,11 @@ struct rte_eth_fp_ops {
 	eth_rx_queue_count_t rx_queue_count;
 	/** Check the status of a Rx descriptor. */
 	eth_rx_descriptor_status_t rx_descriptor_status;
+	/** Refill Rx descriptors in buffer recycle mode */
+	eth_rx_descriptors_refill_t rx_descriptors_refill;
 	/** Rx queues data. */
 	struct rte_ethdev_qdata rxq;
-	uintptr_t reserved1[3];
+	uintptr_t reserved1[4];
 	/**@}*/
 
 	/**@{*/
@@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
 	eth_tx_prep_t tx_pkt_prepare;
 	/** Check the status of a Tx descriptor. */
 	eth_tx_descriptor_status_t tx_descriptor_status;
+	/** Stash Tx used buffers into RX ring in buffer recycle mode */
+	eth_tx_buf_stash_t tx_buf_stash;
 	/** Tx queues data. */
 	struct rte_ethdev_qdata txq;
-	uintptr_t reserved2[3];
+	uintptr_t reserved2[4];
 	/**@}*/
 
 } __rte_cache_aligned;
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 357d1a88c0..8a4b1dac80 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -299,6 +299,10 @@ EXPERIMENTAL {
 	rte_flow_action_handle_query_update;
 	rte_flow_async_action_handle_query_update;
 	rte_flow_async_create_by_index;
+
+	# added in 23.07
+	rte_eth_dev_buf_recycle;
+	rte_eth_rx_queue_buf_recycle_info_get;
 };
 
 INTERNAL {
@@ -328,4 +332,6 @@ INTERNAL {
 	rte_eth_representor_id_get;
 	rte_eth_switch_domain_alloc;
 	rte_eth_switch_domain_free;
+	rte_eth_tx_buf_stash;
+	rte_eth_rx_descriptors_refill;
 };
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v5 2/3] net/i40e: implement recycle buffer mode
  2023-03-30  6:29 ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Feifei Wang
  2023-03-30  6:29   ` [PATCH v5 1/3] ethdev: add API for buffer recycle mode Feifei Wang
@ 2023-03-30  6:29   ` Feifei Wang
  2023-03-30  6:29   ` [PATCH v5 3/3] net/ixgbe: " Feifei Wang
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-03-30  6:29 UTC (permalink / raw)
  To: Yuying Zhang, Beilei Xing
  Cc: dev, konstantin.v.ananyev, mb, nd, Feifei Wang,
	Honnappa Nagarahalli, Ruifeng Wang

Define the specific function implementations for the i40e driver.
Currently, recycle buffer mode supports the 128-bit vector path
and the avx2 path, and can be enabled in both fast-free and
no-fast-free mode.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/i40e/i40e_ethdev.c |   1 +
 drivers/net/i40e/i40e_ethdev.h |   2 +
 drivers/net/i40e/i40e_rxtx.c   | 159 +++++++++++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.h   |   4 +
 4 files changed, 166 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index cb0070f94b..456fb256f5 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.flow_ops_get                 = i40e_dev_flow_ops_get,
 	.rxq_info_get                 = i40e_rxq_info_get,
 	.txq_info_get                 = i40e_txq_info_get,
+	.rxq_buf_recycle_info_get     = i40e_rxq_buf_recycle_info_get,
 	.rx_burst_mode_get            = i40e_rx_burst_mode_get,
 	.tx_burst_mode_get            = i40e_tx_burst_mode_get,
 	.timesync_enable              = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 9b806d130e..83c5ff5859 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	struct rte_eth_rxq_info *qinfo);
 void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
+void i40e_rxq_buf_recycle_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
 int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
 			   struct rte_eth_burst_mode *mode);
 int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..55c788317f 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -1536,6 +1536,134 @@ i40e_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	return nb_tx;
 }
 
+uint16_t
+i40e_tx_buf_stash_vec(void *tx_queue,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct i40e_tx_queue *txq = tx_queue;
+	struct i40e_tx_entry *txep;
+	struct rte_mbuf **rxep;
+	struct rte_mbuf *m[RTE_I40E_TX_MAX_FREE_BUF_SZ];
+	int i, j, n;
+	uint16_t avail = 0;
+	uint16_t buf_ring_size = rxq_buf_recycle_info->buf_ring_size;
+	uint16_t mask = rxq_buf_recycle_info->buf_ring_size - 1;
+	uint16_t refill_request = rxq_buf_recycle_info->refill_request;
+	uint16_t refill_head = *rxq_buf_recycle_info->refill_head;
+	uint16_t receive_tail = *rxq_buf_recycle_info->receive_tail;
+
+	/* Get available recycling Rx buffers. */
+	avail = (buf_ring_size - (refill_head - receive_tail)) & mask;
+
+	/* Check Tx free thresh and Rx available space. */
+	if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+		return 0;
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+			rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	/* Buffer recycle can only be used without ring buffer wraparound.
+	 * There are two cases for this:
+	 *
+	 * case 1: The refill head of the Rx buffer ring needs to be aligned
+	 * with the buffer ring size. In this case, the number of Tx buffers
+	 * to free should be equal to refill_request.
+	 *
+	 * case 2: The refill head of the Rx buffer ring does not need to be
+	 * aligned with the buffer ring size. In this case, the refill head
+	 * update cannot exceed the Rx buffer ring size.
+	 */
+	if (refill_request != n ||
+		(!refill_request && (refill_head + n > buf_ring_size)))
+		return 0;
+
+	/* First buffer to free from S/W ring is at index
+	 * tx_next_dd - (tx_rs_thresh-1).
+	 */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	rxep = rxq_buf_recycle_info->buf_ring;
+	rxep += refill_head;
+
+	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+		/* Directly put mbufs from Tx to Rx. */
+		for (i = 0; i < n; i++, rxep++, txep++)
+			*rxep = txep[0].mbuf;
+	} else {
+		for (i = 0, j = 0; i < n; i++) {
+			/* Avoid recycling buffers from an unexpected mempool. */
+			if (unlikely(rxq_buf_recycle_info->mp
+						!= txep[i].mbuf->pool))
+				return 0;
+
+			m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+			/* In case 1, each of the Tx buffers should be the
+			 * last reference.
+			 */
+			if (unlikely(m[j] == NULL && refill_request))
+				return 0;
+			/* In case 2, the number of valid Tx free
+			 * buffers should be recorded.
+			 */
+			j++;
+		}
+		rte_memcpy(rxep, m, sizeof(void *) * j);
+	}
+
+	/* Update counters for Tx. */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return n;
+}
+
+uint16_t
+i40e_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	struct i40e_rx_entry *rxep;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t rx_id;
+	uint64_t paddr;
+	uint64_t dma_addr;
+	uint16_t i;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	for (i = 0; i < nb; i++) {
+		/* Initialize rxdp descs. */
+		paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr = rte_cpu_to_le_64(paddr);
+		/* flush desc with pa dma_addr */
+		rxdp[i].read.hdr_addr = 0;
+		rxdp[i].read.pkt_addr = dma_addr;
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += nb;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= nb;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			(rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	rte_io_wmb();
+	/* Update the tail pointer on the NIC */
+	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+
+	return nb;
+}
+
 /*********************************************************************
  *
  *  TX simple prep functions
@@ -3197,6 +3325,30 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	qinfo->conf.offloads = txq->offloads;
 }
 
+void
+i40e_rxq_buf_recycle_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct i40e_rx_queue *rxq;
+	struct i40e_adapter *ad =
+		I40E_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+
+	rxq = dev->data->rx_queues[queue_id];
+
+	rxq_buf_recycle_info->buf_ring = (void *)rxq->sw_ring;
+	rxq_buf_recycle_info->mp = rxq->mp;
+	rxq_buf_recycle_info->buf_ring_size = rxq->nb_rx_desc;
+	rxq_buf_recycle_info->receive_tail = &rxq->rx_tail;
+
+	if (ad->rx_vec_allowed) {
+		rxq_buf_recycle_info->refill_request = RTE_I40E_RXQ_REARM_THRESH;
+		rxq_buf_recycle_info->refill_head = &rxq->rxrearm_start;
+	} else {
+		rxq_buf_recycle_info->refill_request = rxq->rx_free_thresh;
+		rxq_buf_recycle_info->refill_head = &rxq->rx_free_trigger;
+	}
+}
+
 #ifdef RTE_ARCH_X86
 static inline bool
 get_avx_supported(bool request_avx512)
@@ -3291,6 +3443,8 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
 				dev->rx_pkt_burst = ad->rx_use_avx2 ?
 					i40e_recv_scattered_pkts_vec_avx2 :
 					i40e_recv_scattered_pkts_vec;
+				dev->rx_descriptors_refill =
+					i40e_rx_descriptors_refill_vec;
 			}
 		} else {
 			if (ad->rx_use_avx512) {
@@ -3309,9 +3463,12 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
 				dev->rx_pkt_burst = ad->rx_use_avx2 ?
 					i40e_recv_pkts_vec_avx2 :
 					i40e_recv_pkts_vec;
+				dev->rx_descriptors_refill =
+					i40e_rx_descriptors_refill_vec;
 			}
 		}
 #else /* RTE_ARCH_X86 */
+		dev->rx_descriptors_refill = i40e_rx_descriptors_refill_vec;
 		if (dev->data->scattered_rx) {
 			PMD_INIT_LOG(DEBUG,
 				     "Using Vector Scattered Rx (port %d).",
@@ -3479,6 +3636,7 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
 				dev->tx_pkt_burst = ad->tx_use_avx2 ?
 						    i40e_xmit_pkts_vec_avx2 :
 						    i40e_xmit_pkts_vec;
+				dev->tx_buf_stash = i40e_tx_buf_stash_vec;
 			}
 #else /* RTE_ARCH_X86 */
 			PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
@@ -3488,6 +3646,7 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
 		} else {
 			PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
 			dev->tx_pkt_burst = i40e_xmit_pkts_simple;
+			dev->tx_buf_stash = i40e_tx_buf_stash_vec;
 		}
 		dev->tx_pkt_prepare = i40e_simple_prep_pkts;
 	} else {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 5e6eecc501..0ad8f530f9 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -233,6 +233,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
 int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
 int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
 
+uint16_t i40e_tx_buf_stash_vec(void *tx_queue,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+uint16_t i40e_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb);
+
 uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			    uint16_t nb_pkts);
 uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v5 3/3] net/ixgbe: implement recycle buffer mode
  2023-03-30  6:29 ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Feifei Wang
  2023-03-30  6:29   ` [PATCH v5 1/3] ethdev: add API for buffer recycle mode Feifei Wang
  2023-03-30  6:29   ` [PATCH v5 2/3] net/i40e: implement recycle buffer mode Feifei Wang
@ 2023-03-30  6:29   ` Feifei Wang
  2023-04-19 14:46     ` Ferruh Yigit
  2023-03-30 15:04   ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Stephen Hemminger
  2023-04-19 14:56   ` Ferruh Yigit
  4 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2023-03-30  6:29 UTC (permalink / raw)
  To: Qiming Yang, Wenjun Wu
  Cc: dev, konstantin.v.ananyev, mb, nd, Feifei Wang,
	Honnappa Nagarahalli, Ruifeng Wang

Define the specific function implementations for the ixgbe driver.
Currently, recycle buffer mode supports the 128-bit vector path,
and can be enabled in both fast-free and no-fast-free mode.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h |   3 +
 drivers/net/ixgbe/ixgbe_rxtx.c   | 153 +++++++++++++++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.h   |   4 +
 4 files changed, 161 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 88118bc305..3bada9abbd 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -543,6 +543,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.set_mc_addr_list     = ixgbe_dev_set_mc_addr_list,
 	.rxq_info_get         = ixgbe_rxq_info_get,
 	.txq_info_get         = ixgbe_txq_info_get,
+	.rxq_buf_recycle_info_get = ixgbe_rxq_buf_recycle_info_get,
 	.timesync_enable      = ixgbe_timesync_enable,
 	.timesync_disable     = ixgbe_timesync_disable,
 	.timesync_read_rx_timestamp = ixgbe_timesync_read_rx_timestamp,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 48290af512..ca6aa0da64 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -625,6 +625,9 @@ void ixgbe_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 void ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
 
+void ixgbe_rxq_buf_recycle_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+
 int ixgbevf_dev_rx_init(struct rte_eth_dev *dev);
 
 void ixgbevf_dev_tx_init(struct rte_eth_dev *dev);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index c9d6ca9efe..ee27121315 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -953,6 +953,133 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	return nb_tx;
 }
 
+uint16_t
+ixgbe_tx_buf_stash_vec(void *tx_queue,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct ixgbe_tx_queue *txq = tx_queue;
+	struct ixgbe_tx_entry *txep;
+	struct rte_mbuf **rxep;
+	struct rte_mbuf *m[RTE_IXGBE_TX_MAX_FREE_BUF_SZ];
+	int i, j, n;
+	uint32_t status;
+	uint16_t avail = 0;
+	uint16_t buf_ring_size = rxq_buf_recycle_info->buf_ring_size;
+	uint16_t mask = rxq_buf_recycle_info->buf_ring_size - 1;
+	uint16_t refill_request = rxq_buf_recycle_info->refill_request;
+	uint16_t refill_head = *rxq_buf_recycle_info->refill_head;
+	uint16_t receive_tail = *rxq_buf_recycle_info->receive_tail;
+
+	/* Get available recycling Rx buffers. */
+	avail = (buf_ring_size - (refill_head - receive_tail)) & mask;
+
+	/* Check Tx free thresh and Rx available space. */
+	if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+		return 0;
+
+	/* check DD bits on threshold descriptor */
+	status = txq->tx_ring[txq->tx_next_dd].wb.status;
+	if (!(status & IXGBE_ADVTXD_STAT_DD))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	/* Buffer recycle can only be used without ring buffer wraparound.
+	 * There are two cases for this:
+	 *
+	 * case 1: The refill head of the Rx buffer ring needs to be aligned
+	 * with the buffer ring size. In this case, the number of Tx buffers
+	 * to free should be equal to refill_request.
+	 *
+	 * case 2: The refill head of the Rx buffer ring does not need to be
+	 * aligned with the buffer ring size. In this case, the refill head
+	 * update cannot exceed the Rx buffer ring size.
+	 */
+	if (refill_request != n ||
+		(!refill_request && (refill_head + n > buf_ring_size)))
+		return 0;
+
+	/* First buffer to free from S/W ring is at index
+	 * tx_next_dd - (tx_rs_thresh-1).
+	 */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	rxep = rxq_buf_recycle_info->buf_ring;
+	rxep += refill_head;
+
+	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+		/* Directly put mbufs from Tx to Rx. */
+		for (i = 0; i < n; i++, rxep++, txep++)
+			*rxep = txep[0].mbuf;
+	} else {
+		for (i = 0, j = 0; i < n; i++) {
+			/* Avoid recycling buffers from an unexpected mempool. */
+			if (unlikely(rxq_buf_recycle_info->mp
+						!= txep[i].mbuf->pool))
+				return 0;
+
+			m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+			/* In case 1, each of the Tx buffers should be the
+			 * last reference.
+			 */
+			if (unlikely(m[j] == NULL && refill_request))
+				return 0;
+			/* In case 2, the number of valid Tx free
+			 * buffers should be recorded.
+			 */
+			j++;
+		}
+		rte_memcpy(rxep, m, sizeof(void *) * j);
+	}
+
+	/* Update counters for Tx. */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return n;
+}
+
+uint16_t
+ixgbe_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb)
+{
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	struct ixgbe_rx_entry *rxep;
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	uint16_t rx_id;
+	uint64_t paddr;
+	uint64_t dma_addr;
+	uint16_t i;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	for (i = 0; i < nb; i++) {
+		/* Initialize rxdp descs. */
+		paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr = rte_cpu_to_le_64(paddr);
+		/* flush desc with pa dma_addr */
+		rxdp[i].read.hdr_addr = 0;
+		rxdp[i].read.pkt_addr = dma_addr;
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += nb;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= nb;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			(rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
+
+	return nb;
+}
+
 /*********************************************************************
  *
  *  TX prep functions
@@ -2558,6 +2685,7 @@ ixgbe_set_tx_function(struct rte_eth_dev *dev, struct ixgbe_tx_queue *txq)
 				(rte_eal_process_type() != RTE_PROC_PRIMARY ||
 					ixgbe_txq_vec_setup(txq) == 0)) {
 			PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
+			dev->tx_buf_stash = ixgbe_tx_buf_stash_vec;
 			dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
 		} else
 		dev->tx_pkt_burst = ixgbe_xmit_pkts_simple;
@@ -4823,6 +4951,7 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
 					    "callback (port=%d).",
 				     dev->data->port_id);
 
+			dev->rx_descriptors_refill = ixgbe_rx_descriptors_refill_vec;
 			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
 		} else if (adapter->rx_bulk_alloc_allowed) {
 			PMD_INIT_LOG(DEBUG, "Using a Scattered with bulk "
@@ -4852,6 +4981,7 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
 			     RTE_IXGBE_DESCS_PER_LOOP,
 			     dev->data->port_id);
 
+		dev->rx_descriptors_refill = ixgbe_rx_descriptors_refill_vec;
 		dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
 	} else if (adapter->rx_bulk_alloc_allowed) {
 		PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
@@ -5623,6 +5753,29 @@ ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	qinfo->conf.tx_deferred_start = txq->tx_deferred_start;
 }
 
+void
+ixgbe_rxq_buf_recycle_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+	struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
+{
+	struct ixgbe_rx_queue *rxq;
+	struct ixgbe_adapter *adapter = dev->data->dev_private;
+
+	rxq = dev->data->rx_queues[queue_id];
+
+	rxq_buf_recycle_info->buf_ring = (void *)rxq->sw_ring;
+	rxq_buf_recycle_info->mp = rxq->mb_pool;
+	rxq_buf_recycle_info->buf_ring_size = rxq->nb_rx_desc;
+	rxq_buf_recycle_info->receive_tail = &rxq->rx_tail;
+
+	if (adapter->rx_vec_allowed) {
+		rxq_buf_recycle_info->refill_request = RTE_IXGBE_RXQ_REARM_THRESH;
+		rxq_buf_recycle_info->refill_head = &rxq->rxrearm_start;
+	} else {
+		rxq_buf_recycle_info->refill_request = rxq->rx_free_thresh;
+		rxq_buf_recycle_info->refill_head = &rxq->rx_free_trigger;
+	}
+}
+
 /*
  * [VF] Initializes Receive Unit.
  */
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 668a5b9814..18f890f91a 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -295,6 +295,10 @@ int ixgbe_dev_tx_done_cleanup(void *tx_queue, uint32_t free_cnt);
 extern const uint32_t ptype_table[IXGBE_PACKET_TYPE_MAX];
 extern const uint32_t ptype_table_tn[IXGBE_PACKET_TYPE_TN_MAX];
 
+uint16_t ixgbe_tx_buf_stash_vec(void *tx_queue,
+		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
+uint16_t ixgbe_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb);
+
 uint16_t ixgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 				    uint16_t nb_pkts);
 int ixgbe_txq_vec_setup(struct ixgbe_tx_queue *txq);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
  2023-03-30  6:29   ` [PATCH v5 1/3] ethdev: add API for buffer recycle mode Feifei Wang
@ 2023-03-30  7:19     ` Morten Brørup
  2023-03-30  9:31       ` Feifei Wang
  2023-04-19 14:46     ` Ferruh Yigit
  1 sibling, 1 reply; 67+ messages in thread
From: Morten Brørup @ 2023-03-30  7:19 UTC (permalink / raw)
  To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang

> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Thursday, 30 March 2023 08.30
> 

[...]

> +/**
> + * @internal
> + * Rx routine for rte_eth_dev_buf_recycle().
> + * Refill Rx descriptors in buffer recycle mode.
> + *
> + * @note
> + * This API can only be called by rte_eth_dev_buf_recycle().
> + * Before calling this API, rte_eth_tx_buf_stash() should be
> + * called to stash Tx used buffers into Rx buffer ring.
> + *
> + * When this functionality is not implemented in the driver, the return
> + * buffer number is 0.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The index of the receive queue.
> + *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + *@param nb
> + *   The number of Rx descriptors to be refilled.
> + * @return
> + *   The number Rx descriptors correct to be refilled.
> + *   - ENODEV: bad port or queue (only if compiled with debug).

If you want errors reported by the return value, the function return type cannot be uint16_t.

> + */
> +static inline uint16_t rte_eth_rx_descriptors_refill(uint16_t port_id,
> +		uint16_t queue_id, uint16_t nb)
> +{
> +	struct rte_eth_fp_ops *p;
> +	void *qd;
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> +	if (port_id >= RTE_MAX_ETHPORTS ||
> +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Invalid port_id=%u or queue_id=%u\n",
> +			port_id, queue_id);
> +		rte_errno = ENODEV;
> +		return 0;

If p->rx_descriptors_refill() is likely to return 0, this function should not use 0 as return value to indicate errors.

> +	}
> +#endif
> +
> +	p = &rte_eth_fp_ops[port_id];
> +	qd = p->rxq.data[queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> +	if (!rte_eth_dev_is_valid_port(port_id)) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Rx port_id=%u\n", port_id);
> +		rte_errno = ENODEV;
> +		return 0;
> +
> +	if (qd == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
> +			queue_id, port_id);
> +		rte_errno = ENODEV;
> +		return 0;
> +	}
> +#endif
> +
> +	if (p->rx_descriptors_refill == NULL)
> +		return 0;
> +
> +	return p->rx_descriptors_refill(qd, nb);
> +}
> +
>  /**@{@name Rx hardware descriptor states
>   * @see rte_eth_rx_descriptor_status
>   */
> @@ -6483,6 +6597,122 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
>  	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
>  }
> 
> +/**
> + * @internal
> + * Tx routine for rte_eth_dev_buf_recycle().
> + * Stash Tx used buffers into Rx buffer ring in buffer recycle mode.
> + *
> + * @note
> + * This API can only be called by rte_eth_dev_buf_recycle().
> + * After calling this API, rte_eth_rx_descriptors_refill() should be
> + * called to refill Rx ring descriptors.
> + *
> + * When this functionality is not implemented in the driver, the return
> + * buffer number is 0.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The index of the transmit queue.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param rxq_buf_recycle_info
> + *   A pointer to a structure of Rx queue buffer ring information in buffer
> + *   recycle mode.
> + *
> + * @return
> + *   The number buffers correct to be filled in the Rx buffer ring.
> + *   - ENODEV: bad port or queue (only if compiled with debug).

If you want errors reported by the return value, the function return type cannot be uint16_t.

> + */
> +static inline uint16_t rte_eth_tx_buf_stash(uint16_t port_id, uint16_t
> queue_id,
> +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> +{
> +	struct rte_eth_fp_ops *p;
> +	void *qd;
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> +	if (port_id >= RTE_MAX_ETHPORTS ||
> +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Invalid port_id=%u or queue_id=%u\n",
> +			port_id, queue_id);
> +		rte_errno = ENODEV;
> +		return 0;

If p->tx_buf_stash() is likely to return 0, this function should not use 0 as return value to indicate errors.

> +	}
> +#endif
> +
> +	p = &rte_eth_fp_ops[port_id];
> +	qd = p->txq.data[queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> +	if (!rte_eth_dev_is_valid_port(port_id)) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Tx port_id=%u\n", port_id);
> +		rte_errno = ENODEV;
> +		return 0;
> +
> +	if (qd == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
> +			queue_id, port_id);
> +		rte_erno = ENODEV;
> +		return 0;
> +	}
> +#endif
> +
> +	if (p->tx_buf_stash == NULL)
> +		return 0;
> +
> +	return p->tx_buf_stash(qd, rxq_buf_recycle_info);
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Buffer recycle mode can let Tx queue directly put used buffers into Rx
> buffer
> + * ring. This avoids freeing buffers into mempool and allocating buffers from
> + * mempool.

This function description generally describes the buffer recycle mode.

Please update it to describe what this specific function does.

> + *
> + * @param rx_port_id
> + *   Port identifying the receive side.
> + * @param rx_queue_id
> + *   The index of the receive queue identifying the receive side.
> + *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_port_id
> + *   Port identifying the transmit side.
> + * @param tx_queue_id
> + *   The index of the transmit queue identifying the transmit side.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param rxq_recycle_info
> + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> + * @return
> + *   - (0) on success or no recycling buffer.

Why not return the return value of rte_eth_rx_descriptors_refill() instead of 0 on success? (This is a question, not a suggestion.)

Or, if rxq_buf_recycle_info must be valid, the function return type could be void instead of int.

> + *   - (-EINVAL) rxq_recycle_info is NULL.
> + */
> +__rte_experimental
> +static inline int
> +rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
> +		uint16_t tx_port_id, uint16_t tx_queue_id,
> +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> +{
> +	/* The number of recycling buffers. */
> +	uint16_t nb_buf;
> +
> +	if (!rxq_buf_recycle_info)
> +		return -EINVAL;

This is a fast path function. In which situation is this function called with rxq_buf_recycle_info == NULL?

If this function can genuinely be called with rxq_buf_recycle_info == NULL, you should test for (rxq_buf_recycle_info == NULL), not (! rxq_buf_recycle_info). Otherwise, I think RTE_ASSERT(rxq_buf_recycle_info != NULL) is more appropriate.

> +
> +	/* Stash Tx used buffers into Rx buffer ring */
> +	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
> +				rxq_buf_recycle_info);
> +	/* If there are recycling buffers, refill Rx queue descriptors. */
> +	if (nb_buf)
> +		rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
> +					nb_buf);
> +
> +	return 0;
> +}
> +
>  /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice
> diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> index dcf8adab92..a138fd4dbc 100644
> --- a/lib/ethdev/rte_ethdev_core.h
> +++ b/lib/ethdev/rte_ethdev_core.h
> @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq,
> uint16_t offset);
>  /** @internal Check the status of a Tx descriptor */
>  typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> 
> +/** @internal Stash Tx used buffers into RX ring in buffer recycle mode */
> +typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
> +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
> +
> +/** @internal Refill Rx descriptors in buffer recycle mode */
> +typedef uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);

Please add proper function descriptions for the two callbacks above.

> +
>  /**
>   * @internal
>   * Structure used to hold opaque pointers to internal ethdev Rx/Tx
> @@ -90,9 +97,11 @@ struct rte_eth_fp_ops {
>  	eth_rx_queue_count_t rx_queue_count;
>  	/** Check the status of a Rx descriptor. */
>  	eth_rx_descriptor_status_t rx_descriptor_status;
> +	/** Refill Rx descriptors in buffer recycle mode */
> +	eth_rx_descriptors_refill_t rx_descriptors_refill;
>  	/** Rx queues data. */
>  	struct rte_ethdev_qdata rxq;
> -	uintptr_t reserved1[3];
> +	uintptr_t reserved1[4];

You added a function pointer above, so to keep the structure alignment, you must remove one here, not add one:

-	uintptr_t reserved1[3];
+	uintptr_t reserved1[2];

>  	/**@}*/
> 
>  	/**@{*/
> @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
>  	eth_tx_prep_t tx_pkt_prepare;
>  	/** Check the status of a Tx descriptor. */
>  	eth_tx_descriptor_status_t tx_descriptor_status;
> +	/** Stash Tx used buffers into RX ring in buffer recycle mode */
> +	eth_tx_buf_stash_t tx_buf_stash;
>  	/** Tx queues data. */
>  	struct rte_ethdev_qdata txq;
> -	uintptr_t reserved2[3];
> +	uintptr_t reserved2[4];

You added a function pointer above, so to keep the structure alignment, you must remove one here, not add one:

-	uintptr_t reserved1[3];
+	uintptr_t reserved1[2];

>  	/**@}*/
> 
>  } __rte_cache_aligned;

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
  2023-03-30  7:19     ` Morten Brørup
@ 2023-03-30  9:31       ` Feifei Wang
  2023-03-30 15:15         ` Morten Brørup
  2023-03-30 15:58         ` Morten Brørup
  0 siblings, 2 replies; 67+ messages in thread
From: Feifei Wang @ 2023-03-30  9:31 UTC (permalink / raw)
  To: Morten Brørup, thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Thursday, March 30, 2023 3:19 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net; Ferruh
> Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: RE: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Thursday, 30 March 2023 08.30
> >
> 
> [...]
> 
> > +/**
> > + * @internal
> > + * Rx routine for rte_eth_dev_buf_recycle().
> > + * Refill Rx descriptors in buffer recycle mode.
> > + *
> > + * @note
> > + * This API can only be called by rte_eth_dev_buf_recycle().
> > + * Before calling this API, rte_eth_tx_buf_stash() should be
> > + * called to stash Tx used buffers into Rx buffer ring.
> > + *
> > + * When this functionality is not implemented in the driver, the
> > +return
> > + * buffer number is 0.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The index of the receive queue.
> > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + *@param nb
> > + *   The number of Rx descriptors to be refilled.
> > + * @return
> > + *   The number Rx descriptors correct to be refilled.
> > + *   - ENODEV: bad port or queue (only if compiled with debug).
> 
> If you want errors reported by the return value, the function return type
> cannot be uint16_t.
Agree. Actually, in the code path, if errors happen, the function will return 0.
For this description line, I refer to 'rte_eth_tx_prepare' notes. Maybe we should delete
this line.

> 
> > + */
> > +static inline uint16_t rte_eth_rx_descriptors_refill(uint16_t port_id,
> > +		uint16_t queue_id, uint16_t nb)
> > +{
> > +	struct rte_eth_fp_ops *p;
> > +	void *qd;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			"Invalid port_id=%u or queue_id=%u\n",
> > +			port_id, queue_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> 
> If p->rx_descriptors_refill() is likely to return 0, this function should not use 0
> as return value to indicate errors.
However, referring to the DPDK code style in ethdev, most APIs are written like this,
for example 'rte_eth_rx/tx_burst' and 'rte_eth_tx_prep'.

I'm also unsure what the return type should be here, since I want
to both indicate errors and report the number of processed buffers.

> 
> > +	}
> > +#endif
> > +
> > +	p = &rte_eth_fp_ops[port_id];
> > +	qd = p->rxq.data[queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx port_id=%u\n", port_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> > +
> > +	if (qd == NULL) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for
> port_id=%u\n",
> > +			queue_id, port_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> > +	}
> > +#endif
> > +
> > +	if (p->rx_descriptors_refill == NULL)
> > +		return 0;
> > +
> > +	return p->rx_descriptors_refill(qd, nb); }
> > +
> >  /**@{@name Rx hardware descriptor states
> >   * @see rte_eth_rx_descriptor_status
> >   */
> > @@ -6483,6 +6597,122 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t
> queue_id,
> >  	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);  }
> >
> > +/**
> > + * @internal
> > + * Tx routine for rte_eth_dev_buf_recycle().
> > + * Stash Tx used buffers into Rx buffer ring in buffer recycle mode.
> > + *
> > + * @note
> > + * This API can only be called by rte_eth_dev_buf_recycle().
> > + * After calling this API, rte_eth_rx_descriptors_refill() should be
> > + * called to refill Rx ring descriptors.
> > + *
> > + * When this functionality is not implemented in the driver, the
> > +return
> > + * buffer number is 0.
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *   The index of the transmit queue.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param rxq_buf_recycle_info
> > + *   A pointer to a structure of Rx queue buffer ring information in buffer
> > + *   recycle mode.
> > + *
> > + * @return
> > + *   The number buffers correct to be filled in the Rx buffer ring.
> > + *   - ENODEV: bad port or queue (only if compiled with debug).
> 
> If you want errors reported by the return value, the function return type
> cannot be uint16_t.
> 
> > + */
> > +static inline uint16_t rte_eth_tx_buf_stash(uint16_t port_id,
> > +uint16_t
> > queue_id,
> > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> {
> > +	struct rte_eth_fp_ops *p;
> > +	void *qd;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			"Invalid port_id=%u or queue_id=%u\n",
> > +			port_id, queue_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> 
> If p->tx_buf_stash() is likely to return 0, this function should not use 0 as
> return value to indicate errors.
> 
> > +	}
> > +#endif
> > +
> > +	p = &rte_eth_fp_ops[port_id];
> > +	qd = p->txq.data[queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx port_id=%u\n", port_id);
> > +		rte_errno = ENODEV;
> > +		return 0;
> > +
> > +	if (qd == NULL) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for
> port_id=%u\n",
> > +			queue_id, port_id);
> > +		rte_erno = ENODEV;
> > +		return 0;
> > +	}
> > +#endif
> > +
> > +	if (p->tx_buf_stash == NULL)
> > +		return 0;
> > +
> > +	return p->tx_buf_stash(qd, rxq_buf_recycle_info); }
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Buffer recycle mode can let Tx queue directly put used buffers
> > +into Rx
> > buffer
> > + * ring. This avoids freeing buffers into mempool and allocating
> > + buffers from
> > + * mempool.
> 
> This function description generally describes the buffer recycle mode.
> 
> Please update it to describe what this specific function does.
Ack.
> 
> > + *
> > + * @param rx_port_id
> > + *   Port identifying the receive side.
> > + * @param rx_queue_id
> > + *   The index of the receive queue identifying the receive side.
> > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + *   Port identifying the transmit side.
> > + * @param tx_queue_id
> > + *   The index of the transmit queue identifying the transmit side.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param rxq_recycle_info
> > + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> > + * @return
> > + *   - (0) on success or no recycling buffer.
> 
> Why not return the return value of rte_eth_rx_descriptors_refill() instead of
> 0 on success? (This is a question, not a suggestion.)
> 
> Or, if rxq_buf_recycle_info must be valid, the function return type could be
> void instead of int.
> 
Sometimes users may forget to allocate room for 'rxq_buf_recycle_info'
and to call the 'rte_rxq_buf_recycle_info_get' API. Thus, I think we need to check for this.

Furthermore, the return value of this API should indicate success or failure.
> > + *   - (-EINVAL) rxq_recycle_info is NULL.
> > + */
> > +__rte_experimental
> > +static inline int
> > +rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
> > +		uint16_t tx_port_id, uint16_t tx_queue_id,
> > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> {
> > +	/* The number of recycling buffers. */
> > +	uint16_t nb_buf;
> > +
> > +	if (!rxq_buf_recycle_info)
> > +		return -EINVAL;
> 
> This is a fast path function. In which situation is this function called with
> rxq_buf_recycle_info == NULL?
> 
> If this function can genuinely be called with rxq_buf_recycle_info == NULL,
> you should test for (rxq_buf_recycle_info == NULL), not (!
> rxq_buf_recycle_info). Otherwise, I think
> RTE_ASSERT(rxq_buf_recycle_info != NULL) is more appropriate.
Agree. We should use ' RTE_ASSERT(rxq_buf_recycle_info != NULL)'.
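For reference, a minimal sketch of that substitution (same prototype and field
names as in this patch; illustration only):

	/* Passing a NULL rxq_buf_recycle_info is an application bug, so
	 * catch it with a debug-build assertion instead of a runtime branch.
	 */
	RTE_ASSERT(rxq_buf_recycle_info != NULL);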
> 
> > +
> > +	/* Stash Tx used buffers into Rx buffer ring */
> > +	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
> > +				rxq_buf_recycle_info);
> > +	/* If there are recycling buffers, refill Rx queue descriptors. */
> > +	if (nb_buf)
> > +		rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
> > +					nb_buf);
> > +
> > +	return 0;
> > +}
> > +
> >  /**
> >   * @warning
> >   * @b EXPERIMENTAL: this API may change without prior notice diff
> > --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> > index dcf8adab92..a138fd4dbc 100644
> > --- a/lib/ethdev/rte_ethdev_core.h
> > +++ b/lib/ethdev/rte_ethdev_core.h
> > @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void
> > *rxq, uint16_t offset);
> >  /** @internal Check the status of a Tx descriptor */  typedef int
> > (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> >
> > +/** @internal Stash Tx used buffers into RX ring in buffer recycle
> > +mode */ typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
> > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
> > +
> > +/** @internal Refill Rx descriptors in buffer recycle mode */ typedef
> > +uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
> 
> Please add proper function descriptions for the two callbacks above.
Ack.
> 
> > +
> >  /**
> >   * @internal
> >   * Structure used to hold opaque pointers to internal ethdev Rx/Tx @@
> > -90,9 +97,11 @@ struct rte_eth_fp_ops {
> >  	eth_rx_queue_count_t rx_queue_count;
> >  	/** Check the status of a Rx descriptor. */
> >  	eth_rx_descriptor_status_t rx_descriptor_status;
> > +	/** Refill Rx descriptors in buffer recycle mode */
> > +	eth_rx_descriptors_refill_t rx_descriptors_refill;
> >  	/** Rx queues data. */
> >  	struct rte_ethdev_qdata rxq;
> > -	uintptr_t reserved1[3];
> > +	uintptr_t reserved1[4];
> 
> You added a function pointer above, so to keep the structure alignment, you
> must remove one here, not add one:
> 
> -	uintptr_t reserved1[3];
> +	uintptr_t reserved1[2];
> 
Ack.
> >  	/**@}*/
> >
> >  	/**@{*/
> > @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
> >  	eth_tx_prep_t tx_pkt_prepare;
> >  	/** Check the status of a Tx descriptor. */
> >  	eth_tx_descriptor_status_t tx_descriptor_status;
> > +	/** Stash Tx used buffers into RX ring in buffer recycle mode */
> > +	eth_tx_buf_stash_t tx_buf_stash;
> >  	/** Tx queues data. */
> >  	struct rte_ethdev_qdata txq;
> > -	uintptr_t reserved2[3];
> > +	uintptr_t reserved2[4];
> 
> You added a function pointer above, so to keep the structure alignment, you
> must remove one here, not add one:
> 
> -	uintptr_t reserved1[3];
> +	uintptr_t reserved1[2];
> 
Ack.
> >  	/**@}*/
> >
> >  } __rte_cache_aligned;

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v5 0/3] Recycle buffers from Tx to Rx
  2023-03-30  6:29 ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Feifei Wang
                     ` (2 preceding siblings ...)
  2023-03-30  6:29   ` [PATCH v5 3/3] net/ixgbe: " Feifei Wang
@ 2023-03-30 15:04   ` Stephen Hemminger
  2023-04-03  2:48     ` Feifei Wang
  2023-04-19 14:56   ` Ferruh Yigit
  4 siblings, 1 reply; 67+ messages in thread
From: Stephen Hemminger @ 2023-03-30 15:04 UTC (permalink / raw)
  To: Feifei Wang; +Cc: dev, konstantin.v.ananyev, mb, nd

On Thu, 30 Mar 2023 14:29:36 +0800
Feifei Wang <feifei.wang2@arm.com> wrote:

> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache. And we can call this mode as buffer
> recycle mode.


My naive reading of this is that lcore cache update is slow on ARM
so you are introducing yet another cache. Perhaps a better solution
would be to figure out/optimize the lcore cache to work better.

Adding another layer of abstraction is not going to help everyone
and the implementation you chose requires modifications to drivers
to get it to work.

In current form, this is not acceptable.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
  2023-03-30  9:31       ` Feifei Wang
@ 2023-03-30 15:15         ` Morten Brørup
  2023-03-30 15:58         ` Morten Brørup
  1 sibling, 0 replies; 67+ messages in thread
From: Morten Brørup @ 2023-03-30 15:15 UTC (permalink / raw)
  To: Feifei Wang, thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd

> From: Feifei Wang [mailto:Feifei.Wang2@arm.com]
> Sent: Thursday, 30 March 2023 11.31
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Thursday, March 30, 2023 3:19 PM
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Thursday, 30 March 2023 08.30
> > >
> >
> > [...]
> >
> > > +/**
> > > + * @internal
> > > + * Rx routine for rte_eth_dev_buf_recycle().
> > > + * Refill Rx descriptors in buffer recycle mode.
> > > + *
> > > + * @note
> > > + * This API can only be called by rte_eth_dev_buf_recycle().
> > > + * Before calling this API, rte_eth_tx_buf_stash() should be
> > > + * called to stash Tx used buffers into Rx buffer ring.
> > > + *
> > > + * When this functionality is not implemented in the driver, the
> > > +return
> > > + * buffer number is 0.
> > > + *
> > > + * @param port_id
> > > + *   The port identifier of the Ethernet device.
> > > + * @param queue_id
> > > + *   The index of the receive queue.
> > > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> > supplied
> > > + *   to rte_eth_dev_configure().
> > > + *@param nb
> > > + *   The number of Rx descriptors to be refilled.
> > > + * @return
> > > + *   The number Rx descriptors correct to be refilled.
> > > + *   - ENODEV: bad port or queue (only if compiled with debug).
> >
> > If you want errors reported by the return value, the function return type
> > cannot be uint16_t.
> Agree. Actually, in the code path, if errors happen, the function will return
> 0.
> For this description line, I refer to 'rte_eth_tx_prepare' notes. Maybe we
> should delete
> this line.
> 
> >
> > > + */
> > > +static inline uint16_t rte_eth_rx_descriptors_refill(uint16_t port_id,
> > > +		uint16_t queue_id, uint16_t nb)
> > > +{
> > > +	struct rte_eth_fp_ops *p;
> > > +	void *qd;
> > > +
> > > +#ifdef RTE_ETHDEV_DEBUG_RX
> > > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > > +		RTE_ETHDEV_LOG(ERR,
> > > +			"Invalid port_id=%u or queue_id=%u\n",
> > > +			port_id, queue_id);
> > > +		rte_errno = ENODEV;
> > > +		return 0;
> >
> > If p->rx_descriptors_refill() is likely to return 0, this function should
> not use 0
> > as return value to indicate errors.
> However, referring to the DPDK code style in ethdev, most APIs are written
> like this. For example, 'rte_eth_rx/tx_burst' and 'rte_eth_tx_prep'.
> 
> I'm also unsure what the return type should be here, because I want
> to indicate errors and also report the number of processed buffers.

OK. Thanks for the references.

Looking at rte_eth_rx/tx_burst(), you could follow the same conventions here, i.e.:
- Use uint16_t as return type.
- Return 0 on error.
- Do not set rte_errno.
- Remove the "ENODEV" line from the @return description.
- Use RTE_ETHDEV_LOG(ERR,...) as the only method to indicate errors.

I now see that you follow the convention of rte_eth_tx_prepare(). This is also perfectly fine; then you just need to update the description of @return to mention that the error value is set in rte_errno if a value less than 'nb' is returned.
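As an illustration, the @return text for rte_eth_rx_descriptors_refill() could
then read roughly like this (wording is only a suggestion):

 * @return
 *   The number of Rx descriptors actually refilled.
 *   If the return value is less than 'nb', the error value is set in
 *   rte_errno (e.g. ENODEV for an invalid port or queue, only when
 *   compiled with RTE_ETHDEV_DEBUG_RX).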

> 
> >
> > > +	}
> > > +#endif
> > > +
> > > +	p = &rte_eth_fp_ops[port_id];
> > > +	qd = p->rxq.data[queue_id];
> > > +
> > > +#ifdef RTE_ETHDEV_DEBUG_RX
> > > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx port_id=%u\n", port_id);
> > > +		rte_errno = ENODEV;
> > > +		return 0;
> > > +
> > > +	if (qd == NULL) {
> > > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for
> > port_id=%u\n",
> > > +			queue_id, port_id);
> > > +		rte_errno = ENODEV;
> > > +		return 0;
> > > +	}
> > > +#endif
> > > +
> > > +	if (p->rx_descriptors_refill == NULL)
> > > +		return 0;
> > > +
> > > +	return p->rx_descriptors_refill(qd, nb); }

When does p->rx_descriptors_refill() return anything else than 'nb'?

If p->rx_descriptors_refill() always succeeds (and thus always returns 'nb'), you could make its return type void. And thus, you could also make the return type of rte_eth_rx_descriptors_refill() void.
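In that case the prototypes could simply become (illustrative only):

typedef void (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);

static inline void
rte_eth_rx_descriptors_refill(uint16_t port_id, uint16_t queue_id, uint16_t nb);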

> > > +
> > >  /**@{@name Rx hardware descriptor states
> > >   * @see rte_eth_rx_descriptor_status
> > >   */
> > > @@ -6483,6 +6597,122 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t
> > queue_id,
> > >  	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);  }
> > >
> > > +/**
> > > + * @internal
> > > + * Tx routine for rte_eth_dev_buf_recycle().
> > > + * Stash Tx used buffers into Rx buffer ring in buffer recycle mode.
> > > + *
> > > + * @note
> > > + * This API can only be called by rte_eth_dev_buf_recycle().
> > > + * After calling this API, rte_eth_rx_descriptors_refill() should be
> > > + * called to refill Rx ring descriptors.
> > > + *
> > > + * When this functionality is not implemented in the driver, the
> > > +return
> > > + * buffer number is 0.
> > > + *
> > > + * @param port_id
> > > + *   The port identifier of the Ethernet device.
> > > + * @param queue_id
> > > + *   The index of the transmit queue.
> > > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> > supplied
> > > + *   to rte_eth_dev_configure().
> > > + * @param rxq_buf_recycle_info
> > > + *   A pointer to a structure of Rx queue buffer ring information in
> buffer
> > > + *   recycle mode.
> > > + *
> > > + * @return
> > > + *   The number buffers correct to be filled in the Rx buffer ring.
> > > + *   - ENODEV: bad port or queue (only if compiled with debug).
> >
> > If you want errors reported by the return value, the function return type
> > cannot be uint16_t.

I now see that you follow the convention of rte_eth_tx_prepare() here too.

This is perfectly fine; then you just need to update the description of @return to mention that the error value is set in rte_errno if a value less than 'nb' is returned.

> >
> > > + */
> > > +static inline uint16_t rte_eth_tx_buf_stash(uint16_t port_id,
> > > +uint16_t
> > > queue_id,
> > > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> > {
> > > +	struct rte_eth_fp_ops *p;
> > > +	void *qd;
> > > +
> > > +#ifdef RTE_ETHDEV_DEBUG_TX
> > > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > > +		RTE_ETHDEV_LOG(ERR,
> > > +			"Invalid port_id=%u or queue_id=%u\n",
> > > +			port_id, queue_id);
> > > +		rte_errno = ENODEV;
> > > +		return 0;
> >
> > If p->tx_buf_stash() is likely to return 0, this function should not use 0
> as
> > return value to indicate errors.

I now see that you follow the convention of rte_eth_tx_prepare() here too. Then please ignore my comment about using 0 as return value on errors for this function.

> >
> > > +	}
> > > +#endif
> > > +
> > > +	p = &rte_eth_fp_ops[port_id];
> > > +	qd = p->txq.data[queue_id];
> > > +
> > > +#ifdef RTE_ETHDEV_DEBUG_TX
> > > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx port_id=%u\n", port_id);
> > > +		rte_errno = ENODEV;
> > > +		return 0;
> > > +
> > > +	if (qd == NULL) {
> > > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for
> > port_id=%u\n",
> > > +			queue_id, port_id);
> > > +		rte_erno = ENODEV;
> > > +		return 0;
> > > +	}
> > > +#endif
> > > +
> > > +	if (p->tx_buf_stash == NULL)
> > > +		return 0;
> > > +
> > > +	return p->tx_buf_stash(qd, rxq_buf_recycle_info); }
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > > +notice
> > > + *
> > > + * Buffer recycle mode can let Tx queue directly put used buffers
> > > +into Rx
> > > buffer
> > > + * ring. This avoids freeing buffers into mempool and allocating
> > > + buffers from
> > > + * mempool.
> >
> > This function description generally describes the buffer recycle mode.
> >
> > Please update it to describe what this specific function does.
> Ack.
> >
> > > + *
> > > + * @param rx_port_id
> > > + *   Port identifying the receive side.
> > > + * @param rx_queue_id
> > > + *   The index of the receive queue identifying the receive side.
> > > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> > supplied
> > > + *   to rte_eth_dev_configure().
> > > + * @param tx_port_id
> > > + *   Port identifying the transmit side.
> > > + * @param tx_queue_id
> > > + *   The index of the transmit queue identifying the transmit side.
> > > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> > supplied
> > > + *   to rte_eth_dev_configure().
> > > + * @param rxq_recycle_info
> > > + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be
> filled.
> > > + * @return
> > > + *   - (0) on success or no recycling buffer.
> >
> > Why not return the return value of rte_eth_rx_descriptors_refill() instead
> of
> > 0 on success? (This is a question, not a suggestion.)
> >
> > Or, if rxq_buf_recycle_info must be valid, the function return type could be
> > void instead of int.
> >
> Sometimes users may forget to allocate room for 'rxq_buf_recycle_info'
> and to call the 'rte_rxq_buf_recycle_info_get' API. Thus, I think we need
> to check for this.

If the user forgets to allocate the rxq_buf_recycle_info, it is a serious bug in the user's application.

We don't need to handle such bugs at runtime.

> 
> Furthermore, the return value of this API should indicate success or failure.

If rxq_buf_recycle_info is not NULL, this function will always succeed. So there is no need for a return value.

If you want this function to return something, it could return nb_buf, so the application can use it for telemetry purposes or similar.
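A sketch of that variant, reusing the names proposed in this series (purely
illustrative, not a final implementation):

__rte_experimental
static inline uint16_t
rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
		uint16_t tx_port_id, uint16_t tx_queue_id,
		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
{
	/* The number of recycling buffers. */
	uint16_t nb_buf;

	/* Misuse is an application bug; no runtime check needed. */
	RTE_ASSERT(rxq_buf_recycle_info != NULL);

	/* Stash Tx used buffers into the Rx buffer ring. */
	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
				rxq_buf_recycle_info);

	/* If buffers were recycled, refill the Rx descriptors accordingly. */
	if (nb_buf)
		nb_buf = rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
					nb_buf);

	/* Return the recycled buffer count, e.g. for telemetry. */
	return nb_buf;
}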

> 
> > > + *   - (-EINVAL) rxq_recycle_info is NULL.
> > > + */
> > > +__rte_experimental
> > > +static inline int
> > > +rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
> > > +		uint16_t tx_port_id, uint16_t tx_queue_id,
> > > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> > {
> > > +	/* The number of recycling buffers. */
> > > +	uint16_t nb_buf;
> > > +
> > > +	if (!rxq_buf_recycle_info)
> > > +		return -EINVAL;
> >
> > This is a fast path function. In which situation is this function called
> with
> > rxq_buf_recycle_info == NULL?
> >
> > If this function can genuinely be called with rxq_buf_recycle_info == NULL,
> > you should test for (rxq_buf_recycle_info == NULL), not (!
> > rxq_buf_recycle_info). Otherwise, I think
> > RTE_ASSERT(rxq_buf_recycle_info != NULL) is more appropriate.
> Agree. We should use ' RTE_ASSERT(rxq_buf_recycle_info != NULL)'.
> >
> > > +
> > > +	/* Stash Tx used buffers into Rx buffer ring */
> > > +	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
> > > +				rxq_buf_recycle_info);
> > > +	/* If there are recycling buffers, refill Rx queue descriptors. */
> > > +	if (nb_buf)
> > > +		rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
> > > +					nb_buf);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  /**
> > >   * @warning
> > >   * @b EXPERIMENTAL: this API may change without prior notice diff
> > > --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> > > index dcf8adab92..a138fd4dbc 100644
> > > --- a/lib/ethdev/rte_ethdev_core.h
> > > +++ b/lib/ethdev/rte_ethdev_core.h
> > > @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void
> > > *rxq, uint16_t offset);
> > >  /** @internal Check the status of a Tx descriptor */  typedef int
> > > (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> > >
> > > +/** @internal Stash Tx used buffers into RX ring in buffer recycle
> > > +mode */ typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
> > > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
> > > +
> > > +/** @internal Refill Rx descriptors in buffer recycle mode */ typedef
> > > +uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
> >
> > Please add proper function descriptions for the two callbacks above.
> Ack.
> >
> > > +
> > >  /**
> > >   * @internal
> > >   * Structure used to hold opaque pointers to internal ethdev Rx/Tx @@
> > > -90,9 +97,11 @@ struct rte_eth_fp_ops {
> > >  	eth_rx_queue_count_t rx_queue_count;
> > >  	/** Check the status of a Rx descriptor. */
> > >  	eth_rx_descriptor_status_t rx_descriptor_status;
> > > +	/** Refill Rx descriptors in buffer recycle mode */
> > > +	eth_rx_descriptors_refill_t rx_descriptors_refill;
> > >  	/** Rx queues data. */
> > >  	struct rte_ethdev_qdata rxq;
> > > -	uintptr_t reserved1[3];
> > > +	uintptr_t reserved1[4];
> >
> > You added a function pointer above, so to keep the structure alignment, you
> > must remove one here, not add one:
> >
> > -	uintptr_t reserved1[3];
> > +	uintptr_t reserved1[2];
> >
> Ack.
> > >  	/**@}*/
> > >
> > >  	/**@{*/
> > > @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
> > >  	eth_tx_prep_t tx_pkt_prepare;
> > >  	/** Check the status of a Tx descriptor. */
> > >  	eth_tx_descriptor_status_t tx_descriptor_status;
> > > +	/** Stash Tx used buffers into RX ring in buffer recycle mode */
> > > +	eth_tx_buf_stash_t tx_buf_stash;
> > >  	/** Tx queues data. */
> > >  	struct rte_ethdev_qdata txq;
> > > -	uintptr_t reserved2[3];
> > > +	uintptr_t reserved2[4];
> >
> > You added a function pointer above, so to keep the structure alignment, you
> > must remove one here, not add one:
> >
> > -	uintptr_t reserved1[3];
> > +	uintptr_t reserved1[2];
> >
> Ack.
> > >  	/**@}*/
> > >
> > >  } __rte_cache_aligned;


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
  2023-03-30  9:31       ` Feifei Wang
  2023-03-30 15:15         ` Morten Brørup
@ 2023-03-30 15:58         ` Morten Brørup
  2023-04-26  6:59           ` Feifei Wang
  1 sibling, 1 reply; 67+ messages in thread
From: Morten Brørup @ 2023-03-30 15:58 UTC (permalink / raw)
  To: Feifei Wang, thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd

> From: Morten Brørup
> Sent: Thursday, 30 March 2023 17.15
> 
> > From: Feifei Wang [mailto:Feifei.Wang2@arm.com]
> > Sent: Thursday, 30 March 2023 11.31
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: Thursday, March 30, 2023 3:19 PM
> > >
> > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > Sent: Thursday, 30 March 2023 08.30
> > > >
> > >
> > > [...]
> > >
> > > > +/**
> > > > + * @internal
> > > > + * Rx routine for rte_eth_dev_buf_recycle().
> > > > + * Refill Rx descriptors in buffer recycle mode.
> > > > + *
> > > > + * @note
> > > > + * This API can only be called by rte_eth_dev_buf_recycle().
> > > > + * Before calling this API, rte_eth_tx_buf_stash() should be
> > > > + * called to stash Tx used buffers into Rx buffer ring.
> > > > + *
> > > > + * When this functionality is not implemented in the driver, the
> > > > +return
> > > > + * buffer number is 0.
> > > > + *
> > > > + * @param port_id
> > > > + *   The port identifier of the Ethernet device.
> > > > + * @param queue_id
> > > > + *   The index of the receive queue.
> > > > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> > > supplied
> > > > + *   to rte_eth_dev_configure().
> > > > + *@param nb
> > > > + *   The number of Rx descriptors to be refilled.
> > > > + * @return
> > > > + *   The number Rx descriptors correct to be refilled.
> > > > + *   - ENODEV: bad port or queue (only if compiled with debug).
> > >
> > > If you want errors reported by the return value, the function return type
> > > cannot be uint16_t.
> > Agree. Actually, in the code path, if errors happen, the function will
> return
> > 0.
> > For this description line, I refer to 'rte_eth_tx_prepare' notes. Maybe we
> > should delete
> > this line.
> >
> > >
> > > > + */
> > > > +static inline uint16_t rte_eth_rx_descriptors_refill(uint16_t port_id,
> > > > +		uint16_t queue_id, uint16_t nb)
> > > > +{
> > > > +	struct rte_eth_fp_ops *p;
> > > > +	void *qd;
> > > > +
> > > > +#ifdef RTE_ETHDEV_DEBUG_RX
> > > > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > > > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > > > +		RTE_ETHDEV_LOG(ERR,
> > > > +			"Invalid port_id=%u or queue_id=%u\n",
> > > > +			port_id, queue_id);
> > > > +		rte_errno = ENODEV;
> > > > +		return 0;
> > >
> > > If p->rx_descriptors_refill() is likely to return 0, this function should
> > not use 0
> > > as return value to indicate errors.
> > However, referring to the DPDK code style in ethdev, most APIs are written
> > like this. For example, 'rte_eth_rx/tx_burst' and 'rte_eth_tx_prep'.
> >
> > I'm also unsure what the return type should be here, because I want
> > to indicate errors and also report the number of processed buffers.
> 
> OK. Thanks for the references.
> 
> Looking at rte_eth_rx/tx_burst(), you could follow the same conventions here,
> i.e.:
> - Use uint16_t as return type.
> - Return 0 on error.
> - Do not set rte_errno.
> - Remove the "ENODEV" line from the @return description.
> - Use RTE_ETHDEV_LOG(ERR,...) as the only method to indicate errors.
> 
> I now see that you follow the convention of rte_eth_tx_prepare(). This is also
> perfectly fine; then you just need to update the description of @return to
> mention that the error value is set in rte_errno if a value less than 'nb' is
> returned.

After further consideration, I have changed my mind:

The primary purpose of rte_eth_tx_prepare() is to test if a packet burst is valid, so the ability to return an error value is a natural requirement.

This is not the purpose of your functions. Your functions resemble rte_eth_rx/tx_burst(), where there is no requirement to return an error value. So you should follow the convention of rte_eth_rx/tx_burst(), as I just suggested.
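Under the rte_eth_rx/tx_burst() convention, the @return description could then
simply be (again, wording is only a suggestion):

 * @return
 *   The number of Rx descriptors actually refilled, or 0 if nothing was
 *   refilled (including the invalid port/queue case when compiled with
 *   RTE_ETHDEV_DEBUG_RX; the error is then only reported via RTE_ETHDEV_LOG).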

> 
> >
> > >
> > > > +	}
> > > > +#endif
> > > > +
> > > > +	p = &rte_eth_fp_ops[port_id];
> > > > +	qd = p->rxq.data[queue_id];
> > > > +
> > > > +#ifdef RTE_ETHDEV_DEBUG_RX
> > > > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > > > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx port_id=%u\n", port_id);
> > > > +		rte_errno = ENODEV;
> > > > +		return 0;
> > > > +
> > > > +	if (qd == NULL) {
> > > > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for
> > > port_id=%u\n",
> > > > +			queue_id, port_id);
> > > > +		rte_errno = ENODEV;
> > > > +		return 0;
> > > > +	}
> > > > +#endif
> > > > +
> > > > +	if (p->rx_descriptors_refill == NULL)
> > > > +		return 0;
> > > > +
> > > > +	return p->rx_descriptors_refill(qd, nb); }
> 
> When does p->rx_descriptors_refill() return anything else than 'nb'?
> 
> If p->rx_descriptors_refill() always succeeds (and thus always returns 'nb'),
> you could make its return type void. And thus, you could also make the return
> type of rte_eth_rx_descriptors_refill() void.
> 
> > > > +
> > > >  /**@{@name Rx hardware descriptor states
> > > >   * @see rte_eth_rx_descriptor_status
> > > >   */
> > > > @@ -6483,6 +6597,122 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t
> > > queue_id,
> > > >  	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);  }
> > > >
> > > > +/**
> > > > + * @internal
> > > > + * Tx routine for rte_eth_dev_buf_recycle().
> > > > + * Stash Tx used buffers into Rx buffer ring in buffer recycle mode.
> > > > + *
> > > > + * @note
> > > > + * This API can only be called by rte_eth_dev_buf_recycle().
> > > > + * After calling this API, rte_eth_rx_descriptors_refill() should be
> > > > + * called to refill Rx ring descriptors.
> > > > + *
> > > > + * When this functionality is not implemented in the driver, the
> > > > +return
> > > > + * buffer number is 0.
> > > > + *
> > > > + * @param port_id
> > > > + *   The port identifier of the Ethernet device.
> > > > + * @param queue_id
> > > > + *   The index of the transmit queue.
> > > > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> > > supplied
> > > > + *   to rte_eth_dev_configure().
> > > > + * @param rxq_buf_recycle_info
> > > > + *   A pointer to a structure of Rx queue buffer ring information in
> > buffer
> > > > + *   recycle mode.
> > > > + *
> > > > + * @return
> > > > + *   The number buffers correct to be filled in the Rx buffer ring.
> > > > + *   - ENODEV: bad port or queue (only if compiled with debug).
> > >
> > > If you want errors reported by the return value, the function return type
> > > cannot be uint16_t.
> 
> I now see that you follow the convention of rte_eth_tx_prepare() here too.
> 
> This is perfectly fine; then you just need to update the description of
> @return to mention that the error value is set in rte_errno if a value less
> than 'nb' is returned.

Same comment about conventions as above: This function is more like rte_eth_rx/tx_burst() than rte_eth_tx_prepare(), so follow the convention of rte_eth_rx/tx_burst() instead.

> 
> > >
> > > > + */
> > > > +static inline uint16_t rte_eth_tx_buf_stash(uint16_t port_id,
> > > > +uint16_t
> > > > queue_id,
> > > > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> > > {
> > > > +	struct rte_eth_fp_ops *p;
> > > > +	void *qd;
> > > > +
> > > > +#ifdef RTE_ETHDEV_DEBUG_TX
> > > > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > > > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > > > +		RTE_ETHDEV_LOG(ERR,
> > > > +			"Invalid port_id=%u or queue_id=%u\n",
> > > > +			port_id, queue_id);
> > > > +		rte_errno = ENODEV;
> > > > +		return 0;
> > >
> > > If p->tx_buf_stash() is likely to return 0, this function should not use 0
> > as
> > > return value to indicate errors.
> 
> I now see that you follow the convention of rte_eth_tx_prepare() here too.
> Then please ignore my comment about using 0 as return value on errors for this
> function.

Same comment about conventions as above: This function is more like rte_eth_rx/tx_burst() than rte_eth_tx_prepare(), so follow the convention of rte_eth_rx/tx_burst() instead.

> 
> > >
> > > > +	}
> > > > +#endif
> > > > +
> > > > +	p = &rte_eth_fp_ops[port_id];
> > > > +	qd = p->txq.data[queue_id];
> > > > +
> > > > +#ifdef RTE_ETHDEV_DEBUG_TX
> > > > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > > > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx port_id=%u\n", port_id);
> > > > +		rte_errno = ENODEV;
> > > > +		return 0;
> > > > +
> > > > +	if (qd == NULL) {
> > > > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for
> > > port_id=%u\n",
> > > > +			queue_id, port_id);
> > > > +		rte_erno = ENODEV;
> > > > +		return 0;
> > > > +	}
> > > > +#endif
> > > > +
> > > > +	if (p->tx_buf_stash == NULL)
> > > > +		return 0;
> > > > +
> > > > +	return p->tx_buf_stash(qd, rxq_buf_recycle_info); }
> > > > +
> > > > +/**
> > > > + * @warning
> > > > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > > > +notice
> > > > + *
> > > > + * Buffer recycle mode can let Tx queue directly put used buffers
> > > > +into Rx
> > > > buffer
> > > > + * ring. This avoids freeing buffers into mempool and allocating
> > > > + buffers from
> > > > + * mempool.
> > >
> > > This function description generally describes the buffer recycle mode.
> > >
> > > Please update it to describe what this specific function does.
> > Ack.
> > >
> > > > + *
> > > > + * @param rx_port_id
> > > > + *   Port identifying the receive side.
> > > > + * @param rx_queue_id
> > > > + *   The index of the receive queue identifying the receive side.
> > > > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> > > supplied
> > > > + *   to rte_eth_dev_configure().
> > > > + * @param tx_port_id
> > > > + *   Port identifying the transmit side.
> > > > + * @param tx_queue_id
> > > > + *   The index of the transmit queue identifying the transmit side.
> > > > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> > > supplied
> > > > + *   to rte_eth_dev_configure().
> > > > + * @param rxq_recycle_info
> > > > + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be
> > filled.
> > > > + * @return
> > > > + *   - (0) on success or no recycling buffer.
> > >
> > > Why not return the return value of rte_eth_rx_descriptors_refill() instead
> > of
> > > 0 on success? (This is a question, not a suggestion.)
> > >
> > > Or, if rxq_buf_recycle_info must be valid, the function return type could
> be
> > > void instead of int.
> > >
> > Sometimes users may forget to allocate room for 'rxq_buf_recycle_info'
> > and to call the 'rte_rxq_buf_recycle_info_get' API. Thus, I think we need
> > to check for this.
> 
> If the user forgets to allocate the rxq_buf_recycle_info, it is a serious bug
> in the user's application.
> 
> We don't need to handle such bugs at runtime.
> 
> >
> > Furthermore, the return value of this API should indicate success or failure.
> 
> If rxq_buf_recycle_info is not NULL, this function will always succeed. So
> there is no need for a return value.
> 
> If you want this function to return something, it could return nb_buf, so the
> application can use it for telemetry purposes or similar.
> 
> >
> > > > + *   - (-EINVAL) rxq_recycle_info is NULL.
> > > > + */
> > > > +__rte_experimental
> > > > +static inline int
> > > > +rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
> > > > +		uint16_t tx_port_id, uint16_t tx_queue_id,
> > > > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> > > {
> > > > +	/* The number of recycling buffers. */
> > > > +	uint16_t nb_buf;
> > > > +
> > > > +	if (!rxq_buf_recycle_info)
> > > > +		return -EINVAL;
> > >
> > > This is a fast path function. In which situation is this function called
> > with
> > > rxq_buf_recycle_info == NULL?
> > >
> > > If this function can genuinely be called with rxq_buf_recycle_info ==
> NULL,
> > > you should test for (rxq_buf_recycle_info == NULL), not (!
> > > rxq_buf_recycle_info). Otherwise, I think
> > > RTE_ASSERT(rxq_buf_recycle_info != NULL) is more appropriate.
> > Agree. We should use ' RTE_ASSERT(rxq_buf_recycle_info != NULL)'.
> > >
> > > > +
> > > > +	/* Stash Tx used buffers into Rx buffer ring */
> > > > +	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
> > > > +				rxq_buf_recycle_info);
> > > > +	/* If there are recycling buffers, refill Rx queue descriptors.
> */
> > > > +	if (nb_buf)
> > > > +		rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
> > > > +					nb_buf);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  /**
> > > >   * @warning
> > > >   * @b EXPERIMENTAL: this API may change without prior notice diff
> > > > --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> > > > index dcf8adab92..a138fd4dbc 100644
> > > > --- a/lib/ethdev/rte_ethdev_core.h
> > > > +++ b/lib/ethdev/rte_ethdev_core.h
> > > > @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void
> > > > *rxq, uint16_t offset);
> > > >  /** @internal Check the status of a Tx descriptor */  typedef int
> > > > (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> > > >
> > > > +/** @internal Stash Tx used buffers into RX ring in buffer recycle
> > > > +mode */ typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
> > > > +		struct rte_eth_rxq_buf_recycle_info
> *rxq_buf_recycle_info);
> > > > +
> > > > +/** @internal Refill Rx descriptors in buffer recycle mode */ typedef
> > > > +uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
> > >
> > > Please add proper function descriptions for the two callbacks above.
> > Ack.
> > >
> > > > +
> > > >  /**
> > > >   * @internal
> > > >   * Structure used to hold opaque pointers to internal ethdev Rx/Tx @@
> > > > -90,9 +97,11 @@ struct rte_eth_fp_ops {
> > > >  	eth_rx_queue_count_t rx_queue_count;
> > > >  	/** Check the status of a Rx descriptor. */
> > > >  	eth_rx_descriptor_status_t rx_descriptor_status;
> > > > +	/** Refill Rx descriptors in buffer recycle mode */
> > > > +	eth_rx_descriptors_refill_t rx_descriptors_refill;
> > > >  	/** Rx queues data. */
> > > >  	struct rte_ethdev_qdata rxq;
> > > > -	uintptr_t reserved1[3];
> > > > +	uintptr_t reserved1[4];
> > >
> > > You added a function pointer above, so to keep the structure alignment,
> you
> > > must remove one here, not add one:
> > >
> > > -	uintptr_t reserved1[3];
> > > +	uintptr_t reserved1[2];
> > >
> > Ack.
> > > >  	/**@}*/
> > > >
> > > >  	/**@{*/
> > > > @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
> > > >  	eth_tx_prep_t tx_pkt_prepare;
> > > >  	/** Check the status of a Tx descriptor. */
> > > >  	eth_tx_descriptor_status_t tx_descriptor_status;
> > > > +	/** Stash Tx used buffers into RX ring in buffer recycle mode */
> > > > +	eth_tx_buf_stash_t tx_buf_stash;
> > > >  	/** Tx queues data. */
> > > >  	struct rte_ethdev_qdata txq;
> > > > -	uintptr_t reserved2[3];
> > > > +	uintptr_t reserved2[4];
> > >
> > > You added a function pointer above, so to keep the structure alignment,
> you
> > > must remove one here, not add one:
> > >
> > > -	uintptr_t reserved1[3];
> > > +	uintptr_t reserved1[2];
> > >
> > Ack.
> > > >  	/**@}*/
> > > >
> > > >  } __rte_cache_aligned;


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v5 0/3] Recycle buffers from Tx to Rx
  2023-03-30 15:04   ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Stephen Hemminger
@ 2023-04-03  2:48     ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-04-03  2:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, konstantin.v.ananyev, mb, nd, nd

Thanks for the review.

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, March 30, 2023 11:05 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru;
> mb@smartsharesystems.com; nd <nd@arm.com>
> Subject: Re: [PATCH v5 0/3] Recycle buffers from Tx to Rx
> 
> On Thu, 30 Mar 2023 14:29:36 +0800
> Feifei Wang <feifei.wang2@arm.com> wrote:
> 
> > Currently, the transmit side frees the buffers into the lcore cache
> > and the receive side allocates buffers from the lcore cache. The
> > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > stores to lcore cache. The receive side allocates 32 buffers and
> > stores them in the receive side software ring, resulting in 32*8=256B
> > of stores and 256B of load from the lcore cache.
> >
> > This patch proposes a mechanism to avoid freeing to/allocating from
> > the lcore cache. i.e. the receive side will free the buffers from
> > transmit side directly into its software ring. This will avoid the
> > 256B of loads and stores introduced by the lcore cache. It also frees
> > up the cache lines used by the lcore cache. And we can call this mode
> > as buffer recycle mode.
> 
> 
> My naive reading of this is that lcore cache update is slow on ARM so you are
> introducing yet another cache. Perhaps a better solution would be to figure
> out/optimize the lcore cache to work better.

From my point of view, 'recycle buffer' is a general optimization: it reduces the number
of operations per buffer. Not only Arm, but also x86 and other architectures can benefit
from this. For example, the cover letter test results show a performance improvement on
the x86 SSE path.

> 
> Adding another layer of abstraction is not going to help everyone and the
> implementation you chose requires modifications to drivers to get it to work.
> 
We did not change the original driver mechanism. Recycle buffer can be seen as an
optional feature for a PMD: if the user needs higher performance, he/she can choose to
call the API in the application to enable it.
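As a rough illustration of what that looks like in an application poll loop
(port/queue ids and the burst size are hypothetical; API names are the ones
proposed in this series):

	struct rte_eth_rxq_buf_recycle_info info;
	struct rte_mbuf *pkts[32];
	uint16_t nb;

	/* Query the Rx buffer ring layout once, outside the loop. */
	rte_eth_rx_queue_buf_recycle_info_get(rx_port, rx_queue, &info);

	for (;;) {
		/* Recycle used Tx buffers straight into the Rx ring. */
		rte_eth_dev_buf_recycle(rx_port, rx_queue,
				tx_port, tx_queue, &info);
		nb = rte_eth_rx_burst(rx_port, rx_queue, pkts, 32);
		/* ... packet processing ... */
		rte_eth_tx_burst(tx_port, tx_queue, pkts, nb);
	}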

> In current form, this is not acceptable.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
  2023-03-30  6:29   ` [PATCH v5 1/3] ethdev: add API for buffer recycle mode Feifei Wang
  2023-03-30  7:19     ` Morten Brørup
@ 2023-04-19 14:46     ` Ferruh Yigit
  2023-04-26  7:29       ` Feifei Wang
  1 sibling, 1 reply; 67+ messages in thread
From: Ferruh Yigit @ 2023-04-19 14:46 UTC (permalink / raw)
  To: Feifei Wang, Thomas Monjalon, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, mb, nd, Honnappa Nagarahalli, Ruifeng Wang

On 3/30/2023 7:29 AM, Feifei Wang wrote:
> There are 4 upper APIs for buffer recycle mode:
> 1. 'rte_eth_rx_queue_buf_recycle_info_get'
> This is to retrieve buffer ring information about a given port's Rx
> queue in buffer recycle mode. Due to this, buffer recycle is no
> longer limited to the same driver in Rx and Tx.
> 
> 2. 'rte_eth_dev_buf_recycle'
> Users can call this API to enable buffer recycle mode in data path.
> There are 2 internal APIs in it, one for Rx and one for Tx.
> 

Overall, can we have a namespace in the functions related to buffer recycle,
to clarify API usage? Something like (just putting this as a sample to
clarify my point):

rte_eth_recycle_buf
rte_eth_recycle_tx_buf_stash
rte_eth_recycle_rx_descriptors_refill
rte_eth_recycle_rx_queue_info_get


> 3. 'rte_eth_tx_buf_stash'
> Internal API for buffer recycle mode. This is to stash Tx used
> buffers into Rx buffer ring.
> 

This API is to move/recycle descriptors from Tx queue to Rx queue, but
name on its own, 'rte_eth_tx_buf_stash', reads like we are stashing
something to Tx queue. What do you think, can naming be improved?

> 4. 'rte_eth_rx_descriptors_refill'
> Internal API for buffer recycle mode. This is to refill Rx
> descriptors.
> 
> Above all APIs are just implemented at the upper level.
> For different APIs, we need to define specific functions separately.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

<...>

>  
> +int
> +rte_eth_rx_queue_buf_recycle_info_get(uint16_t port_id, uint16_t queue_id,
> +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (queue_id >= dev->data->nb_rx_queues) {
> +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
> +		return -EINVAL;
> +	}
> +
> +	RTE_ASSERT(rxq_buf_recycle_info != NULL);
> +

This is a slow path API, so I think it is better to validate the parameter and
return an error instead of using assert().
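Something along these lines, for example (sketch only, reusing the log style
already used elsewhere in this function):

	if (rxq_buf_recycle_info == NULL) {
		RTE_ETHDEV_LOG(ERR,
			"Cannot get Rx buffer recycle info to NULL for port %u queue %u\n",
			port_id, queue_id);
		return -EINVAL;
	}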

<...>

> --- a/lib/ethdev/rte_ethdev_core.h
> +++ b/lib/ethdev/rte_ethdev_core.h
> @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
>  /** @internal Check the status of a Tx descriptor */
>  typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
>  
> +/** @internal Stash Tx used buffers into RX ring in buffer recycle mode */
> +typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
> +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
> +
> +/** @internal Refill Rx descriptors in buffer recycle mode */
> +typedef uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
> +

Since there is only a single API exposed to the application, is it really
required to have two dev_ops? Why not have a single one?
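For example, a single callback could take both queue pointers (hypothetical
signature, just to illustrate the question):

/** @internal Recycle used Tx buffers into the Rx ring and refill descriptors */
typedef uint16_t (*eth_buf_recycle_t)(void *txq, void *rxq,
		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);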


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v5 3/3] net/ixgbe: implement recycle buffer mode
  2023-03-30  6:29   ` [PATCH v5 3/3] net/ixgbe: " Feifei Wang
@ 2023-04-19 14:46     ` Ferruh Yigit
  2023-04-26  7:36       ` Feifei Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Ferruh Yigit @ 2023-04-19 14:46 UTC (permalink / raw)
  To: Feifei Wang, Qiming Yang, Wenjun Wu
  Cc: dev, konstantin.v.ananyev, mb, nd, Honnappa Nagarahalli, Ruifeng Wang

On 3/30/2023 7:29 AM, Feifei Wang wrote:
> Define specific function implementation for ixgbe driver.
> Currently, recycle buffer mode can support 128bit
> vector path. And can be enabled both in fast free and
> no fast free mode.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
>  drivers/net/ixgbe/ixgbe_ethdev.h |   3 +
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 153 +++++++++++++++++++++++++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.h   |   4 +
>  4 files changed, 161 insertions(+)
> 

What do you think about extracting the buf_recycle related code in drivers into
its own file? This may make the maintainership of the code easier to manage.

<...>

> +uint16_t
> +ixgbe_tx_buf_stash_vec(void *tx_queue,
> +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
> +{
> +	struct ixgbe_tx_queue *txq = tx_queue;
> +	struct ixgbe_tx_entry *txep;
> +	struct rte_mbuf **rxep;
> +	struct rte_mbuf *m[RTE_IXGBE_TX_MAX_FREE_BUF_SZ];
> +	int i, j, n;
> +	uint32_t status;
> +	uint16_t avail = 0;
> +	uint16_t buf_ring_size = rxq_buf_recycle_info->buf_ring_size;
> +	uint16_t mask = rxq_buf_recycle_info->buf_ring_size - 1;
> +	uint16_t refill_request = rxq_buf_recycle_info->refill_request;
> +	uint16_t refill_head = *rxq_buf_recycle_info->refill_head;
> +	uint16_t receive_tail = *rxq_buf_recycle_info->receive_tail;
> +
> +	/* Get available recycling Rx buffers. */
> +	avail = (buf_ring_size - (refill_head - receive_tail)) & mask;
> +
> +	/* Check Tx free thresh and Rx available space. */
> +	if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
> +		return 0;
> +
> +	/* check DD bits on threshold descriptor */
> +	status = txq->tx_ring[txq->tx_next_dd].wb.status;
> +	if (!(status & IXGBE_ADVTXD_STAT_DD))
> +		return 0;
> +
> +	n = txq->tx_rs_thresh;
> +
> +	/* Buffer recycle can only support no ring buffer wraparound.
> +	 * Two case for this:
> +	 *
> +	 * case 1: The refill head of Rx buffer ring needs to be aligned with
> +	 * buffer ring size. In this case, the number of Tx freeing buffers
> +	 * should be equal to refill_request.
> +	 *
> +	 * case 2: The refill head of Rx ring buffer does not need to be aligned
> +	 * with buffer ring size. In this case, the update of refill head can not
> +	 * exceed the Rx buffer ring size.
> +	 */
> +	if (refill_request != n ||
> +		(!refill_request && (refill_head + n > buf_ring_size)))
> +		return 0;
> +
> +	/* First buffer to free from S/W ring is at index
> +	 * tx_next_dd - (tx_rs_thresh-1).
> +	 */
> +	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> +	rxep = rxq_buf_recycle_info->buf_ring;
> +	rxep += refill_head;
> +
> +	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> +		/* Directly put mbufs from Tx to Rx. */
> +		for (i = 0; i < n; i++, rxep++, txep++)
> +			*rxep = txep[0].mbuf;
> +	} else {
> +		for (i = 0, j = 0; i < n; i++) {
> +			/* Avoid txq contains buffers from expected mempoo. */

mempool (unless trying to introduce a new concept :)

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v5 0/3] Recycle buffers from Tx to Rx
  2023-03-30  6:29 ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Feifei Wang
                     ` (3 preceding siblings ...)
  2023-03-30 15:04   ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Stephen Hemminger
@ 2023-04-19 14:56   ` Ferruh Yigit
  2023-04-25  7:57     ` Feifei Wang
  4 siblings, 1 reply; 67+ messages in thread
From: Ferruh Yigit @ 2023-04-19 14:56 UTC (permalink / raw)
  To: Feifei Wang, Qi Z Zhang, Mcnamara, John; +Cc: dev, konstantin.v.ananyev, mb, nd

On 3/30/2023 7:29 AM, Feifei Wang wrote:
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache. And we can call this mode as buffer
> recycle mode.
> 
> In the latest version, buffer recycle mode is packaged as a separate API. 
> This allows for the users to change rxq/txq pairing in real time in data plane,
> according to the analysis of the packet flow by the application, for example:
> -----------------------------------------------------------------------
> Step 1: upper application analyse the flow direction
> Step 2: rxq_buf_recycle_info = rte_eth_rx_buf_recycle_info_get(rx_portid, rx_queueid)
> Step 3: rte_eth_dev_buf_recycle(rx_portid, rx_queueid, tx_portid, tx_queueid, rxq_buf_recycle_info);
> Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
> Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
> -----------------------------------------------------------------------
> Above can support user to change rxq/txq pairing  at runtime and user does not need to
> know the direction of flow in advance. This can effectively expand buffer recycle mode's
> use scenarios.
> 
> Furthermore, buffer recycle mode is no longer limited to the same pmd,
> it can support moving buffers between different vendor pmds, even can put the buffer
> anywhere into your Rx buffer ring as long as the address of the buffer ring can be provided.
> In the latest version, we enable direct-rearm in i40e pmd and ixgbe pmd, and also try to
> use i40e driver in Rx, ixgbe driver in Tx, and then achieve 7-9% performance improvement
> by buffer recycle mode.
> 
> Difference between buffer recycle, ZC API used in mempool and general path
> For general path: 
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
> For ZC API used in mempool:
>                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
>                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
>                 Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
> For buffer recycle:
>                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
> Thus we can see in the one loop, compared to general path, buffer recycle reduce 32+32=64 pkts memcpy;
> Compared to ZC API used in mempool, we can see buffer recycle reduce 32 pkts memcpy in each loop.
> So, buffer recycle has its own benefits.
> 
> Testing status:
> (1) dpdk l3fwd test with multiple drivers:
>     port 0: 82599 NIC   port 1: XL710 NIC
> -------------------------------------------------------------
> 		Without fast free	With fast free
> Thunderx2:      +7.53%	                +13.54%
> -------------------------------------------------------------
> 
> (2) dpdk l3fwd test with same driver:
>     port 0 && 1: XL710 NIC
> -------------------------------------------------------------
> 		Without fast free	With fast free
> Ampere altra:   +12.61%		        +11.42%
> n1sdp:		+8.30%			+3.85%
> x86-sse:	+8.43%			+3.72%
> -------------------------------------------------------------
> 
> (3) Performance comparison with ZC_mempool used
>     port 0 && 1: XL710 NIC
>     with fast free
> -------------------------------------------------------------
> 		With recycle buffer	With zc_mempool
> Ampere altra:	11.42%			3.54%
> -------------------------------------------------------------
> 

Thanks for the perf test reports.

Since test is done on Intel NICs, it would be great to get some testing
and performance numbers from Intel side too, if possible.

> V2:
> 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
> 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
> 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
> 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
> 
> V3:
> 1. Seperate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
> 2. Delete L3fwd change for direct rearm (Jerin)
> 3. enable direct rearm in ixgbe driver in Arm
> 
> v4:
> 1. Rename direct-rearm as buffer recycle. Based on this, function name
> and variable name are changed to let this mode more general for all
> drivers. (Konstantin, Morten)
> 2. Add ring wrapping check (Konstantin)
> 
> v5:
> 1. some change for ethdev API (Morten)
> 2. add support for avx2, sse, altivec path
> 
> Feifei Wang (3):
>   ethdev: add API for buffer recycle mode
>   net/i40e: implement recycle buffer mode
>   net/ixgbe: implement recycle buffer mode
> 
>  drivers/net/i40e/i40e_ethdev.c   |   1 +
>  drivers/net/i40e/i40e_ethdev.h   |   2 +
>  drivers/net/i40e/i40e_rxtx.c     | 159 +++++++++++++++++++++
>  drivers/net/i40e/i40e_rxtx.h     |   4 +
>  drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
>  drivers/net/ixgbe/ixgbe_ethdev.h |   3 +
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 153 ++++++++++++++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.h   |   4 +
>  lib/ethdev/ethdev_driver.h       |  10 ++
>  lib/ethdev/ethdev_private.c      |   2 +
>  lib/ethdev/rte_ethdev.c          |  33 +++++
>  lib/ethdev/rte_ethdev.h          | 230 +++++++++++++++++++++++++++++++
>  lib/ethdev/rte_ethdev_core.h     |  15 +-
>  lib/ethdev/version.map           |   6 +
>  14 files changed, 621 insertions(+), 2 deletions(-)
> 

Is a usage sample of these new APIs planned? Can it be a new forwarding
mode in testpmd?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v5 0/3] Recycle buffers from Tx to Rx
  2023-04-19 14:56   ` Ferruh Yigit
@ 2023-04-25  7:57     ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-04-25  7:57 UTC (permalink / raw)
  To: Ferruh Yigit, Qi Z Zhang, Mcnamara, John
  Cc: dev, konstantin.v.ananyev, mb, nd, nd



> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Wednesday, April 19, 2023 10:56 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Qi Z Zhang
> <qi.z.zhang@intel.com>; Mcnamara, John <john.mcnamara@intel.com>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru;
> mb@smartsharesystems.com; nd <nd@arm.com>
> Subject: Re: [PATCH v5 0/3] Recycle buffers from Tx to Rx
> 
> On 3/30/2023 7:29 AM, Feifei Wang wrote:
> > Currently, the transmit side frees the buffers into the lcore cache
> > and the receive side allocates buffers from the lcore cache. The
> > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > stores to lcore cache. The receive side allocates 32 buffers and
> > stores them in the receive side software ring, resulting in 32*8=256B
> > of stores and 256B of load from the lcore cache.
> >
> > This patch proposes a mechanism to avoid freeing to/allocating from
> > the lcore cache. i.e. the receive side will free the buffers from
> > transmit side directly into its software ring. This will avoid the
> > 256B of loads and stores introduced by the lcore cache. It also frees
> > up the cache lines used by the lcore cache. And we can call this mode
> > as buffer recycle mode.
> >
> > In the latest version, buffer recycle mode is packaged as a separate API.
> > This allows for the users to change rxq/txq pairing in real time in
> > data plane, according to the analysis of the packet flow by the application,
> for example:
> > ----------------------------------------------------------------------
> > - Step 1: upper application analyse the flow direction Step 2:
> > rxq_buf_recycle_info = rte_eth_rx_buf_recycle_info_get(rx_portid,
> > rx_queueid) Step 3: rte_eth_dev_buf_recycle(rx_portid, rx_queueid,
> > tx_portid, tx_queueid, rxq_buf_recycle_info); Step 4:
> > rte_eth_rx_burst(rx_portid,rx_queueid);
> > Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
> > ----------------------------------------------------------------------
> > - Above can support user to change rxq/txq pairing  at runtime and
> > user does not need to know the direction of flow in advance. This can
> > effectively expand buffer recycle mode's use scenarios.
> >
> > Furthermore, buffer recycle mode is no longer limited to the same pmd,
> > it can support moving buffers between different vendor pmds, even can
> > put the buffer anywhere into your Rx buffer ring as long as the address of the
> buffer ring can be provided.
> > In the latest version, we enable direct-rearm in i40e pmd and ixgbe
> > pmd, and also try to use i40e driver in Rx, ixgbe driver in Tx, and
> > then achieve 7-9% performance improvement by buffer recycle mode.
> >
> > Difference between buffer recycle, ZC API used in mempool and general
> > path For general path:
> >                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
> >                 Tx: 32 pkts memcpy from tx_sw_ring to temporary
> > variable + 32 pkts memcpy from temporary variable to mempool cache For
> ZC API used in mempool:
> >                 Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
> >                 Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
> >                 Refer link:
> > http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-
> kama
> > lakshitha.aligeri@arm.com/
> > For buffer recycle:
> >                 Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
> > Thus we can see in the one loop, compared to general path, buffer
> > recycle reduce 32+32=64 pkts memcpy; Compared to ZC API used in
> mempool, we can see buffer recycle reduce 32 pkts memcpy in each loop.
> > So, buffer recycle has its own benefits.
> >
> > Testing status:
> > (1) dpdk l3fwd test with multiple drivers:
> >     port 0: 82599 NIC   port 1: XL710 NIC
> > -------------------------------------------------------------
> > 		Without fast free	With fast free
> > Thunderx2:      +7.53%	                +13.54%
> > -------------------------------------------------------------
> >
> > (2) dpdk l3fwd test with same driver:
> >     port 0 && 1: XL710 NIC
> > -------------------------------------------------------------
> > 		Without fast free	With fast free
> > Ampere altra:   +12.61%		        +11.42%
> > n1sdp:		+8.30%			+3.85%
> > x86-sse:	+8.43%			+3.72%
> > -------------------------------------------------------------
> >
> > (3) Performance comparison with ZC_mempool used
> >     port 0 && 1: XL710 NIC
> >     with fast free
> > -------------------------------------------------------------
> > 		With recycle buffer	With zc_mempool
> > Ampere altra:	11.42%			3.54%
> > -------------------------------------------------------------
> >
> 
> Thanks for the perf test reports.
> 
> Since test is done on Intel NICs, it would be great to get some testing and
> performance numbers from Intel side too, if possible.

Thanks for the review.
Actually, we have done the test on x86. The performance numbers above show that
on the x86 SSE path, buffer recycle can improve performance by 3.72% ~ 8.43%.

> 
> > V2:
> > 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
> > 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
> > 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
> > 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
> >
> > V3:
> > 1. Separate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
> > 2. Delete L3fwd change for direct rearm (Jerin)
> > 3. Enable direct rearm in ixgbe driver in Arm
> >
> > v4:
> > 1. Rename direct-rearm as buffer recycle. Based on this, function name
> >    and variable name are changed to make this mode more general for all
> >    drivers. (Konstantin, Morten)
> > 2. Add ring wrapping check (Konstantin)
> >
> > v5:
> > 1. Some changes for ethdev API (Morten)
> > 2. Add support for avx2, sse, altivec path
> >
> > Feifei Wang (3):
> >   ethdev: add API for buffer recycle mode
> >   net/i40e: implement recycle buffer mode
> >   net/ixgbe: implement recycle buffer mode
> >
> >  drivers/net/i40e/i40e_ethdev.c   |   1 +
> >  drivers/net/i40e/i40e_ethdev.h   |   2 +
> >  drivers/net/i40e/i40e_rxtx.c     | 159 +++++++++++++++++++++
> >  drivers/net/i40e/i40e_rxtx.h     |   4 +
> >  drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
> >  drivers/net/ixgbe/ixgbe_ethdev.h |   3 +
> >  drivers/net/ixgbe/ixgbe_rxtx.c   | 153 ++++++++++++++++++++
> >  drivers/net/ixgbe/ixgbe_rxtx.h   |   4 +
> >  lib/ethdev/ethdev_driver.h       |  10 ++
> >  lib/ethdev/ethdev_private.c      |   2 +
> >  lib/ethdev/rte_ethdev.c          |  33 +++++
> >  lib/ethdev/rte_ethdev.h          | 230
> +++++++++++++++++++++++++++++++
> >  lib/ethdev/rte_ethdev_core.h     |  15 +-
> >  lib/ethdev/version.map           |   6 +
> >  14 files changed, 621 insertions(+), 2 deletions(-)
> >
> 
> Is usage sample of these new APIs planned? Can it be a new forwarding mode
> in testpmd?

Agree. Following the discussion in the Tech Board meeting, we will add a buffer recycle fwd engine to testpmd.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
  2023-03-30 15:58         ` Morten Brørup
@ 2023-04-26  6:59           ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-04-26  6:59 UTC (permalink / raw)
  To: Morten Brørup, thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli,
	Ruifeng Wang, nd, nd



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Thursday, March 30, 2023 11:59 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net; Ferruh
> Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
> 
> > From: Morten Brørup
> > Sent: Thursday, 30 March 2023 17.15
> >
> > > From: Feifei Wang [mailto:Feifei.Wang2@arm.com]
> > > Sent: Thursday, 30 March 2023 11.31
> > >
> > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > Sent: Thursday, March 30, 2023 3:19 PM
> > > >
> > > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > > Sent: Thursday, 30 March 2023 08.30
> > > > >
> > > >
> > > > [...]
> > > >
> > > > > +/**
> > > > > + * @internal
> > > > > + * Rx routine for rte_eth_dev_buf_recycle().
> > > > > + * Refill Rx descriptors in buffer recycle mode.
> > > > > + *
> > > > > + * @note
> > > > > + * This API can only be called by rte_eth_dev_buf_recycle().
> > > > > + * Before calling this API, rte_eth_tx_buf_stash() should be
> > > > > + * called to stash Tx used buffers into Rx buffer ring.
> > > > > + *
> > > > > + * When this functionality is not implemented in the driver,
> > > > > +the return
> > > > > + * buffer number is 0.
> > > > > + *
> > > > > + * @param port_id
> > > > > + *   The port identifier of the Ethernet device.
> > > > > + * @param queue_id
> > > > > + *   The index of the receive queue.
> > > > > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> > > > supplied
> > > > > + *   to rte_eth_dev_configure().
> > > > > + *@param nb
> > > > > + *   The number of Rx descriptors to be refilled.
> > > > > + * @return
> > > > > + *   The number Rx descriptors correct to be refilled.
> > > > > + *   - ENODEV: bad port or queue (only if compiled with debug).
> > > >
> > > > If you want errors reported by the return value, the function
> > > > return type cannot be uint16_t.
> > > Agree. Actually, in the code path, if errors happen, the function
> > > will return 0.
> > > For this description line, I refer to the 'rte_eth_tx_prepare' notes.
> > > Maybe we should delete this line.
> > >
> > > >
> > > > > + */
> > > > > +static inline uint16_t rte_eth_rx_descriptors_refill(uint16_t port_id,
> > > > > +		uint16_t queue_id, uint16_t nb) {
> > > > > +	struct rte_eth_fp_ops *p;
> > > > > +	void *qd;
> > > > > +
> > > > > +#ifdef RTE_ETHDEV_DEBUG_RX
> > > > > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > > > > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > > > > +		RTE_ETHDEV_LOG(ERR,
> > > > > +			"Invalid port_id=%u or queue_id=%u\n",
> > > > > +			port_id, queue_id);
> > > > > +		rte_errno = ENODEV;
> > > > > +		return 0;
> > > >
> > > > If p->rx_descriptors_refill() is likely to return 0, this function
> > > > should
> > > not use 0
> > > > as return value to indicate errors.
> > > However, referring to the DPDK code style in ethdev, most APIs are written
> > > like this. For example, 'rte_eth_rx/tx_burst', 'rte_eth_tx_prep'.
> > >
> > > I'm also unsure what the return type should be, because I want to both
> > > indicate errors and report the number of processed buffers.
> >
> > OK. Thanks for the references.
> >
> > Looking at rte_eth_rx/tx_burst(), you could follow the same
> > conventions here,
> > i.e.:
> > - Use uint16_t as return type.
> > - Return 0 on error.
> > - Do not set rte_errno.
> > - Remove the "ENODEV" line from the @return description.
> > - Use RTE_ETHDEV_LOG(ERR,...) as the only method to indicate errors.
> >
> > I now see that you follow the convention of rte_eth_tx_prepare(). This
> > is also perfectly fine; then you just need to update the description
> > of @return to mention that the error value is set in rte_errno if a
> > value less than 'nb' is returned.
> 
> After further consideration, I have changed my mind:
> 
> The primary purpose of rte_eth_tx_prepare() is to test if a packet burst is valid,
> so the ability to return an error value is a natural requirement.
> 
> This is not the purpose your functions. The purpose of your functions
> resemble rte_eth_rx/tx_burst(), where there is no requirement to return an
> error value. So you should follow the convention of rte_eth_rx/tx_burst(), as I
> just suggested.

Agree.
> 
> >
> > >
> > > >
> > > > > +	}
> > > > > +#endif
> > > > > +
> > > > > +	p = &rte_eth_fp_ops[port_id];
> > > > > +	qd = p->rxq.data[queue_id];
> > > > > +
> > > > > +#ifdef RTE_ETHDEV_DEBUG_RX
> > > > > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > > > > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx port_id=%u\n",
> port_id);
> > > > > +		rte_errno = ENODEV;
> > > > > +		return 0;
> > > > > +	}
> > > > > +
> > > > > +	if (qd == NULL) {
> > > > > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for
> > > > port_id=%u\n",
> > > > > +			queue_id, port_id);
> > > > > +		rte_errno = ENODEV;
> > > > > +		return 0;
> > > > > +	}
> > > > > +#endif
> > > > > +
> > > > > +	if (p->rx_descriptors_refill == NULL)
> > > > > +		return 0;
> > > > > +
> > > > > +	return p->rx_descriptors_refill(qd, nb); }
> >
> > When does p->rx_descriptors_refill() return anything else than 'nb'?
> >
> > If p->rx_descriptors_refill() always succeeds (and thus always returns
> > 'nb'), you could make its return type void. And thus, you could also
> > make the return type of rte_eth_rx_descriptors_refill() void.
> >
> > > > > +
> > > > >  /**@{@name Rx hardware descriptor states
> > > > >   * @see rte_eth_rx_descriptor_status
> > > > >   */
> > > > > @@ -6483,6 +6597,122 @@ rte_eth_tx_buffer(uint16_t port_id,
> > > > > uint16_t
> > > > queue_id,
> > > > >  	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);  }
> > > > >
> > > > > +/**
> > > > > + * @internal
> > > > > + * Tx routine for rte_eth_dev_buf_recycle().
> > > > > + * Stash Tx used buffers into Rx buffer ring in buffer recycle mode.
> > > > > + *
> > > > > + * @note
> > > > > + * This API can only be called by rte_eth_dev_buf_recycle().
> > > > > + * After calling this API, rte_eth_rx_descriptors_refill()
> > > > > +should be
> > > > > + * called to refill Rx ring descriptors.
> > > > > + *
> > > > > + * When this functionality is not implemented in the driver,
> > > > > +the return
> > > > > + * buffer number is 0.
> > > > > + *
> > > > > + * @param port_id
> > > > > + *   The port identifier of the Ethernet device.
> > > > > + * @param queue_id
> > > > > + *   The index of the transmit queue.
> > > > > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> > > > supplied
> > > > > + *   to rte_eth_dev_configure().
> > > > > + * @param rxq_buf_recycle_info
> > > > > + *   A pointer to a structure of Rx queue buffer ring information in
> > > buffer
> > > > > + *   recycle mode.
> > > > > + *
> > > > > + * @return
> > > > > + *   The number buffers correct to be filled in the Rx buffer ring.
> > > > > + *   - ENODEV: bad port or queue (only if compiled with debug).
> > > >
> > > > If you want errors reported by the return value, the function
> > > > return type cannot be uint16_t.
> >
> > I now see that you follow the convention of rte_eth_tx_prepare() here too.
> >
> > This is perfectly fine; then you just need to update the description
> > of @return to mention that the error value is set in rte_errno if a
> > value less than 'nb' is returned.
> 
> Same comment about conventions as above: This function is more like
> rte_eth_rx/tx_burst() than rte_eth_tx_prepare(), so follow the convention of
> rte_eth_rx/tx_burst() instead.
> 
> >
> > > >
> > > > > + */
> > > > > +static inline uint16_t rte_eth_tx_buf_stash(uint16_t port_id,
> > > > > +uint16_t
> > > > > queue_id,
> > > > > +		struct rte_eth_rxq_buf_recycle_info
> *rxq_buf_recycle_info)
> > > > {
> > > > > +	struct rte_eth_fp_ops *p;
> > > > > +	void *qd;
> > > > > +
> > > > > +#ifdef RTE_ETHDEV_DEBUG_TX
> > > > > +	if (port_id >= RTE_MAX_ETHPORTS ||
> > > > > +			queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > > > > +		RTE_ETHDEV_LOG(ERR,
> > > > > +			"Invalid port_id=%u or queue_id=%u\n",
> > > > > +			port_id, queue_id);
> > > > > +		rte_errno = ENODEV;
> > > > > +		return 0;
> > > >
> > > > If p->tx_buf_stash() is likely to return 0, this function should
> > > > not use 0
> > > as
> > > > return value to indicate errors.
> >
> > I now see that you follow the convention of rte_eth_tx_prepare() here too.
> > Then please ignore my comment about using 0 as return value on errors
> > for this function.
> 
> Same comment about conventions as above: This function is more like
> rte_eth_rx/tx_burst() than rte_eth_tx_prepare(), so follow the convention of
> rte_eth_rx/tx_burst() instead.
> 
> >
> > > >
> > > > > +	}
> > > > > +#endif
> > > > > +
> > > > > +	p = &rte_eth_fp_ops[port_id];
> > > > > +	qd = p->txq.data[queue_id];
> > > > > +
> > > > > +#ifdef RTE_ETHDEV_DEBUG_TX
> > > > > +	if (!rte_eth_dev_is_valid_port(port_id)) {
> > > > > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx port_id=%u\n",
> port_id);
> > > > > +		rte_errno = ENODEV;
> > > > > +		return 0;
> > > > > +	}
> > > > > +
> > > > > +	if (qd == NULL) {
> > > > > +		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for
> > > > port_id=%u\n",
> > > > > +			queue_id, port_id);
> > > > > +		rte_errno = ENODEV;
> > > > > +		return 0;
> > > > > +	}
> > > > > +#endif
> > > > > +
> > > > > +	if (p->tx_buf_stash == NULL)
> > > > > +		return 0;
> > > > > +
> > > > > +	return p->tx_buf_stash(qd, rxq_buf_recycle_info); }
> > > > > +
> > > > > +/**
> > > > > + * @warning
> > > > > + * @b EXPERIMENTAL: this API may change, or be removed, without
> > > > > +prior notice
> > > > > + *
> > > > > + * Buffer recycle mode can let Tx queue directly put used
> > > > > +buffers into Rx
> > > > > buffer
> > > > > + * ring. This avoids freeing buffers into mempool and
> > > > > + allocating buffers from
> > > > > + * mempool.
> > > >
> > > > This function description generally describes the buffer recycle mode.
> > > >
> > > > Please update it to describe what this specific function does.
> > > Ack.
> > > >
> > > > > + *
> > > > > + * @param rx_port_id
> > > > > + *   Port identifying the receive side.
> > > > > + * @param rx_queue_id
> > > > > + *   The index of the receive queue identifying the receive side.
> > > > > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> > > > supplied
> > > > > + *   to rte_eth_dev_configure().
> > > > > + * @param tx_port_id
> > > > > + *   Port identifying the transmit side.
> > > > > + * @param tx_queue_id
> > > > > + *   The index of the transmit queue identifying the transmit side.
> > > > > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> > > > supplied
> > > > > + *   to rte_eth_dev_configure().
> > > > > + * @param rxq_recycle_info
> > > > > + *   A pointer to a structure of type *rte_eth_txq_rearm_data* to be
> > > filled.
> > > > > + * @return
> > > > > + *   - (0) on success or no recycling buffer.
> > > >
> > > > Why not return the return value of rte_eth_rx_descriptors_refill()
> > > > instead
> > > of
> > > > 0 on success? (This is a question, not a suggestion.)
> > > >
> > > > Or, if rxq_buf_recycle_info must be valid, the function return
> > > > type could
> > be
> > > > void instead of int.
> > > >
> > > Sometimes, users may forget to allocate the room for 'rxq_buf_recycle_info'
> > > and to call the 'rte_rxq_buf_recycle_info_get' API. Thus, I think we need
> > > to check for this.
> >
> > If the user forgets to allocate the rxq_buf_recycle_info, it is a
> > serious bug in the user's application.
> >
> > We don't need to handle such bugs at runtime.
> >
> > >
> > > Furthermore, the return value of this API should indicate success or failure.
> >
> > If rxq_buf_recycle_info is not NULL, this function will always
> > succeed. So there is no need for a return value.
> >
> > If you want this function to return something, it could return nb_buf,
> > so the application can it use for telemetry purposes or similar.
> >
> > >
> > > > > + *   - (-EINVAL) rxq_recycle_info is NULL.
> > > > > + */
> > > > > +__rte_experimental
> > > > > +static inline int
> > > > > +rte_eth_dev_buf_recycle(uint16_t rx_port_id, uint16_t rx_queue_id,
> > > > > +		uint16_t tx_port_id, uint16_t tx_queue_id,
> > > > > +		struct rte_eth_rxq_buf_recycle_info
> *rxq_buf_recycle_info)
> > > > {
> > > > > +	/* The number of recycling buffers. */
> > > > > +	uint16_t nb_buf;
> > > > > +
> > > > > +	if (!rxq_buf_recycle_info)
> > > > > +		return -EINVAL;
> > > >
> > > > This is a fast path function. In which situation is this function
> > > > called
> > > with
> > > > rxq_buf_recycle_info == NULL?
> > > >
> > > > If this function can genuinely be called with rxq_buf_recycle_info
> > > > ==
> > NULL,
> > > > you should test for (rxq_buf_recycle_info == NULL), not (!
> > > > rxq_buf_recycle_info). Otherwise, I think
> > > > RTE_ASSERT(rxq_buf_recycle_info != NULL) is more appropriate.
> > > Agree. We should use ' RTE_ASSERT(rxq_buf_recycle_info != NULL)'.
> > > >
> > > > > +
> > > > > +	/* Stash Tx used buffers into Rx buffer ring */
> > > > > +	nb_buf = rte_eth_tx_buf_stash(tx_port_id, tx_queue_id,
> > > > > +				rxq_buf_recycle_info);
> > > > > +	/* If there are recycling buffers, refill Rx queue descriptors.
> > */
> > > > > +	if (nb_buf)
> > > > > +		rte_eth_rx_descriptors_refill(rx_port_id, rx_queue_id,
> > > > > +					nb_buf);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * @warning
> > > > >   * @b EXPERIMENTAL: this API may change without prior notice
> > > > > diff --git a/lib/ethdev/rte_ethdev_core.h
> > > > > b/lib/ethdev/rte_ethdev_core.h index dcf8adab92..a138fd4dbc
> > > > > 100644
> > > > > --- a/lib/ethdev/rte_ethdev_core.h
> > > > > +++ b/lib/ethdev/rte_ethdev_core.h
> > > > > @@ -56,6 +56,13 @@ typedef int
> > > > > (*eth_rx_descriptor_status_t)(void
> > > > > *rxq, uint16_t offset);
> > > > >  /** @internal Check the status of a Tx descriptor */  typedef
> > > > > int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> > > > >
> > > > > +/** @internal Stash Tx used buffers into RX ring in buffer
> > > > > +recycle mode */ typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
> > > > > +		struct rte_eth_rxq_buf_recycle_info
> > *rxq_buf_recycle_info);
> > > > > +
> > > > > +/** @internal Refill Rx descriptors in buffer recycle mode */
> > > > > +typedef uint16_t (*eth_rx_descriptors_refill_t)(void *rxq,
> > > > > +uint16_t nb);
> > > >
> > > > Please add proper function descriptions for the two callbacks above.
> > > Ack.
> > > >
> > > > > +
> > > > >  /**
> > > > >   * @internal
> > > > >   * Structure used to hold opaque pointers to internal ethdev
> > > > > Rx/Tx @@
> > > > > -90,9 +97,11 @@ struct rte_eth_fp_ops {
> > > > >  	eth_rx_queue_count_t rx_queue_count;
> > > > >  	/** Check the status of a Rx descriptor. */
> > > > >  	eth_rx_descriptor_status_t rx_descriptor_status;
> > > > > +	/** Refill Rx descriptors in buffer recycle mode */
> > > > > +	eth_rx_descriptors_refill_t rx_descriptors_refill;
> > > > >  	/** Rx queues data. */
> > > > >  	struct rte_ethdev_qdata rxq;
> > > > > -	uintptr_t reserved1[3];
> > > > > +	uintptr_t reserved1[4];
> > > >
> > > > You added a function pointer above, so to keep the structure
> > > > alignment,
> > you
> > > > must remove one here, not add one:
> > > >
> > > > -	uintptr_t reserved1[3];
> > > > +	uintptr_t reserved1[2];
> > > >
> > > Ack.
> > > > >  	/**@}*/
> > > > >
> > > > >  	/**@{*/
> > > > > @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
> > > > >  	eth_tx_prep_t tx_pkt_prepare;
> > > > >  	/** Check the status of a Tx descriptor. */
> > > > >  	eth_tx_descriptor_status_t tx_descriptor_status;
> > > > > +	/** Stash Tx used buffers into RX ring in buffer recycle mode */
> > > > > +	eth_tx_buf_stash_t tx_buf_stash;
> > > > >  	/** Tx queues data. */
> > > > >  	struct rte_ethdev_qdata txq;
> > > > > -	uintptr_t reserved2[3];
> > > > > +	uintptr_t reserved2[4];
> > > >
> > > > You added a function pointer above, so to keep the structure
> > > > alignment,
> > you
> > > > must remove one here, not add one:
> > > >
> > > > -	uintptr_t reserved1[3];
> > > > +	uintptr_t reserved1[2];
> > > >
> > > Ack.
> > > > >  	/**@}*/
> > > > >
> > > > >  } __rte_cache_aligned;


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
  2023-04-19 14:46     ` Ferruh Yigit
@ 2023-04-26  7:29       ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-04-26  7:29 UTC (permalink / raw)
  To: Ferruh Yigit, thomas, Andrew Rybchenko
  Cc: dev, konstantin.v.ananyev, mb, nd, Honnappa Nagarahalli,
	Ruifeng Wang, nd



> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Wednesday, April 19, 2023 10:46 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru;
> mb@smartsharesystems.com; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v5 1/3] ethdev: add API for buffer recycle mode
> 
> On 3/30/2023 7:29 AM, Feifei Wang wrote:
> > There are 4 upper APIs for buffer recycle mode:
> > 1. 'rte_eth_rx_queue_buf_recycle_info_get'
> > This is to retrieve buffer ring information about given ports's Rx
> > queue in buffer recycle mode. And due to this, buffer recycle can be
> > no longer limited to the same driver in Rx and Tx.
> >
> > 2. 'rte_eth_dev_buf_recycle'
> > Users can call this API to enable buffer recycle mode in data path.
> > There are 2 internal APIs in it, which is separately for Rx and TX.
> >
> 
> Overall, can we have a namespace in the functions related to the buffer
> recycle, to clarify API usage, something like (just putting as sample to clarify
> my point):
> 
> rte_eth_recycle_buf
> rte_eth_recycle_tx_buf_stash
> rte_eth_recycle_rx_descriptors_refill
> rte_eth_recycle_rx_queue_info_get
> 
Agree.

> 
> > 3. 'rte_eth_tx_buf_stash'
> > Internal API for buffer recycle mode. This is to stash Tx used buffers
> > into Rx buffer ring.
> >
> 
> This API is to move/recycle descriptors from Tx queue to Rx queue, but name
> on its own, 'rte_eth_tx_buf_stash', reads like we are stashing something to Tx
> queue. What do you think, can naming be improved?
> 
Agree with this. The new name is 'rte_eth_tx_mbuf_reuse'.

> > 4. 'rte_eth_rx_descriptors_refill'
> > Internal API for buffer recycle mode. This is to refill Rx
> > descriptors.
> >
> > Above all APIs are just implemented at the upper level.
> > For different APIs, we need to define specific functions separately.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> <...>
> 
> >
> > +int
> > +rte_eth_rx_queue_buf_recycle_info_get(uint16_t port_id, uint16_t
> queue_id,
> > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info) {
> > +	struct rte_eth_dev *dev;
> > +
> > +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> > +	dev = &rte_eth_devices[port_id];
> > +
> > +	if (queue_id >= dev->data->nb_rx_queues) {
> > +		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n",
> queue_id);
> > +		return -EINVAL;
> > +	}
> > +
> > +	RTE_ASSERT(rxq_buf_recycle_info != NULL);
> > +
> 
> This is slow path API, I think better to validate parameter and return an error
> instead of assert().

Thanks for the reminder. Here I'll delete this check, since I realize it is unnecessary to check whether users
have allocated memory for 'rxq_buf_recycle_info', which is an input parameter.

> 
> <...>
> 
> > --- a/lib/ethdev/rte_ethdev_core.h
> > +++ b/lib/ethdev/rte_ethdev_core.h
> > @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void
> > *rxq, uint16_t offset);
> >  /** @internal Check the status of a Tx descriptor */  typedef int
> > (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> >
> > +/** @internal Stash Tx used buffers into RX ring in buffer recycle
> > +mode */ typedef uint16_t (*eth_tx_buf_stash_t)(void *txq,
> > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info);
> > +
> > +/** @internal Refill Rx descriptors in buffer recycle mode */ typedef
> > +uint16_t (*eth_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
> > +
> 
> Since there is only single API exposed to the application, is it really required to
> have two dev_ops, why not have a single one?

There are two APIs because we need to separate the Rx and Tx paths. Buffer recycle can then support
the case of different PMDs, for example the Rx port being i40e and the Tx port being ixgbe.
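
To make this concrete, a rough sketch (reusing the v5 callback names from this
patch; checks omitted, and this is not code from the series) of how the two
callbacks end up paired across different drivers:

static inline uint16_t
recycle_across_pmds(uint16_t rx_port_id, uint16_t rx_queue_id,
		uint16_t tx_port_id, uint16_t tx_queue_id,
		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info)
{
	/* Callbacks are looked up per port, so the Tx side and the Rx side
	 * can belong to different PMDs, e.g. Tx on ixgbe and Rx on i40e.
	 */
	struct rte_eth_fp_ops *tx_ops = &rte_eth_fp_ops[tx_port_id];
	struct rte_eth_fp_ops *rx_ops = &rte_eth_fp_ops[rx_port_id];
	uint16_t nb;

	/* Tx driver callback: stash used mbuf pointers into the Rx buffer ring. */
	nb = tx_ops->tx_buf_stash(tx_ops->txq.data[tx_queue_id],
			rxq_buf_recycle_info);
	if (nb != 0)
		/* Rx driver callback: refill that many Rx descriptors. */
		rx_ops->rx_descriptors_refill(rx_ops->rxq.data[rx_queue_id], nb);

	return nb;
}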


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v5 3/3] net/ixgbe: implement recycle buffer mode
  2023-04-19 14:46     ` Ferruh Yigit
@ 2023-04-26  7:36       ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-04-26  7:36 UTC (permalink / raw)
  To: Ferruh Yigit, Qiming Yang, Wenjun Wu
  Cc: dev, konstantin.v.ananyev, mb, nd, Honnappa Nagarahalli,
	Ruifeng Wang, nd



> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Wednesday, April 19, 2023 10:47 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Qiming Yang
> <qiming.yang@intel.com>; Wenjun Wu <wenjun1.wu@intel.com>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru;
> mb@smartsharesystems.com; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v5 3/3] net/ixgbe: implement recycle buffer mode
> 
> On 3/30/2023 7:29 AM, Feifei Wang wrote:
> > Define specific function implementation for ixgbe driver.
> > Currently, recycle buffer mode can support 128bit vector path. And can
> > be enabled both in fast free and no fast free mode.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >  drivers/net/ixgbe/ixgbe_ethdev.c |   1 +
> >  drivers/net/ixgbe/ixgbe_ethdev.h |   3 +
> >  drivers/net/ixgbe/ixgbe_rxtx.c   | 153
> +++++++++++++++++++++++++++++++
> >  drivers/net/ixgbe/ixgbe_rxtx.h   |   4 +
> >  4 files changed, 161 insertions(+)
> >
> 
> What do you think to extract buf_recycle related code in drivers into its own
> file, this may help to manager maintainership of code easier?
Good comment, this will make the code cleaner and easier to maintain.
> 
> <...>
> 
> > +uint16_t
> > +ixgbe_tx_buf_stash_vec(void *tx_queue,
> > +		struct rte_eth_rxq_buf_recycle_info *rxq_buf_recycle_info) {
> > +	struct ixgbe_tx_queue *txq = tx_queue;
> > +	struct ixgbe_tx_entry *txep;
> > +	struct rte_mbuf **rxep;
> > +	struct rte_mbuf *m[RTE_IXGBE_TX_MAX_FREE_BUF_SZ];
> > +	int i, j, n;
> > +	uint32_t status;
> > +	uint16_t avail = 0;
> > +	uint16_t buf_ring_size = rxq_buf_recycle_info->buf_ring_size;
> > +	uint16_t mask = rxq_buf_recycle_info->buf_ring_size - 1;
> > +	uint16_t refill_request = rxq_buf_recycle_info->refill_request;
> > +	uint16_t refill_head = *rxq_buf_recycle_info->refill_head;
> > +	uint16_t receive_tail = *rxq_buf_recycle_info->receive_tail;
> > +
> > +	/* Get available recycling Rx buffers. */
> > +	avail = (buf_ring_size - (refill_head - receive_tail)) & mask;
> > +
> > +	/* Check Tx free thresh and Rx available space. */
> > +	if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
> > +		return 0;
> > +
> > +	/* check DD bits on threshold descriptor */
> > +	status = txq->tx_ring[txq->tx_next_dd].wb.status;
> > +	if (!(status & IXGBE_ADVTXD_STAT_DD))
> > +		return 0;
> > +
> > +	n = txq->tx_rs_thresh;
> > +
> > +	/* Buffer recycle can only support no ring buffer wraparound.
> > +	 * Two case for this:
> > +	 *
> > +	 * case 1: The refill head of Rx buffer ring needs to be aligned with
> > +	 * buffer ring size. In this case, the number of Tx freeing buffers
> > +	 * should be equal to refill_request.
> > +	 *
> > +	 * case 2: The refill head of Rx ring buffer does not need to be aligned
> > +	 * with buffer ring size. In this case, the update of refill head can not
> > +	 * exceed the Rx buffer ring size.
> > +	 */
> > +	if (refill_request != n ||
> > +		(!refill_request && (refill_head + n > buf_ring_size)))
> > +		return 0;
> > +
> > +	/* First buffer to free from S/W ring is at index
> > +	 * tx_next_dd - (tx_rs_thresh-1).
> > +	 */
> > +	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> > +	rxep = rxq_buf_recycle_info->buf_ring;
> > +	rxep += refill_head;
> > +
> > +	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > +		/* Directly put mbufs from Tx to Rx. */
> > +		for (i = 0; i < n; i++, rxep++, txep++)
> > +			*rxep = txep[0].mbuf;
> > +	} else {
> > +		for (i = 0, j = 0; i < n; i++) {
> > +			/* Avoid txq contains buffers from expected mempoo.
> */
> 
> mempool (unless trying to introduce a new concept :)
Agree.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v6 0/4] Recycle mbufs from Tx queue to Rx queue
  2021-12-24 16:46 [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Feifei Wang
                   ` (6 preceding siblings ...)
  2023-03-30  6:29 ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Feifei Wang
@ 2023-05-25  9:45 ` Feifei Wang
  2023-05-25  9:45   ` [PATCH v6 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
                     ` (3 more replies)
  7 siblings, 4 replies; 67+ messages in thread
From: Feifei Wang @ 2023-05-25  9:45 UTC (permalink / raw)
  Cc: dev, nd, Feifei Wang

Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.

This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache. i.e. the receive side will free the buffers from
transmit side directly into its software ring. This will avoid the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache. And we can call this mode as mbufs
recycle mode.

In the latest version, mbufs recycle mode is packaged as a separate API.
This allows users to change the rxq/txq pairing in real time in the data plane,
according to the application's analysis of the packet flow, for example:
-----------------------------------------------------------------------
Step 1: upper application analyse the flow direction
Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid)
Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info);
Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------
The above allows the user to change the rxq/txq pairing at run time, and the user does not need to
know the direction of the flow in advance. This can effectively expand mbufs recycle mode's
use scenarios.
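
A minimal data-plane sketch of the steps above (illustrative only, not part of the
patches; port/queue ids, burst size and error handling are placeholders):

#include <rte_ethdev.h>

#define MAX_PKT_BURST 32

static void
recycle_fwd_loop(uint16_t rx_portid, uint16_t rx_queueid,
		uint16_t tx_portid, uint16_t tx_queueid)
{
	struct rte_eth_recycle_rxq_info recycle_rxq_info;
	struct rte_mbuf *pkts[MAX_PKT_BURST];
	uint16_t nb_rx;

	/* Step 1-2: once the application knows the flow direction,
	 * fetch the Rx mbuf ring information.
	 */
	rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid,
			&recycle_rxq_info);

	for (;;) {
		/* Step 3: copy used mbufs from the Tx sw-ring into the
		 * Rx mbuf ring and refill the Rx descriptors.
		 */
		rte_eth_recycle_mbufs(rx_portid, rx_queueid,
				tx_portid, tx_queueid, &recycle_rxq_info);

		/* Step 4-5: the usual Rx/Tx burst processing
		 * (freeing of unsent packets is omitted here).
		 */
		nb_rx = rte_eth_rx_burst(rx_portid, rx_queueid,
				pkts, MAX_PKT_BURST);
		if (nb_rx == 0)
			continue;
		rte_eth_tx_burst(tx_portid, tx_queueid, pkts, nb_rx);
	}
}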

Furthermore, mbufs recycle mode is no longer limited to a single PMD:
it can move mbufs between PMDs from different vendors, and can even place the mbufs
anywhere in your Rx mbuf ring as long as the address of the mbuf ring can be provided.
In the latest version, we enable mbufs recycle mode in the i40e and ixgbe PMDs, and also try
using the i40e driver on Rx and the ixgbe driver on Tx, achieving a 7-9% performance improvement
with mbufs recycle mode.

Difference between mbuf recycle, ZC API used in mempool and general path
For general path: 
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
                Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
                Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
                Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For mbufs recycle:
                Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus, in one loop, compared to the general path, mbufs recycle mode saves 32+32=64 pkts memcpy;
compared to the ZC API used in mempool, mbufs recycle mode saves 32 pkts memcpy in each loop.
So, mbufs recycle has its own benefits.

Testing status:
(1) dpdk l3fwd test with multiple drivers:
    port 0: 82599 NIC   port 1: XL710 NIC
-------------------------------------------------------------
		Without fast free	With fast free
Thunderx2:      +7.53%	                +13.54%
-------------------------------------------------------------

(2) dpdk l3fwd test with same driver:
    port 0 && 1: XL710 NIC
-------------------------------------------------------------
		Without fast free	With fast free
Ampere altra:   +12.61%		        +11.42%
n1sdp:		+8.30%			+3.85%
x86-sse:	+8.43%			+3.72%
-------------------------------------------------------------

(3) Performance comparison with ZC_mempool used
    port 0 && 1: XL710 NIC
    with fast free
-------------------------------------------------------------
		With recycle buffer	With zc_mempool
Ampere altra:	11.42%			3.54%
-------------------------------------------------------------

Furthermore, we add a recycle_mbufs engine in testpmd. Because the XL710 NIC has
an I/O bottleneck in testpmd on Ampere Altra, we cannot see a throughput change
compared with the I/O fwd engine. However, using the record command in testpmd:
'$set record-burst-stats on'
we can see that the ratio of 'Rx/Tx burst size of 32' is reduced. This
indicates that mbufs recycle can save CPU cycles.

V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)

V3:
1. Separate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
2. Delete L3fwd change for direct rearm (Jerin)
3. Enable direct rearm in ixgbe driver in Arm

v4:
1. Rename direct-rearm as buffer recycle. Based on this, function name
and variable name are changed to make this mode more general for all
drivers. (Konstantin, Morten)
2. Add ring wrapping check (Konstantin)

v5:
1. some change for ethdev API (Morten)
2. add support for avx2, sse, altivec path

v6:
1. fix ixgbe build issue in ppc
2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
   API wrapper (Tech Board meeting)
3. add recycle_mbufs engine in testpmd (Tech Board meeting)
4. add namespace in the functions related to mbufs recycle (Ferruh)

Feifei Wang (4):
  ethdev: add API for mbufs recycle mode
  net/i40e: implement mbufs recycle mode
  net/ixgbe: implement mbufs recycle mode
  app/testpmd: add recycle mbufs engine

 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/recycle_mbufs.c                  |  79 ++++++++
 app/test-pmd/testpmd.c                        |   1 +
 app/test-pmd/testpmd.h                        |   3 +
 doc/guides/rel_notes/release_23_07.rst        |   7 +
 doc/guides/testpmd_app_ug/run_app.rst         |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_ethdev.h                |   2 +
 .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 140 ++++++++++++++
 drivers/net/i40e/i40e_rxtx.c                  |  32 +++
 drivers/net/i40e/i40e_rxtx.h                  |   4 +
 drivers/net/i40e/meson.build                  |   2 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h              |   3 +
 .../ixgbe/ixgbe_recycle_mbufs_vec_common.c    | 136 +++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.c                |  29 +++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   4 +
 drivers/net/ixgbe/meson.build                 |   2 +
 lib/ethdev/ethdev_driver.h                    |  10 +
 lib/ethdev/ethdev_private.c                   |   2 +
 lib/ethdev/rte_ethdev.c                       |  31 +++
 lib/ethdev/rte_ethdev.h                       | 182 ++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h                  |  15 +-
 lib/ethdev/version.map                        |   4 +
 25 files changed, 694 insertions(+), 3 deletions(-)
 create mode 100644 app/test-pmd/recycle_mbufs.c
 create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
 create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-05-25  9:45 ` [PATCH v6 0/4] Recycle mbufs from Tx queue to Rx queue Feifei Wang
@ 2023-05-25  9:45   ` Feifei Wang
  2023-05-25 15:08     ` Morten Brørup
  2023-06-05 12:53     ` Константин Ананьев
  2023-05-25  9:45   ` [PATCH v6 2/4] net/i40e: implement " Feifei Wang
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 67+ messages in thread
From: Feifei Wang @ 2023-05-25  9:45 UTC (permalink / raw)
  To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
APIs to recycle used mbufs from a transmit queue of an Ethernet device,
and move these mbufs into a mbuf ring for a receive queue of an Ethernet
device. This can bypass mempool 'put/get' operations hence saving CPU
cycles.

When recycling mbufs, the rte_eth_recycle_mbufs() function performs
the following operations:
- Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf
ring.
- Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
from the Tx mbuf ring.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/rel_notes/release_23_07.rst |   7 +
 lib/ethdev/ethdev_driver.h             |  10 ++
 lib/ethdev/ethdev_private.c            |   2 +
 lib/ethdev/rte_ethdev.c                |  31 +++++
 lib/ethdev/rte_ethdev.h                | 182 +++++++++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h           |  15 +-
 lib/ethdev/version.map                 |   4 +
 7 files changed, 249 insertions(+), 2 deletions(-)

diff --git a/doc/guides/rel_notes/release_23_07.rst b/doc/guides/rel_notes/release_23_07.rst
index a9b1293689..f279036cb9 100644
--- a/doc/guides/rel_notes/release_23_07.rst
+++ b/doc/guides/rel_notes/release_23_07.rst
@@ -55,6 +55,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Add mbufs recycling support.**
+  Added ``rte_eth_recycle_rx_queue_info_get`` and ``rte_eth_recycle_mbufs``
+  APIs which allow the user to copy used mbufs from the Tx mbuf ring
+  into the Rx mbuf ring. This feature supports the case where the Rx Ethernet
+  device is different from the Tx Ethernet device, using the respective driver
+  callback functions in ``rte_eth_recycle_mbufs``.
+
 
 Removed Items
 -------------
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 2c9d615fb5..c6723d5277 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -59,6 +59,10 @@ struct rte_eth_dev {
 	eth_rx_descriptor_status_t rx_descriptor_status;
 	/** Check the status of a Tx descriptor */
 	eth_tx_descriptor_status_t tx_descriptor_status;
+	/** Pointer to PMD transmit mbufs reuse function */
+	eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+	/** Pointer to PMD receive descriptors refill function */
+	eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
 
 	/**
 	 * Device data that is shared between primary and secondary processes
@@ -504,6 +508,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
 typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
 	uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
 
+typedef void (*eth_recycle_rxq_info_get_t)(struct rte_eth_dev *dev,
+	uint16_t rx_queue_id,
+	struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
 typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
 	uint16_t queue_id, struct rte_eth_burst_mode *mode);
 
@@ -1247,6 +1255,8 @@ struct eth_dev_ops {
 	eth_rxq_info_get_t         rxq_info_get;
 	/** Retrieve Tx queue information */
 	eth_txq_info_get_t         txq_info_get;
+	/** Retrieve mbufs recycle Rx queue information */
+	eth_recycle_rxq_info_get_t recycle_rxq_info_get;
 	eth_burst_mode_get_t       rx_burst_mode_get; /**< Get Rx burst mode */
 	eth_burst_mode_get_t       tx_burst_mode_get; /**< Get Tx burst mode */
 	eth_fw_version_get_t       fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 14ec8c6ccf..f8ab64f195 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
 	fpo->rx_queue_count = dev->rx_queue_count;
 	fpo->rx_descriptor_status = dev->rx_descriptor_status;
 	fpo->tx_descriptor_status = dev->tx_descriptor_status;
+	fpo->recycle_tx_mbufs_reuse = dev->recycle_tx_mbufs_reuse;
+	fpo->recycle_rx_descriptors_refill = dev->recycle_rx_descriptors_refill;
 
 	fpo->rxq.data = dev->data->rx_queues;
 	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 4d03255683..7c27dcfea4 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -5784,6 +5784,37 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 	return 0;
 }
 
+int
+rte_eth_recycle_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
+		struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+		return -EINVAL;
+	}
+
+	if (dev->data->rx_queues == NULL ||
+			dev->data->rx_queues[queue_id] == NULL) {
+		RTE_ETHDEV_LOG(ERR,
+			   "Rx queue %"PRIu16" of device with port_id=%"
+			   PRIu16" has not been setup\n",
+			   queue_id, port_id);
+		return -EINVAL;
+	}
+
+	if (*dev->dev_ops->recycle_rxq_info_get == NULL)
+		return -ENOTSUP;
+
+	dev->dev_ops->recycle_rxq_info_get(dev, queue_id, recycle_rxq_info);
+
+	return 0;
+}
+
 int
 rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
 			  struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 99fe9e238b..7434aa2483 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1820,6 +1820,30 @@ struct rte_eth_txq_info {
 	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
 } __rte_cache_min_aligned;
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice.
+ *
+ * Ethernet device Rx queue information structure for recycling mbufs.
+ * Used to retrieve Rx queue information when the Tx queue is reusing mbufs
+ * and moving them into the Rx mbuf ring.
+ */
+struct rte_eth_recycle_rxq_info {
+	struct rte_mbuf **mbuf_ring; /**< mbuf ring of Rx queue. */
+	struct rte_mempool *mp;     /**< mempool of Rx queue. */
+	uint16_t *refill_head;      /**< head of Rx queue refilling mbufs. */
+	uint16_t *receive_tail;     /**< tail of Rx queue receiving pkts. */
+	uint16_t mbuf_ring_size;     /**< configured size of the mbuf ring. */
+	/**
+	 * Requirement on mbuf refilling batch size of Rx mbuf ring.
+	 * For some PMD drivers, the number of Rx mbuf ring refilling mbufs
+	 * should be aligned with mbuf ring size, in order to simplify
+	 * ring wrapping around.
+	 * Value 0 means that PMD drivers have no requirement for this.
+	 */
+	uint16_t refill_requirement;
+} __rte_cache_min_aligned;
+
 /* Generic Burst mode flag definition, values can be ORed. */
 
 /**
@@ -4809,6 +4833,31 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Retrieve information about a given port's Rx queue for recycling mbufs.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param queue_id
+ *   The Rx queue on the Ethernet device for which information
+ *   will be retrieved.
+ * @param recycle_rxq_info
+ *   A pointer to a structure of type *rte_eth_recycle_rxq_info* to be filled.
+ *
+ * @return
+ *   - 0: Success
+ *   - -ENODEV:  If *port_id* is invalid.
+ *   - -ENOTSUP: routine is not supported by the device PMD.
+ *   - -EINVAL:  The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_recycle_rx_queue_info_get(uint16_t port_id,
+		uint16_t queue_id,
+		struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
 /**
  * Retrieve information about the Rx packet burst mode.
  *
@@ -6483,6 +6532,139 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
 	return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
 }
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Recycle used mbufs from a transmit queue of an Ethernet device, and move
+ * these mbufs into a mbuf ring for a receive queue of an Ethernet device.
+ * This can bypass mempool path to save CPU cycles.
+ *
+ * The rte_eth_recycle_mbufs() function is used in a loop, together with
+ * rte_eth_rx_burst() and rte_eth_tx_burst(), freeing used Tx mbufs and
+ * replenishing Rx descriptors. The number of recycled mbufs depends on the
+ * request of the Rx mbuf ring, constrained by enough used mbufs from the Tx mbuf ring.
+ *
+ * When recycling mbufs, the rte_eth_recycle_mbufs() function performs the
+ * following operations:
+ *
+ * - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf ring.
+ *
+ * - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
+ *   from the Tx mbuf ring.
+ *
+ * This function splits the Rx and Tx paths with different callback functions. The
+ * callback function recycle_tx_mbufs_reuse is for the Tx driver. The callback
+ * function recycle_rx_descriptors_refill is for the Rx driver. rte_eth_recycle_mbufs()
+ * can support the case where the Rx Ethernet device is different from the Tx Ethernet device.
+ *
+ * It is the responsibility of users to select the Rx/Tx queue pair to recycle
+ * mbufs. Before calling this function, users must call the
+ * rte_eth_recycle_rx_queue_info_get function to retrieve the selected Rx queue information.
+ * @see rte_eth_recycle_rx_queue_info_get, struct rte_eth_recycle_rxq_info
+ *
+ * Currently, the rte_eth_recycle_mbufs() function can only support one-time pairing
+ * between the receive queue and transmit queue. Do not pair one receive queue with
+ * multiple transmit queues or pair one transmit queue with multiple receive queues,
+ * in order to avoid erroneous memory overwrites.
+ *
+ * @param rx_port_id
+ *   Port identifying the receive side.
+ * @param rx_queue_id
+ *   The index of the receive queue identifying the receive side.
+ *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param tx_port_id
+ *   Port identifying the transmit side.
+ * @param tx_queue_id
+ *   The index of the transmit queue identifying the transmit side.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param recycle_rxq_info
+ *   A pointer to a structure of type *rte_eth_recycle_rxq_info* which contains
+ *   the information of the Rx queue mbuf ring.
+ * @return
+ *   The number of recycling mbufs.
+ */
+__rte_experimental
+static inline uint16_t
+rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
+		uint16_t tx_port_id, uint16_t tx_queue_id,
+		struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+	struct rte_eth_fp_ops *p;
+	void *qd;
+	uint16_t nb_mbufs;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+	if (tx_port_id >= RTE_MAX_ETHPORTS ||
+			tx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		RTE_ETHDEV_LOG(ERR,
+				"Invalid tx_port_id=%u or tx_queue_id=%u\n",
+				tx_port_id, tx_queue_id);
+		return 0;
+	}
+#endif
+
+	/* fetch pointer to queue data */
+	p = &rte_eth_fp_ops[tx_port_id];
+	qd = p->txq.data[tx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
+
+	if (qd == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+				tx_queue_id, tx_port_id);
+		return 0;
+	}
+#endif
+	if (p->recycle_tx_mbufs_reuse == NULL)
+		return 0;
+
+	/* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
+	 * into Rx mbuf ring.
+	 */
+	nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
+
+	/* If no recycling mbufs, return 0. */
+	if (nb_mbufs == 0)
+		return 0;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+	if (rx_port_id >= RTE_MAX_ETHPORTS ||
+			rx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+		RTE_ETHDEV_LOG(ERR, "Invalid rx_port_id=%u or rx_queue_id=%u\n",
+				rx_port_id, rx_queue_id);
+		return 0;
+	}
+#endif
+
+	/* fetch pointer to queue data */
+	p = &rte_eth_fp_ops[rx_port_id];
+	qd = p->rxq.data[rx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
+
+	if (qd == NULL) {
+		RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+				rx_queue_id, rx_port_id);
+		return 0;
+	}
+#endif
+
+	if (p->recycle_rx_descriptors_refill == NULL)
+		return 0;
+
+	/* Replenish the Rx descriptors with the recycling
+	 * mbufs copied into the Rx mbuf ring.
+	 */
+	p->recycle_rx_descriptors_refill(qd, nb_mbufs);
+
+	return nb_mbufs;
+}
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index dcf8adab92..a2e6ea6b6c 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
 /** @internal Check the status of a Tx descriptor */
 typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
 
+/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
+typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,
+		struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
+/** @internal Refill Rx descriptors with the recycling mbufs */
+typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
+
 /**
  * @internal
  * Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -90,9 +97,11 @@ struct rte_eth_fp_ops {
 	eth_rx_queue_count_t rx_queue_count;
 	/** Check the status of a Rx descriptor. */
 	eth_rx_descriptor_status_t rx_descriptor_status;
+	/** Refill Rx descriptors with the recycling mbufs. */
+	eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
 	/** Rx queues data. */
 	struct rte_ethdev_qdata rxq;
-	uintptr_t reserved1[3];
+	uintptr_t reserved1[2];
 	/**@}*/
 
 	/**@{*/
@@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
 	eth_tx_prep_t tx_pkt_prepare;
 	/** Check the status of a Tx descriptor. */
 	eth_tx_descriptor_status_t tx_descriptor_status;
+	/** Copy used mbufs from Tx mbuf ring into Rx. */
+	eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
 	/** Tx queues data. */
 	struct rte_ethdev_qdata txq;
-	uintptr_t reserved2[3];
+	uintptr_t reserved2[2];
 	/**@}*/
 
 } __rte_cache_aligned;
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 357d1a88c0..45c417f6bd 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -299,6 +299,10 @@ EXPERIMENTAL {
 	rte_flow_action_handle_query_update;
 	rte_flow_async_action_handle_query_update;
 	rte_flow_async_create_by_index;
+
+	# added in 23.07
+	rte_eth_recycle_mbufs;
+	rte_eth_recycle_rx_queue_info_get;
 };
 
 INTERNAL {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v6 2/4] net/i40e: implement mbufs recycle mode
  2023-05-25  9:45 ` [PATCH v6 0/4] Recycle mbufs from Tx queue to Rx queue Feifei Wang
  2023-05-25  9:45   ` [PATCH v6 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
@ 2023-05-25  9:45   ` Feifei Wang
  2023-06-05 13:02     ` Константин Ананьев
  2023-05-25  9:45   ` [PATCH v6 3/4] net/ixgbe: " Feifei Wang
  2023-05-25  9:45   ` [PATCH v6 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
  3 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2023-05-25  9:45 UTC (permalink / raw)
  To: Yuying Zhang, Beilei Xing
  Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

Define specific function implementation for i40e driver.
Currently, mbufs recycle mode can support 128bit
vector path and avx2 path. And can be enabled both in
fast free and no fast free mode.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_ethdev.h                |   2 +
 .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 140 ++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.c                  |  32 ++++
 drivers/net/i40e/i40e_rxtx.h                  |   4 +
 drivers/net/i40e/meson.build                  |   2 +
 6 files changed, 181 insertions(+)
 create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index f9d8f9791f..d4eecd16cf 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.flow_ops_get                 = i40e_dev_flow_ops_get,
 	.rxq_info_get                 = i40e_rxq_info_get,
 	.txq_info_get                 = i40e_txq_info_get,
+	.recycle_rxq_info_get         = i40e_recycle_rxq_info_get,
 	.rx_burst_mode_get            = i40e_rx_burst_mode_get,
 	.tx_burst_mode_get            = i40e_tx_burst_mode_get,
 	.timesync_enable              = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 9b806d130e..b5b2d6cf2b 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	struct rte_eth_rxq_info *qinfo);
 void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
+void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+	struct rte_eth_recycle_rxq_info *recycle_rxq_info);
 int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
 			   struct rte_eth_burst_mode *mode);
 int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..08d708fd7d
--- /dev/null
+++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
@@ -0,0 +1,140 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "base/i40e_prototype.h"
+#include "base/i40e_type.h"
+#include "i40e_ethdev.h"
+#include "i40e_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+	struct i40e_rx_queue *rxq = rx_queue;
+	struct i40e_rx_entry *rxep;
+	volatile union i40e_rx_desc *rxdp;
+	uint16_t rx_id;
+	uint64_t paddr;
+	uint64_t dma_addr;
+	uint16_t i;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	for (i = 0; i < nb_mbufs; i++) {
+		/* Initialize rxdp descs. */
+		paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr = rte_cpu_to_le_64(paddr);
+		/* flush desc with pa dma_addr */
+		rxdp[i].read.hdr_addr = 0;
+		rxdp[i].read.pkt_addr = dma_addr;
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += nb_mbufs;
+	rx_id = rxq->rxrearm_start - 1;
+
+	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+		rxq->rxrearm_start = 0;
+		rx_id = rxq->nb_rx_desc - 1;
+	}
+
+	rxq->rxrearm_nb -= nb_mbufs;
+
+	rte_io_wmb();
+	/* Update the tail pointer on the NIC */
+	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
+uint16_t
+i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+	struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+	struct i40e_tx_queue *txq = tx_queue;
+	struct i40e_tx_entry *txep;
+	struct rte_mbuf **rxep;
+	struct rte_mbuf *m[RTE_I40E_TX_MAX_FREE_BUF_SZ];
+	int i, j, n;
+	uint16_t avail = 0;
+	uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+	uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+	uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+	uint16_t refill_head = *recycle_rxq_info->refill_head;
+	uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+	/* Get available recycling Rx buffers. */
+	avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+	/* Check Tx free thresh and Rx available space. */
+	if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+		return 0;
+
+	/* check DD bits on threshold descriptor */
+	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+			rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	/* Mbufs recycle mode can only support no ring buffer wrapping around.
+	 * Two case for this:
+	 *
+	 * case 1: The refill head of Rx buffer ring needs to be aligned with
+	 * mbuf ring size. In this case, the number of Tx freeing buffers
+	 * should be equal to refill_requirement.
+	 *
+	 * case 2: The refill head of Rx ring buffer does not need to be aligned
+	 * with mbuf ring size. In this case, the update of refill head can not
+	 * exceed the Rx mbuf ring size.
+	 */
+	if (refill_requirement != n ||
+		(!refill_requirement && (refill_head + n > mbuf_ring_size)))
+		return 0;
+
+	/* First buffer to free from S/W ring is at index
+	 * tx_next_dd - (tx_rs_thresh-1).
+	 */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	rxep = recycle_rxq_info->mbuf_ring;
+	rxep += refill_head;
+
+	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+		/* Directly put mbufs from Tx to Rx. */
+		for (i = 0; i < n; i++, rxep++, txep++)
+			*rxep = txep[0].mbuf;
+	} else {
+		for (i = 0, j = 0; i < n; i++) {
+			/* Abort recycling if a Tx buffer does not come from the expected mempool. */
+			if (unlikely(recycle_rxq_info->mp
+						!= txep[i].mbuf->pool))
+				return 0;
+
+			m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+			/* In case 1, each of Tx buffers should be the
+			 * last reference.
+			 */
+			if (unlikely(m[j] == NULL && refill_requirement))
+				return 0;
+			/* In case 2, the number of valid Tx free
+			 * buffers should be recorded.
+			 */
+			j++;
+		}
+		rte_memcpy(rxep, m, sizeof(void *) * j);
+	}
+
+	/* Update counters for Tx. */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return n;
+}
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..53cf787f04 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3197,6 +3197,30 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	qinfo->conf.offloads = txq->offloads;
 }
 
+void
+i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+	struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+	struct i40e_rx_queue *rxq;
+	struct i40e_adapter *ad =
+		I40E_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+
+	rxq = dev->data->rx_queues[queue_id];
+
+	recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+	recycle_rxq_info->mp = rxq->mp;
+	recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+	recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+	if (ad->rx_vec_allowed) {
+		recycle_rxq_info->refill_requirement = RTE_I40E_RXQ_REARM_THRESH;
+		recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+	} else {
+		recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+		recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+	}
+}
+
 #ifdef RTE_ARCH_X86
 static inline bool
 get_avx_supported(bool request_avx512)
@@ -3291,6 +3315,8 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
 				dev->rx_pkt_burst = ad->rx_use_avx2 ?
 					i40e_recv_scattered_pkts_vec_avx2 :
 					i40e_recv_scattered_pkts_vec;
+				dev->recycle_rx_descriptors_refill =
+					i40e_recycle_rx_descriptors_refill_vec;
 			}
 		} else {
 			if (ad->rx_use_avx512) {
@@ -3309,9 +3335,12 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
 				dev->rx_pkt_burst = ad->rx_use_avx2 ?
 					i40e_recv_pkts_vec_avx2 :
 					i40e_recv_pkts_vec;
+				dev->recycle_rx_descriptors_refill =
+					i40e_recycle_rx_descriptors_refill_vec;
 			}
 		}
 #else /* RTE_ARCH_X86 */
+		dev->recycle_rx_descriptors_refill = i40e_recycle_rx_descriptors_refill_vec;
 		if (dev->data->scattered_rx) {
 			PMD_INIT_LOG(DEBUG,
 				     "Using Vector Scattered Rx (port %d).",
@@ -3479,15 +3508,18 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
 				dev->tx_pkt_burst = ad->tx_use_avx2 ?
 						    i40e_xmit_pkts_vec_avx2 :
 						    i40e_xmit_pkts_vec;
+				dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
 			}
 #else /* RTE_ARCH_X86 */
 			PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
 				     dev->data->port_id);
 			dev->tx_pkt_burst = i40e_xmit_pkts_vec;
+			dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
 #endif /* RTE_ARCH_X86 */
 		} else {
 			PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
 			dev->tx_pkt_burst = i40e_xmit_pkts_simple;
+			dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
 		}
 		dev->tx_pkt_prepare = i40e_simple_prep_pkts;
 	} else {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 5e6eecc501..ed8921ddc0 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -233,6 +233,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
 int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
 int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
 
+uint16_t i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+		struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
 uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			    uint16_t nb_pkts);
 uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
diff --git a/drivers/net/i40e/meson.build b/drivers/net/i40e/meson.build
index 8e53b87a65..58eb627abc 100644
--- a/drivers/net/i40e/meson.build
+++ b/drivers/net/i40e/meson.build
@@ -42,6 +42,8 @@ testpmd_sources = files('i40e_testpmd.c')
 deps += ['hash']
 includes += include_directories('base')
 
+sources += files('i40e_recycle_mbufs_vec_common.c')
+
 if arch_subdir == 'x86'
     sources += files('i40e_rxtx_vec_sse.c')
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v6 3/4] net/ixgbe: implement mbufs recycle mode
  2023-05-25  9:45 ` [PATCH v6 0/4] Recycle mbufs from Tx queue to Rx queue Feifei Wang
  2023-05-25  9:45   ` [PATCH v6 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
  2023-05-25  9:45   ` [PATCH v6 2/4] net/i40e: implement " Feifei Wang
@ 2023-05-25  9:45   ` Feifei Wang
  2023-05-25  9:45   ` [PATCH v6 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
  3 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-05-25  9:45 UTC (permalink / raw)
  To: Qiming Yang, Wenjun Wu
  Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

Define the ixgbe driver specific implementation of the mbufs recycle
callbacks. Currently, mbufs recycle mode supports the 128-bit vector
path, and can be enabled in both fast-free and no-fast-free modes.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h              |   3 +
 .../ixgbe/ixgbe_recycle_mbufs_vec_common.c    | 136 ++++++++++++++++++
 drivers/net/ixgbe/ixgbe_rxtx.c                |  29 ++++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   4 +
 drivers/net/ixgbe/meson.build                 |   2 +
 6 files changed, 175 insertions(+)
 create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 88118bc305..db47100a37 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -543,6 +543,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
 	.set_mc_addr_list     = ixgbe_dev_set_mc_addr_list,
 	.rxq_info_get         = ixgbe_rxq_info_get,
 	.txq_info_get         = ixgbe_txq_info_get,
+	.recycle_rxq_info_get = ixgbe_recycle_rxq_info_get,
 	.timesync_enable      = ixgbe_timesync_enable,
 	.timesync_disable     = ixgbe_timesync_disable,
 	.timesync_read_rx_timestamp = ixgbe_timesync_read_rx_timestamp,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 48290af512..9169930c7f 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -625,6 +625,9 @@ void ixgbe_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 void ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	struct rte_eth_txq_info *qinfo);
 
+void ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+		struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
 int ixgbevf_dev_rx_init(struct rte_eth_dev *dev);
 
 void ixgbevf_dev_tx_init(struct rte_eth_dev *dev);
diff --git a/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..f234e95e92
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "ixgbe_ethdev.h"
+#include "ixgbe_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+	struct ixgbe_rx_queue *rxq = rx_queue;
+	struct ixgbe_rx_entry *rxep;
+	volatile union ixgbe_adv_rx_desc *rxdp;
+	uint16_t rx_id;
+	uint64_t paddr;
+	uint64_t dma_addr;
+	uint16_t i;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	for (i = 0; i < nb_mbufs; i++) {
+		/* Initialize rxdp descs. */
+		paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+		dma_addr = rte_cpu_to_le_64(paddr);
+		/* Flush descriptors with pa dma_addr */
+		rxdp[i].read.hdr_addr = 0;
+		rxdp[i].read.pkt_addr = dma_addr;
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += nb_mbufs;
+	if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+		rxq->rxrearm_start = 0;
+
+	rxq->rxrearm_nb -= nb_mbufs;
+
+	rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+			(rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+	/* Update the tail pointer on the NIC */
+	IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
+}
+
+uint16_t
+ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+		struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+	struct ixgbe_tx_queue *txq = tx_queue;
+	struct ixgbe_tx_entry *txep;
+	struct rte_mbuf **rxep;
+	struct rte_mbuf *m[RTE_IXGBE_TX_MAX_FREE_BUF_SZ];
+	int i, j, n;
+	uint32_t status;
+	uint16_t avail = 0;
+	uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+	uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+	uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+	uint16_t refill_head = *recycle_rxq_info->refill_head;
+	uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+	/* Get available recycling Rx buffers. */
+	avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+	/* Check Tx free thresh and Rx available space. */
+	if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+		return 0;
+
+	/* check DD bits on threshold descriptor */
+	status = txq->tx_ring[txq->tx_next_dd].wb.status;
+	if (!(status & IXGBE_ADVTXD_STAT_DD))
+		return 0;
+
+	n = txq->tx_rs_thresh;
+
+	/* Mbufs recycle can only support the case where the buffer ring
+	 * does not wrap around. Two cases are handled:
+	 *
+	 * case 1: The refill head of Rx buffer ring needs to be aligned with
+	 * buffer ring size. In this case, the number of Tx freeing buffers
+	 * should be equal to refill_requirement.
+	 *
+	 * case 2: The refill head of Rx ring buffer does not need to be aligned
+	 * with buffer ring size. In this case, the update of refill head can not
+	 * exceed the Rx buffer ring size.
+	 */
+	if (refill_requirement != n ||
+		(!refill_requirement && (refill_head + n > mbuf_ring_size)))
+		return 0;
+
+	/* First buffer to free from S/W ring is at index
+	 * tx_next_dd - (tx_rs_thresh-1).
+	 */
+	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+	rxep = recycle_rxq_info->mbuf_ring;
+	rxep += refill_head;
+
+	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+		/* Directly put mbufs from Tx to Rx. */
+		for (i = 0; i < n; i++, rxep++, txep++)
+			*rxep = txep[0].mbuf;
+	} else {
+		for (i = 0, j = 0; i < n; i++) {
+			/* Abort recycling if a Tx buffer does not come from the expected mempool. */
+			if (unlikely(recycle_rxq_info->mp
+						!= txep[i].mbuf->pool))
+				return 0;
+
+			m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+			/* In case 1, each of Tx buffers should be the
+			 * last reference.
+			 */
+			if (unlikely(m[j] == NULL && refill_requirement))
+				return 0;
+			/* In case 2, the number of valid Tx free
+			 * buffers should be recorded.
+			 */
+			j++;
+		}
+		rte_memcpy(rxep, m, sizeof(void *) * j);
+	}
+
+	/* Update counters for Tx. */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+	return n;
+}
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index c9d6ca9efe..ef2ea84bfa 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2558,6 +2558,7 @@ ixgbe_set_tx_function(struct rte_eth_dev *dev, struct ixgbe_tx_queue *txq)
 				(rte_eal_process_type() != RTE_PROC_PRIMARY ||
 					ixgbe_txq_vec_setup(txq) == 0)) {
 			PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
+			dev->recycle_tx_mbufs_reuse = ixgbe_recycle_tx_mbufs_reuse_vec;
 			dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
 		} else
 		dev->tx_pkt_burst = ixgbe_xmit_pkts_simple;
@@ -4823,6 +4824,8 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
 					    "callback (port=%d).",
 				     dev->data->port_id);
 
+			dev->recycle_rx_descriptors_refill =
+				ixgbe_recycle_rx_descriptors_refill_vec;
 			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
 		} else if (adapter->rx_bulk_alloc_allowed) {
 			PMD_INIT_LOG(DEBUG, "Using a Scattered with bulk "
@@ -4852,6 +4855,7 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
 			     RTE_IXGBE_DESCS_PER_LOOP,
 			     dev->data->port_id);
 
+		dev->recycle_rx_descriptors_refill = ixgbe_recycle_rx_descriptors_refill_vec;
 		dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
 	} else if (adapter->rx_bulk_alloc_allowed) {
 		PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
@@ -5623,6 +5627,31 @@ ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 	qinfo->conf.tx_deferred_start = txq->tx_deferred_start;
 }
 
+void
+ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+	struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+	struct ixgbe_rx_queue *rxq;
+	struct ixgbe_adapter *adapter = dev->data->dev_private;
+
+	rxq = dev->data->rx_queues[queue_id];
+
+	recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+	recycle_rxq_info->mp = rxq->mb_pool;
+	recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+	recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+	if (adapter->rx_vec_allowed) {
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+		recycle_rxq_info->refill_requirement = RTE_IXGBE_RXQ_REARM_THRESH;
+		recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+#endif
+	} else {
+		recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+		recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+	}
+}
+
 /*
  * [VF] Initializes Receive Unit.
  */
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 668a5b9814..ee89c89929 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -295,6 +295,10 @@ int ixgbe_dev_tx_done_cleanup(void *tx_queue, uint32_t free_cnt);
 extern const uint32_t ptype_table[IXGBE_PACKET_TYPE_MAX];
 extern const uint32_t ptype_table_tn[IXGBE_PACKET_TYPE_TN_MAX];
 
+uint16_t ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+		struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
 uint16_t ixgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 				    uint16_t nb_pkts);
 int ixgbe_txq_vec_setup(struct ixgbe_tx_queue *txq);
diff --git a/drivers/net/ixgbe/meson.build b/drivers/net/ixgbe/meson.build
index a18908ef7c..0ae12dd5ff 100644
--- a/drivers/net/ixgbe/meson.build
+++ b/drivers/net/ixgbe/meson.build
@@ -26,11 +26,13 @@ deps += ['hash', 'security']
 
 if arch_subdir == 'x86'
     sources += files('ixgbe_rxtx_vec_sse.c')
+    sources += files('ixgbe_recycle_mbufs_vec_common.c')
     if is_windows and cc.get_id() != 'clang'
         cflags += ['-fno-asynchronous-unwind-tables']
     endif
 elif arch_subdir == 'arm'
     sources += files('ixgbe_rxtx_vec_neon.c')
+    sources += files('ixgbe_recycle_mbufs_vec_common.c')
 endif
 
 includes += include_directories('base')
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v6 4/4] app/testpmd: add recycle mbufs engine
  2023-05-25  9:45 ` [PATCH v6 0/4] Recycle mbufs from Tx queue to Rx queue Feifei Wang
                     ` (2 preceding siblings ...)
  2023-05-25  9:45   ` [PATCH v6 3/4] net/ixgbe: " Feifei Wang
@ 2023-05-25  9:45   ` Feifei Wang
  2023-06-05 13:08     ` Константин Ананьев
  3 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2023-05-25  9:45 UTC (permalink / raw)
  To: Aman Singh, Yuying Zhang; +Cc: dev, nd, Feifei Wang, Jerin Jacob, Ruifeng Wang

Add a recycle mbufs engine for testpmd. This engine forwards packets
in I/O mode, but additionally enables the mbufs recycle feature to
recycle used Tx queue mbufs into the Rx queue mbuf ring, which bypasses
the mempool path and saves CPU cycles.
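
An example of selecting the engine at runtime (assuming testpmd has
already been started with the desired ports; port setup is not shown):

  testpmd> set fwd recycle_mbufs
  testpmd> start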

Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 app/test-pmd/meson.build                    |  1 +
 app/test-pmd/recycle_mbufs.c                | 79 +++++++++++++++++++++
 app/test-pmd/testpmd.c                      |  1 +
 app/test-pmd/testpmd.h                      |  3 +
 doc/guides/testpmd_app_ug/run_app.rst       |  1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |  5 +-
 6 files changed, 89 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/recycle_mbufs.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index d2e3f60892..6e5f067274 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -22,6 +22,7 @@ sources = files(
         'macswap.c',
         'noisy_vnf.c',
         'parameters.c',
+	'recycle_mbufs.c',
         'rxonly.c',
         'shared_rxq_fwd.c',
         'testpmd.c',
diff --git a/app/test-pmd/recycle_mbufs.c b/app/test-pmd/recycle_mbufs.c
new file mode 100644
index 0000000000..0c603c3ec2
--- /dev/null
+++ b/app/test-pmd/recycle_mbufs.c
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdarg.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <sys/queue.h>
+#include <sys/stat.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_debug.h>
+#include <rte_cycles.h>
+#include <rte_memory.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_branch_prediction.h>
+#include <rte_mbuf.h>
+#include <rte_interrupts.h>
+#include <rte_ether.h>
+#include <rte_ethdev.h>
+
+#include "testpmd.h"
+
+/*
+ * Forwarding of packets in I/O mode.
+ * Enable mbufs recycle mode to recycle txq used mbufs
+ * for rxq mbuf ring. This can bypass mempool path and
+ * save CPU cycles.
+ */
+static bool
+pkt_burst_recycle_mbufs(struct fwd_stream *fs)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	uint16_t nb_rx;
+
+	/* Recycle used mbufs from the txq, and move these mbufs into
+	 * the rxq mbuf ring.
+	 */
+	rte_eth_recycle_mbufs(fs->rx_port, fs->rx_queue,
+			fs->tx_port, fs->tx_queue, &(fs->recycle_rxq_info));
+
+	/*
+	 * Receive a burst of packets and forward them.
+	 */
+	nb_rx = common_fwd_stream_receive(fs, pkts_burst, nb_pkt_per_burst);
+	if (unlikely(nb_rx == 0))
+		return false;
+
+	common_fwd_stream_transmit(fs, pkts_burst, nb_rx);
+
+	return true;
+}
+
+static void
+recycle_mbufs_stream_init(struct fwd_stream *fs)
+{
+	/* Retrieve information about the given port's Rx queue
+	 * for recycling mbufs.
+	 */
+	rte_eth_recycle_rx_queue_info_get(fs->rx_port, fs->rx_queue,
+			&(fs->recycle_rxq_info));
+
+	common_fwd_stream_init(fs);
+}
+
+struct fwd_engine recycle_mbufs_engine = {
+	.fwd_mode_name  = "recycle_mbufs",
+	.stream_init    = recycle_mbufs_stream_init,
+	.packet_fwd     = pkt_burst_recycle_mbufs,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 5cb6f92523..050e48d79a 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -199,6 +199,7 @@ struct fwd_engine * fwd_engines[] = {
 	&icmp_echo_engine,
 	&noisy_vnf_engine,
 	&five_tuple_swap_fwd_engine,
+	&recycle_mbufs_engine,
 #ifdef RTE_LIBRTE_IEEE1588
 	&ieee1588_fwd_engine,
 #endif
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index bdfbfd36d3..34e72fd7d5 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -179,6 +179,8 @@ struct fwd_stream {
 	struct pkt_burst_stats rx_burst_stats;
 	struct pkt_burst_stats tx_burst_stats;
 	struct fwd_lcore *lcore; /**< Lcore being scheduled. */
+	/**< Rx queue information for recycling mbufs */
+	struct rte_eth_recycle_rxq_info recycle_rxq_info;
 };
 
 /**
@@ -432,6 +434,7 @@ extern struct fwd_engine csum_fwd_engine;
 extern struct fwd_engine icmp_echo_engine;
 extern struct fwd_engine noisy_vnf_engine;
 extern struct fwd_engine five_tuple_swap_fwd_engine;
+extern struct fwd_engine recycle_mbufs_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 57b23241cf..cbc68acc36 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -232,6 +232,7 @@ The command line options are:
        noisy
        5tswap
        shared-rxq
+       recycle_mbufs
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 8f23847859..482e583263 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -318,7 +318,7 @@ set fwd
 Set the packet forwarding mode::
 
    testpmd> set fwd (io|mac|macswap|flowgen| \
-                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
+                     rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq|recycle_mbufs) (""|retry)
 
 ``retry`` can be specified for forwarding engines except ``rx_only``.
 
@@ -364,6 +364,9 @@ The available information categories are:
 * ``shared-rxq``: Receive only for shared Rx queue.
   Resolve packet source port from mbuf and update stream statistics accordingly.
 
+* ``recycle_mbufs``: Recycle used mbufs from the Tx queue into the Rx queue mbuf ring.
+  This mode uses the fast path mbuf recycle feature and forwards packets in I/O mode.
+
 Example::
 
    testpmd> set fwd rxonly
-- 
2.25.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-05-25  9:45   ` [PATCH v6 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
@ 2023-05-25 15:08     ` Morten Brørup
  2023-05-31  6:10       ` Feifei Wang
  2023-06-05 12:53     ` Константин Ананьев
  1 sibling, 1 reply; 67+ messages in thread
From: Morten Brørup @ 2023-05-25 15:08 UTC (permalink / raw)
  To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang

> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Thursday, 25 May 2023 11.46
> 
> Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
> APIs to recycle used mbufs from a transmit queue of an Ethernet device,
> and move these mbufs into a mbuf ring for a receive queue of an Ethernet
> device. This can bypass mempool 'put/get' operations hence saving CPU
> cycles.
> 
> For each recycling mbufs, the rte_eth_recycle_mbufs() function performs
> the following operations:
> - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf
> ring.
> - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
> from the Tx mbuf ring.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---

[...]

> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 2c9d615fb5..c6723d5277 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -59,6 +59,10 @@ struct rte_eth_dev {
>  	eth_rx_descriptor_status_t rx_descriptor_status;
>  	/** Check the status of a Tx descriptor */
>  	eth_tx_descriptor_status_t tx_descriptor_status;
> +	/** Pointer to PMD transmit mbufs reuse function */
> +	eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
> +	/** Pointer to PMD receive descriptors refill function */
> +	eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
> 
>  	/**
>  	 * Device data that is shared between primary and secondary processes

The rte_eth_dev struct currently looks like this:

/**
 * @internal
 * The generic data structure associated with each Ethernet device.
 *
 * Pointers to burst-oriented packet receive and transmit functions are
 * located at the beginning of the structure, along with the pointer to
 * where all the data elements for the particular device are stored in shared
 * memory. This split allows the function pointer and driver data to be per-
 * process, while the actual configuration data for the device is shared.
 */
struct rte_eth_dev {
	eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function */
	eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function */

	/** Pointer to PMD transmit prepare function */
	eth_tx_prep_t tx_pkt_prepare;
	/** Get the number of used Rx descriptors */
	eth_rx_queue_count_t rx_queue_count;
	/** Check the status of a Rx descriptor */
	eth_rx_descriptor_status_t rx_descriptor_status;
	/** Check the status of a Tx descriptor */
	eth_tx_descriptor_status_t tx_descriptor_status;

	/**
	 * Device data that is shared between primary and secondary processes
	 */
	struct rte_eth_dev_data *data;
	void *process_private; /**< Pointer to per-process device data */
	const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
	struct rte_device *device; /**< Backing device */
	struct rte_intr_handle *intr_handle; /**< Device interrupt handle */

	/** User application callbacks for NIC interrupts */
	struct rte_eth_dev_cb_list link_intr_cbs;
	/**
	 * User-supplied functions called from rx_burst to post-process
	 * received packets before passing them to the user
	 */
	struct rte_eth_rxtx_callback *post_rx_burst_cbs[RTE_MAX_QUEUES_PER_PORT];
	/**
	 * User-supplied functions called from tx_burst to pre-process
	 * received packets before passing them to the driver for transmission
	 */
	struct rte_eth_rxtx_callback *pre_tx_burst_cbs[RTE_MAX_QUEUES_PER_PORT];

	enum rte_eth_dev_state state; /**< Flag indicating the port state */
	void *security_ctx; /**< Context for security ops */
} __rte_cache_aligned;

Inserting the two new function pointers (recycle_tx_mbufs_reuse and recycle_rx_descriptors_refill) as the 7th and 8th fields will move the 'data' and 'process_private' pointers out of the first cache line.

If those data pointers are used in the fast path with the rx_pkt_burst and tx_pkt_burst functions, moving them to a different cache line might have a performance impact on those two functions.

Disclaimer: This is a big "if", and wild speculation from me, because I haven't looked at it in detail! If this structure is not used in the fast path like this, you can ignore my suggestion below.

Please consider moving the 'data' and 'process_private' pointers to the beginning of this structure, so they are kept in the same cache line as the rx_pkt_burst and tx_pkt_burst function pointers.

I don't know the relative importance of the remaining six fast path functions (the four existing ones plus the two new ones in this patch), so you could also rearrange those, so the least important two functions are moved out of the first cache line. It doesn't have to be the two recycle functions that go into a different cache line.
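
For illustration only, a rough sketch of the kind of reordering I have in mind (not benchmarked,
and the exact placement of the remaining fast path function pointers is open for discussion):

struct rte_eth_dev {
	/* Keep the data pointers in the first cache line, next to the burst functions. */
	struct rte_eth_dev_data *data;
	void *process_private;

	eth_rx_burst_t rx_pkt_burst;
	eth_tx_burst_t tx_pkt_burst;
	eth_tx_prep_t tx_pkt_prepare;
	/* ... remaining function pointers and fields unchanged ... */
} __rte_cache_aligned;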

-Morten


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-05-25 15:08     ` Morten Brørup
@ 2023-05-31  6:10       ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-05-31  6:10 UTC (permalink / raw)
  To: Morten Brørup, thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Thursday, May 25, 2023 11:09 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net; Ferruh
> Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: RE: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Thursday, 25 May 2023 11.46
> >
> > Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
> > APIs to recycle used mbufs from a transmit queue of an Ethernet
> > device, and move these mbufs into a mbuf ring for a receive queue of
> > an Ethernet device. This can bypass mempool 'put/get' operations hence
> > saving CPU cycles.
> >
> > For each recycling mbufs, the rte_eth_recycle_mbufs() function
> > performs the following operations:
> > - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf
> > ring.
> > - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs
> > freed from the Tx mbuf ring.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> 
> [...]
> 
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 2c9d615fb5..c6723d5277 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -59,6 +59,10 @@ struct rte_eth_dev {
> >  	eth_rx_descriptor_status_t rx_descriptor_status;
> >  	/** Check the status of a Tx descriptor */
> >  	eth_tx_descriptor_status_t tx_descriptor_status;
> > +	/** Pointer to PMD transmit mbufs reuse function */
> > +	eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
> > +	/** Pointer to PMD receive descriptors refill function */
> > +	eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
> >
> >  	/**
> >  	 * Device data that is shared between primary and secondary
> > processes
> 
> The rte_eth_dev struct currently looks like this:
> 
> /**
>  * @internal
>  * The generic data structure associated with each Ethernet device.
>  *
>  * Pointers to burst-oriented packet receive and transmit functions are
>  * located at the beginning of the structure, along with the pointer to
>  * where all the data elements for the particular device are stored in shared
>  * memory. This split allows the function pointer and driver data to be per-
>  * process, while the actual configuration data for the device is shared.
>  */
> struct rte_eth_dev {
> 	eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function */
> 	eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function */
> 
> 	/** Pointer to PMD transmit prepare function */
> 	eth_tx_prep_t tx_pkt_prepare;
> 	/** Get the number of used Rx descriptors */
> 	eth_rx_queue_count_t rx_queue_count;
> 	/** Check the status of a Rx descriptor */
> 	eth_rx_descriptor_status_t rx_descriptor_status;
> 	/** Check the status of a Tx descriptor */
> 	eth_tx_descriptor_status_t tx_descriptor_status;
> 
> 	/**
> 	 * Device data that is shared between primary and secondary
> processes
> 	 */
> 	struct rte_eth_dev_data *data;
> 	void *process_private; /**< Pointer to per-process device data */
> 	const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD
> */
> 	struct rte_device *device; /**< Backing device */
> 	struct rte_intr_handle *intr_handle; /**< Device interrupt handle */
> 
> 	/** User application callbacks for NIC interrupts */
> 	struct rte_eth_dev_cb_list link_intr_cbs;
> 	/**
> 	 * User-supplied functions called from rx_burst to post-process
> 	 * received packets before passing them to the user
> 	 */
> 	struct rte_eth_rxtx_callback
> *post_rx_burst_cbs[RTE_MAX_QUEUES_PER_PORT];
> 	/**
> 	 * User-supplied functions called from tx_burst to pre-process
> 	 * received packets before passing them to the driver for transmission
> 	 */
> 	struct rte_eth_rxtx_callback
> *pre_tx_burst_cbs[RTE_MAX_QUEUES_PER_PORT];
> 
> 	enum rte_eth_dev_state state; /**< Flag indicating the port state */
> 	void *security_ctx; /**< Context for security ops */ }
> __rte_cache_aligned;
> 
> Inserting the two new function pointers (recycle_tx_mbufs_reuse and
> recycle_rx_descriptors_refill) as the 7th and 8th fields will move the 'data' and
> 'process_private' pointers out of the first cache line.
> 
> If those data pointers are used in the fast path with the rx_pkt_burst and
> tx_pkt_burst functions, moving them to a different cache line might have a
> performance impact on those two functions.
> 
> Disclaimer: This is a big "if", and wild speculation from me, because I haven't
> looked at it in detail! If this structure is not used in the fast path like this, you
> can ignore my suggestion below.
> 
> Please consider moving the 'data' and 'process_private' pointers to the
> beginning of this structure, so they are kept in the same cache line as the
> rx_pkt_burst and tx_pkt_burst function pointers.
> 
> I don't know the relative importance of the remaining six fast path functions
> (the four existing ones plus the two new ones in this patch), so you could also
> rearrange those, so the least important two functions are moved out of the
> first cache line. It doesn't have to be the two recycle functions that go into a
> different cache line.
> 
> -Morten

This is a good question. Reviewing the code, the pointers used in the fast path are mapped into
the structure 'struct rte_eth_fp_ops *fpo', which ensures that all fast-path pointers stay in the
same Rx/Tx cache lines:

void
eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
		const struct rte_eth_dev *dev)
{
	fpo->rx_pkt_burst = dev->rx_pkt_burst;
	fpo->tx_pkt_burst = dev->tx_pkt_burst;
	fpo->tx_pkt_prepare = dev->tx_pkt_prepare;
	fpo->rx_queue_count = dev->rx_queue_count;
	fpo->rx_descriptor_status = dev->rx_descriptor_status;
	fpo->tx_descriptor_status = dev->tx_descriptor_status;
	fpo->recycle_tx_mbufs_reuse = dev->recycle_tx_mbufs_reuse;
	fpo->recycle_rx_descriptors_refill = dev->recycle_rx_descriptors_refill;

	fpo->rxq.data = dev->data->rx_queues;
	fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;

	fpo->txq.data = dev->data->tx_queues;
	fpo->txq.clbk = (void **)(uintptr_t)dev->pre_tx_burst_cbs;
}

Besides the rx_queues and tx_queues pointers, which matter for the fast path, the other members of
'data' and 'process_private' are only used in the slow path, so it is not necessary for them to
stay in the first cache line.
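
For reference, the fast path then reads these pointers through rte_eth_fp_ops instead of
rte_eth_dev; a simplified sketch of the access pattern used by the Rx burst inline wrapper
(debug checks elided):

	struct rte_eth_fp_ops *p = &rte_eth_fp_ops[port_id];
	void *qd = p->rxq.data[queue_id];

	nb_rx = p->rx_pkt_burst(qd, rx_pkts, nb_pkts);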

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-05-25  9:45   ` [PATCH v6 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
  2023-05-25 15:08     ` Morten Brørup
@ 2023-06-05 12:53     ` Константин Ананьев
  2023-06-06  2:55       ` Feifei Wang
  1 sibling, 1 reply; 67+ messages in thread
From: Константин Ананьев @ 2023-06-05 12:53 UTC (permalink / raw)
  To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang

[-- Attachment #1: Type: text/html, Size: 18994 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 2/4] net/i40e: implement mbufs recycle mode
  2023-05-25  9:45   ` [PATCH v6 2/4] net/i40e: implement " Feifei Wang
@ 2023-06-05 13:02     ` Константин Ананьев
  2023-06-06  3:16       ` Feifei Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Константин Ананьев @ 2023-06-05 13:02 UTC (permalink / raw)
  To: Feifei Wang, Yuying Zhang, Beilei Xing
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang

[-- Attachment #1: Type: text/html, Size: 15029 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 4/4] app/testpmd: add recycle mbufs engine
  2023-05-25  9:45   ` [PATCH v6 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
@ 2023-06-05 13:08     ` Константин Ананьев
  2023-06-06  6:32       ` Feifei Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Константин Ананьев @ 2023-06-05 13:08 UTC (permalink / raw)
  To: Feifei Wang, Aman Singh, Yuying Zhang; +Cc: dev, nd, Jerin Jacob, Ruifeng Wang

[-- Attachment #1: Type: text/html, Size: 7448 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-06-05 12:53     ` Константин Ананьев
@ 2023-06-06  2:55       ` Feifei Wang
  2023-06-06  7:10         ` Konstantin Ananyev
  0 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2023-06-06  2:55 UTC (permalink / raw)
  To: Константин
	Ананьев,
	thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd

[-- Attachment #1: Type: text/plain, Size: 17652 bytes --]

Thanks for the comments.

From: Константин Ананьев <konstantin.v.ananyev@yandex.ru>
Sent: Monday, June 5, 2023 8:54 PM
To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net; Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
Subject: Re: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode



Hi Feifei,

few more comments from me, see below.

Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
APIs to recycle used mbufs from a transmit queue of an Ethernet device,
and move these mbufs into a mbuf ring for a receive queue of an Ethernet
device. This can bypass mempool 'put/get' operations hence saving CPU
cycles.

For each recycling mbufs, the rte_eth_recycle_mbufs() function performs
the following operations:
- Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf
ring.
- Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
from the Tx mbuf ring.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com<mailto:honnappa.nagarahalli@arm.com>>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com<mailto:ruifeng.wang@arm.com>>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com<mailto:feifei.wang2@arm.com>>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com<mailto:ruifeng.wang@arm.com>>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com<mailto:honnappa.nagarahalli@arm.com>>
---
 doc/guides/rel_notes/release_23_07.rst | 7 +
 lib/ethdev/ethdev_driver.h | 10 ++
 lib/ethdev/ethdev_private.c | 2 +
 lib/ethdev/rte_ethdev.c | 31 +++++
 lib/ethdev/rte_ethdev.h | 182 +++++++++++++++++++++++++
 lib/ethdev/rte_ethdev_core.h | 15 +-
 lib/ethdev/version.map | 4 +
 7 files changed, 249 insertions(+), 2 deletions(-)

diff --git a/doc/guides/rel_notes/release_23_07.rst b/doc/guides/rel_notes/release_23_07.rst
index a9b1293689..f279036cb9 100644
--- a/doc/guides/rel_notes/release_23_07.rst
+++ b/doc/guides/rel_notes/release_23_07.rst
@@ -55,6 +55,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================

+* **Add mbufs recycling support. **
+ Added ``rte_eth_recycle_rx_queue_info_get`` and ``rte_eth_recycle_mbufs``
+ APIs which allow the user to copy used mbufs from the Tx mbuf ring
+ into the Rx mbuf ring. This feature supports the case that the Rx Ethernet
+ device is different from the Tx Ethernet device with respective driver
+ callback functions in ``rte_eth_recycle_mbufs``.
+

 Removed Items
 -------------
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 2c9d615fb5..c6723d5277 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -59,6 +59,10 @@ struct rte_eth_dev {
         eth_rx_descriptor_status_t rx_descriptor_status;
         /** Check the status of a Tx descriptor */
         eth_tx_descriptor_status_t tx_descriptor_status;
+ /** Pointer to PMD transmit mbufs reuse function */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+ /** Pointer to PMD receive descriptors refill function */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;

         /**
          * Device data that is shared between primary and secondary processes
@@ -504,6 +508,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
 typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
         uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);

+typedef void (*eth_recycle_rxq_info_get_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
 typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
         uint16_t queue_id, struct rte_eth_burst_mode *mode);

@@ -1247,6 +1255,8 @@ struct eth_dev_ops {
         eth_rxq_info_get_t rxq_info_get;
         /** Retrieve Tx queue information */
         eth_txq_info_get_t txq_info_get;
+ /** Retrieve mbufs recycle Rx queue information */
+ eth_recycle_rxq_info_get_t recycle_rxq_info_get;
         eth_burst_mode_get_t rx_burst_mode_get; /**< Get Rx burst mode */
         eth_burst_mode_get_t tx_burst_mode_get; /**< Get Tx burst mode */
         eth_fw_version_get_t fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 14ec8c6ccf..f8ab64f195 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
         fpo->rx_queue_count = dev->rx_queue_count;
         fpo->rx_descriptor_status = dev->rx_descriptor_status;
         fpo->tx_descriptor_status = dev->tx_descriptor_status;
+ fpo->recycle_tx_mbufs_reuse = dev->recycle_tx_mbufs_reuse;
+ fpo->recycle_rx_descriptors_refill = dev->recycle_rx_descriptors_refill;

         fpo->rxq.data = dev->data->rx_queues;
         fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 4d03255683..7c27dcfea4 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -5784,6 +5784,37 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
         return 0;
 }

+int
+rte_eth_recycle_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_dev *dev;
+
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+ dev = &rte_eth_devices[port_id];
+
+ if (queue_id >= dev->data->nb_rx_queues) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+ return -EINVAL;
+ }
+
+ if (dev->data->rx_queues == NULL ||
+ dev->data->rx_queues[queue_id] == NULL) {
+ RTE_ETHDEV_LOG(ERR,
+ "Rx queue %"PRIu16" of device with port_id=%"
+ PRIu16" has not been setup\n",
+ queue_id, port_id);
+ return -EINVAL;
+ }
+
+ if (*dev->dev_ops->recycle_rxq_info_get == NULL)
+ return -ENOTSUP;
+
+ dev->dev_ops->recycle_rxq_info_get(dev, queue_id, recycle_rxq_info);
+
+ return 0;
+}
+
 int
 rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
                           struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 99fe9e238b..7434aa2483 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1820,6 +1820,30 @@ struct rte_eth_txq_info {
         uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
 } __rte_cache_min_aligned;

+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice.
+ *
+ * Ethernet device Rx queue information structure for recycling mbufs.
+ * Used to retrieve Rx queue information when Tx queue reusing mbufs and moving
+ * them into Rx mbuf ring.
+ */
+struct rte_eth_recycle_rxq_info {
+ struct rte_mbuf **mbuf_ring; /**< mbuf ring of Rx queue. */
+ struct rte_mempool *mp; /**< mempool of Rx queue. */
+ uint16_t *refill_head; /**< head of Rx queue refilling mbufs. */
+ uint16_t *receive_tail; /**< tail of Rx queue receiving pkts. */
+ uint16_t mbuf_ring_size; /**< configured number of mbuf ring size. */
+ /**
+ * Requirement on mbuf refilling batch size of Rx mbuf ring.
+ * For some PMD drivers, the number of Rx mbuf ring refilling mbufs
+ * should be aligned with mbuf ring size, in order to simplify
+ * ring wrapping around.
+ * Value 0 means that PMD drivers have no requirement for this.
+ */
+ uint16_t refill_requirement;
+} __rte_cache_min_aligned;
+
 /* Generic Burst mode flag definition, values can be ORed. */

 /**
@@ -4809,6 +4833,31 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
 int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
         struct rte_eth_txq_info *qinfo);

+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Retrieve information about given ports's Rx queue for recycling mbufs.
+ *
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The Rx queue on the Ethernet devicefor which information
+ * will be retrieved.
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* to be filled.
+ *
+ * @return
+ * - 0: Success
+ * - -ENODEV: If *port_id* is invalid.
+ * - -ENOTSUP: routine is not supported by the device PMD.
+ * - -EINVAL: The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_recycle_rx_queue_info_get(uint16_t port_id,
+ uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
 /**
  * Retrieve information about the Rx packet burst mode.
  *
@@ -6483,6 +6532,139 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
         return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
 }

+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Recycle used mbufs from a transmit queue of an Ethernet device, and move
+ * these mbufs into a mbuf ring for a receive queue of an Ethernet device.
+ * This can bypass mempool path to save CPU cycles.
+ *
+ * The rte_eth_recycle_mbufs() function loops, with rte_eth_rx_burst() and
+ * rte_eth_tx_burst() functions, freeing Tx used mbufs and replenishing Rx
+ * descriptors. The number of recycling mbufs depends on the request of Rx mbuf
+ * ring, with the constraint of enough used mbufs from Tx mbuf ring.
+ *
+ * For each recycling mbufs, the rte_eth_recycle_mbufs() function performs the
+ * following operations:
+ *
+ * - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf ring.
+ *
+ * - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
+ * from the Tx mbuf ring.
+ *
+ * This function splits the Rx and Tx paths with different callback functions. The
+ * callback function recycle_tx_mbufs_reuse is for Tx driver. The callback
+ * function recycle_rx_descriptors_refill is for Rx driver. rte_eth_recycle_mbufs()
+ * can support the case that Rx Ethernet device is different from Tx Ethernet device.
+ *
+ * It is the responsibility of users to select the Rx/Tx queue pair to recycle
+ * mbufs. Before call this function, users must call rte_eth_recycle_rxq_info_get
+ * function to retrieve selected Rx queue information.
+ * @see rte_eth_recycle_rxq_info_get, struct rte_eth_recycle_rxq_info
+ *
+ * Currently, the rte_eth_recycle_mbufs() function can only support one-time pairing
+ * between the receive queue and transmit queue. Do not pair one receive queue with
+ * multiple transmit queues or pair one transmit queue with multiple receive queues,
+ * in order to avoid memory error rewriting.
Probably I am missing something, but why it is not possible to do something like that:

rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=M, ...);
....
rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=K, ...);

I.E. feed rx queue from 2 tx queues?

Two problems with this:

1. If one Rx queue could take buffers from two Tx queues, the thread would have to make an extra
decision in the driver layer about which Tx queue to pull from. With the current mechanism, the
application can switch from one Tx queue to another at any time simply by passing a different
tx_queue_id to the API, as sketched below.

2. If one Rx queue were fed by two Tx queues at the same time, a spinlock would be needed on the
guard variables to avoid multi-thread conflicts, and a spinlock would greatly decrease data-path
performance. Thus, we do not consider mapping one Rx queue to multiple Tx queues here.
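
As a sketch of that application-level switch (the port/queue ids X, Y, N, M and K follow the
notation above and are placeholders; rte_eth_recycle_rx_queue_info_get() is assumed to have been
called for Rx queue (X, Y) beforehand):

	/* Recycle mbufs from Tx queue (N, M) into Rx queue (X, Y). */
	rte_eth_recycle_mbufs(X, Y, N, M, &recycle_rxq_info);

	/* Later, switch to Tx queue (N, K) simply by passing a different
	 * tx_queue_id; no per-packet branching is added inside the driver.
	 */
	rte_eth_recycle_mbufs(X, Y, N, K, &recycle_rxq_info);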

+ *
+ * @param rx_port_id
+ * Port identifying the receive side.
+ * @param rx_queue_id
+ * The index of the receive queue identifying the receive side.
+ * The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param tx_port_id
+ * Port identifying the transmit side.
+ * @param tx_queue_id
+ * The index of the transmit queue identifying the transmit side.
+ * The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* which contains
+ * the information of the Rx queue mbuf ring.
+ * @return
+ * The number of recycling mbufs.
+ */
+__rte_experimental
+static inline uint16_t
+rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
+ uint16_t tx_port_id, uint16_t tx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_fp_ops *p;
+ void *qd;
+ uint16_t nb_mbufs;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ if (tx_port_id >= RTE_MAX_ETHPORTS ||
+ tx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR,
+ "Invalid tx_port_id=%u or tx_queue_id=%u\n",
+ tx_port_id, tx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[tx_port_id];
+ qd = p->txq.data[tx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+ tx_queue_id, tx_port_id);
+ return 0;
+ }
+#endif
+ if (p->recycle_tx_mbufs_reuse == NULL)
+ return 0;
+
+ /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
+ * into Rx mbuf ring.
+ */
+ nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
+
+ /* If no recycling mbufs, return 0. */
+ if (nb_mbufs == 0)
+ return 0;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ if (rx_port_id >= RTE_MAX_ETHPORTS ||
+ rx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR, "Invalid rx_port_id=%u or rx_queue_id=%u\n",
+ rx_port_id, rx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[rx_port_id];
+ qd = p->rxq.data[rx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+ rx_queue_id, rx_port_id);
+ return 0;
+ }
+#endif
+
+ if (p->recycle_rx_descriptors_refill == NULL)
+ return 0;
+
+ /* Replenish the Rx descriptors with the recycling
+ * into Rx mbuf ring.
+ */
+ p->recycle_rx_descriptors_refill(qd, nb_mbufs);
+
+ return nb_mbufs;
+}
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index dcf8adab92..a2e6ea6b6c 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
 /** @internal Check the status of a Tx descriptor */
 typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);

+/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
+typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
+/** @internal Refill Rx descriptors with the recycling mbufs */
+typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
+
 /**
  * @internal
  * Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -90,9 +97,11 @@ struct rte_eth_fp_ops {
         eth_rx_queue_count_t rx_queue_count;
         /** Check the status of a Rx descriptor. */
         eth_rx_descriptor_status_t rx_descriptor_status;
+ /** Refill Rx descriptors with the recycling mbufs. */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
I am afraid we can't put new fields here without ABI breakage.

Agree

It has to be below rxq.
Now thinking about current layout probably not the best one,
and when introducing this struct, I should probably put rxq either
on the top of the struct, or on the next cache line.
But such change is not possible right now anyway.
Same story for txq.

Thus we should rearrange the structure like below:
struct rte_eth_fp_ops {
	struct rte_ethdev_qdata rxq;
	eth_rx_burst_t rx_pkt_burst;
	eth_rx_queue_count_t rx_queue_count;
	eth_rx_descriptor_status_t rx_descriptor_status;
	eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
	uintptr_t reserved1[2];
};



         /** Rx queues data. */
         struct rte_ethdev_qdata rxq;
- uintptr_t reserved1[3];
+ uintptr_t reserved1[2];
         /**@}*/

         /**@{*/
@@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
         eth_tx_prep_t tx_pkt_prepare;
         /** Check the status of a Tx descriptor. */
         eth_tx_descriptor_status_t tx_descriptor_status;
+ /** Copy used mbufs from Tx mbuf ring into Rx. */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
         /** Tx queues data. */
         struct rte_ethdev_qdata txq;
- uintptr_t reserved2[3];
+ uintptr_t reserved2[2];
         /**@}*/

 } __rte_cache_aligned;
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 357d1a88c0..45c417f6bd 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -299,6 +299,10 @@ EXPERIMENTAL {
         rte_flow_action_handle_query_update;
         rte_flow_async_action_handle_query_update;
         rte_flow_async_create_by_index;
+
+ # added in 23.07
+ rte_eth_recycle_mbufs;
+ rte_eth_recycle_rx_queue_info_get;
 };

 INTERNAL {
--
2.25.1


[-- Attachment #2: Type: text/html, Size: 29050 bytes --]

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 2/4] net/i40e: implement mbufs recycle mode
  2023-06-05 13:02     ` Константин Ананьев
@ 2023-06-06  3:16       ` Feifei Wang
  2023-06-06  7:18         ` Konstantin Ananyev
  0 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2023-06-06  3:16 UTC (permalink / raw)
  To: Константин
	Ананьев,
	Yuying Zhang, Beilei Xing
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd

[-- Attachment #1: Type: text/plain, Size: 12713 bytes --]



From: Константин Ананьев <konstantin.v.ananyev@yandex.ru>
Sent: Monday, June 5, 2023 9:03 PM
To: Feifei Wang <Feifei.Wang2@arm.com>; Yuying Zhang <yuying.zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>
Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
Subject: Re: [PATCH v6 2/4] net/i40e: implement mbufs recycle mode





Define specific function implementation for i40e driver.
Currently, mbufs recycle mode can support 128bit
vector path and avx2 path. And can be enabled both in
fast free and no fast free mode.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com<mailto:honnappa.nagarahalli@arm.com>>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com<mailto:feifei.wang2@arm.com>>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com<mailto:ruifeng.wang@arm.com>>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com<mailto:honnappa.nagarahalli@arm.com>>
---
 drivers/net/i40e/i40e_ethdev.c | 1 +
 drivers/net/i40e/i40e_ethdev.h | 2 +
 .../net/i40e/i40e_recycle_mbufs_vec_common.c | 140 ++++++++++++++++++
 drivers/net/i40e/i40e_rxtx.c | 32 ++++
 drivers/net/i40e/i40e_rxtx.h | 4 +
 drivers/net/i40e/meson.build | 2 +
 6 files changed, 181 insertions(+)
 create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index f9d8f9791f..d4eecd16cf 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
         .flow_ops_get = i40e_dev_flow_ops_get,
         .rxq_info_get = i40e_rxq_info_get,
         .txq_info_get = i40e_txq_info_get,
+ .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
         .rx_burst_mode_get = i40e_rx_burst_mode_get,
         .tx_burst_mode_get = i40e_tx_burst_mode_get,
         .timesync_enable = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 9b806d130e..b5b2d6cf2b 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
         struct rte_eth_rxq_info *qinfo);
 void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
         struct rte_eth_txq_info *qinfo);
+void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
 int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
                            struct rte_eth_burst_mode *mode);
 int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..08d708fd7d
--- /dev/null
+++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
@@ -0,0 +1,140 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "base/i40e_prototype.h"
+#include "base/i40e_type.h"
+#include "i40e_ethdev.h"
+#include "i40e_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+ struct i40e_rx_queue *rxq = rx_queue;
+ struct i40e_rx_entry *rxep;
+ volatile union i40e_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64_t paddr;
+ uint64_t dma_addr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_mbufs; i++) {
+ /* Initialize rxdp descs. */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = rte_cpu_to_le_64(paddr);
+ /* flush desc with pa dma_addr */
+ rxdp[i].read.hdr_addr = 0;
+ rxdp[i].read.pkt_addr = dma_addr;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_mbufs;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = 0;
+ rx_id = rxq->nb_rx_desc - 1;
+ }
+
+ rxq->rxrearm_nb -= nb_mbufs;
+
+ rte_io_wmb();
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
+uint16_t
+i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_tx_queue *txq = tx_queue;
+ struct i40e_tx_entry *txep;
+ struct rte_mbuf **rxep;
+ struct rte_mbuf *m[RTE_I40E_TX_MAX_FREE_BUF_SZ];
+ int i, j, n;
+ uint16_t avail = 0;
+ uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+ uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+ uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+ uint16_t refill_head = *recycle_rxq_info->refill_head;
+ uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+ /* Get available recycling Rx buffers. */
+ avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+ /* Check Tx free thresh and Rx available space. */
+ if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+
+ /* Mbufs recycle mode only supports the case where the mbuf ring does not wrap around.
+ * There are two cases for this:
+ *
+ * case 1: The refill head of Rx buffer ring needs to be aligned with
+ * mbuf ring size. In this case, the number of Tx freeing buffers
+ * should be equal to refill_requirement.
+ *
+ * case 2: The refill head of Rx ring buffer does not need to be aligned
+ * with mbuf ring size. In this case, the update of refill head can not
+ * exceed the Rx mbuf ring size.
+ */
+ if (refill_requirement != n ||
+ (!refill_requirement && (refill_head + n > mbuf_ring_size)))
+ return 0;
+
+ /* First buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1).
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = recycle_rxq_info->mbuf_ring;
+ rxep += refill_head;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* Directly put mbufs from Tx to Rx. */
+ for (i = 0; i < n; i++, rxep++, txep++)
+ *rxep = txep[0].mbuf;
+ } else {
+ for (i = 0, j = 0; i < n; i++) {
+ /* Avoid the case where txq contains buffers from an unexpected mempool. */
+ if (unlikely(recycle_rxq_info->mp
+ != txep[i].mbuf->pool))
+ return 0;
I don't think that it is possible to simply return 0 here:
we might already have some mbufs inside rxep[], so we probably need
to return that number (j).

No, this is just a pre-free step; the mbufs are not actually put into rxep yet.
After the loop completes, we call rte_memcpy to actually copy the
mbufs into rxep.

+
+ m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+ /* In case 1, each of Tx buffers should be the
+ * last reference.
+ */
+ if (unlikely(m[j] == NULL && refill_requirement))
+ return 0;

same here, we can't simply return 0, it will introduce mbuf leakage.

+ /* In case 2, the number of valid Tx free
+ * buffers should be recorded.
+ */
+ j++;
+ }
+ rte_memcpy(rxep, m, sizeof(void *) * j);
Wonder why do you need intermediate buffer for released mbufs?
Why can't just:
...
m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
...
rxep[j++] = m;
?
Might save you few extra cycles.
Sometimes ‘rte_pktmbuf_prefree_seg’ can return NULL, because
mbuf->refcnt > 1. So we should first ensure all ‘m’ are valid and
only then copy them into rxep.
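
For illustration, a minimal sketch of that direct-write alternative would be
(variable names follow the patch above; this is only a sketch and leaves out
the refill_requirement handling):

	for (i = 0, j = 0; i < n; i++) {
		struct rte_mbuf *m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
		if (m != NULL)
			rxep[j++] = m;	/* write straight into the Rx mbuf ring */
	}
	/* With a fixed refill_requirement, a partial batch (j < n) cannot be
	 * committed to the Rx ring, which is why the patch stages the mbufs
	 * in m[] first and only copies them once the whole batch is known
	 * to be valid.
	 */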

+ }
+
+ /* Update counters for Tx. */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return n;
+}
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..53cf787f04 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3197,6 +3197,30 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
         qinfo->conf.offloads = txq->offloads;
 }

+void
+i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_rx_queue *rxq;
+ struct i40e_adapter *ad =
+ I40E_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+ recycle_rxq_info->mp = rxq->mp;
+ recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+ recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+ if (ad->rx_vec_allowed) {
+ recycle_rxq_info->refill_requirement = RTE_I40E_RXQ_REARM_THRESH;
+ recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+ } else {
+ recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+ recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+ }
+}
+
 #ifdef RTE_ARCH_X86
 static inline bool
 get_avx_supported(bool request_avx512)
@@ -3291,6 +3315,8 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
                                 dev->rx_pkt_burst = ad->rx_use_avx2 ?
                                         i40e_recv_scattered_pkts_vec_avx2 :
                                         i40e_recv_scattered_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
                         }
                 } else {
                         if (ad->rx_use_avx512) {
@@ -3309,9 +3335,12 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
                                 dev->rx_pkt_burst = ad->rx_use_avx2 ?
                                         i40e_recv_pkts_vec_avx2 :
                                         i40e_recv_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
                         }
                 }
 #else /* RTE_ARCH_X86 */
+ dev->recycle_rx_descriptors_refill = i40e_recycle_rx_descriptors_refill_vec;
                 if (dev->data->scattered_rx) {
                         PMD_INIT_LOG(DEBUG,
                                      "Using Vector Scattered Rx (port %d).",
@@ -3479,15 +3508,18 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
                                 dev->tx_pkt_burst = ad->tx_use_avx2 ?
                                                     i40e_xmit_pkts_vec_avx2 :
                                                     i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
                         }
 #else /* RTE_ARCH_X86 */
                         PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
                                      dev->data->port_id);
                         dev->tx_pkt_burst = i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
 #endif /* RTE_ARCH_X86 */
                 } else {
                         PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
                         dev->tx_pkt_burst = i40e_xmit_pkts_simple;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
                 }
                 dev->tx_pkt_prepare = i40e_simple_prep_pkts;
         } else {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 5e6eecc501..ed8921ddc0 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -233,6 +233,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
 int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
 int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);

+uint16_t i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
 uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
                             uint16_t nb_pkts);
 uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
diff --git a/drivers/net/i40e/meson.build b/drivers/net/i40e/meson.build
index 8e53b87a65..58eb627abc 100644
--- a/drivers/net/i40e/meson.build
+++ b/drivers/net/i40e/meson.build
@@ -42,6 +42,8 @@ testpmd_sources = files('i40e_testpmd.c')
 deps += ['hash']
 includes += include_directories('base')

+sources += files('i40e_recycle_mbufs_vec_common.c')
+
 if arch_subdir == 'x86'
     sources += files('i40e_rxtx_vec_sse.c')

--
2.25.1



^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 4/4] app/testpmd: add recycle mbufs engine
  2023-06-05 13:08     ` Константин Ананьев
@ 2023-06-06  6:32       ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-06-06  6:32 UTC (permalink / raw)
  To: Константин
	Ананьев,
	Aman Singh, Yuying Zhang
  Cc: dev, nd, Jerin Jacob, Ruifeng Wang, nd




From: Константин Ананьев <konstantin.v.ananyev@yandex.ru>
Sent: Monday, June 5, 2023 9:08 PM
To: Feifei Wang <Feifei.Wang2@arm.com>; Aman Singh <aman.deep.singh@intel.com>; Yuying Zhang <yuying.zhang@intel.com>
Cc: dev@dpdk.org; nd <nd@arm.com>; Jerin Jacob <jerinjacobk@gmail.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
Subject: Re: [PATCH v6 4/4] app/testpmd: add recycle mbufs engine

[…]

+static void
+recycle_mbufs_stream_init(struct fwd_stream *fs)
+{
+ /* Retrieve information about given ports's Rx queue
+ * for recycling mbufs.
+ */
+ rte_eth_recycle_rx_queue_info_get(fs->rx_port, fs->rx_queue,
+ &(fs->recycle_rxq_info));

[Konstantin] We probably should check the return status and complain about failure.
[Feifei]  Agree. However, the testpmd ‘stream_init’ function returns void. Here,
if an error happens, all we can do is complain about the failure.
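
For illustration, a minimal sketch of such a check could look like the
following (the TESTPMD_LOG macro is assumed here for the error message):

	static void
	recycle_mbufs_stream_init(struct fwd_stream *fs)
	{
		int ret;

		/* Retrieve information about the given port's Rx queue
		 * for recycling mbufs.
		 */
		ret = rte_eth_recycle_rx_queue_info_get(fs->rx_port, fs->rx_queue,
				&(fs->recycle_rxq_info));
		if (ret != 0)
			TESTPMD_LOG(ERR,
				"Failed to get recycle info for port %u Rx queue %u: %d\n",
				fs->rx_port, fs->rx_queue, ret);
	}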




^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-06-06  2:55       ` Feifei Wang
@ 2023-06-06  7:10         ` Konstantin Ananyev
  2023-06-06  7:31           ` Feifei Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Konstantin Ananyev @ 2023-06-06  7:10 UTC (permalink / raw)
  To: Feifei Wang,
	Константин
	Ананьев,
	thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd

> 
> Thanks for the comments.
> 
> From: Константин Ананьев <mailto:konstantin.v.ananyev@yandex.ru>
> Sent: Monday, June 5, 2023 8:54 PM
> To: Feifei Wang <mailto:Feifei.Wang2@arm.com>; mailto:thomas@monjalon.net; Ferruh Yigit <mailto:ferruh.yigit@amd.com>;
> Andrew Rybchenko <mailto:andrew.rybchenko@oktetlabs.ru>
> Cc: mailto:dev@dpdk.org; nd <mailto:nd@arm.com>; Honnappa Nagarahalli <mailto:Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <mailto:Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
> 
> 
> 
> Hi Feifei,
> 
> few more comments from me, see below.
> Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
> APIs to recycle used mbufs from a transmit queue of an Ethernet device,
> and move these mbufs into a mbuf ring for a receive queue of an Ethernet
> device. This can bypass mempool 'put/get' operations hence saving CPU
> cycles.
> 
> For each recycling mbufs, the rte_eth_recycle_mbufs() function performs
> the following operations:
> - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf
> ring.
> - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
> from the Tx mbuf ring.
> 
> Suggested-by: Honnappa Nagarahalli <mailto:honnappa.nagarahalli@arm.com>
> Suggested-by: Ruifeng Wang <mailto:ruifeng.wang@arm.com>
> Signed-off-by: Feifei Wang <mailto:feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <mailto:ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <mailto:honnappa.nagarahalli@arm.com>
> ---
>  doc/guides/rel_notes/release_23_07.rst | 7 +
>  lib/ethdev/ethdev_driver.h | 10 ++
>  lib/ethdev/ethdev_private.c | 2 +
>  lib/ethdev/rte_ethdev.c | 31 +++++
>  lib/ethdev/rte_ethdev.h | 182 +++++++++++++++++++++++++
>  lib/ethdev/rte_ethdev_core.h | 15 +-
>  lib/ethdev/version.map | 4 +
>  7 files changed, 249 insertions(+), 2 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/release_23_07.rst b/doc/guides/rel_notes/release_23_07.rst
> index a9b1293689..f279036cb9 100644
> --- a/doc/guides/rel_notes/release_23_07.rst
> +++ b/doc/guides/rel_notes/release_23_07.rst
> @@ -55,6 +55,13 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
> 
> +* **Add mbufs recycling support. **
> + Added ``rte_eth_recycle_rx_queue_info_get`` and ``rte_eth_recycle_mbufs``
> + APIs which allow the user to copy used mbufs from the Tx mbuf ring
> + into the Rx mbuf ring. This feature supports the case that the Rx Ethernet
> + device is different from the Tx Ethernet device with respective driver
> + callback functions in ``rte_eth_recycle_mbufs``.
> +
> 
>  Removed Items
>  -------------
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 2c9d615fb5..c6723d5277 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -59,6 +59,10 @@ struct rte_eth_dev {
>          eth_rx_descriptor_status_t rx_descriptor_status;
>          /** Check the status of a Tx descriptor */
>          eth_tx_descriptor_status_t tx_descriptor_status;
> + /** Pointer to PMD transmit mbufs reuse function */
> + eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
> + /** Pointer to PMD receive descriptors refill function */
> + eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
> 
>          /**
>           * Device data that is shared between primary and secondary processes
> @@ -504,6 +508,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
>  typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
>          uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
> 
> +typedef void (*eth_recycle_rxq_info_get_t)(struct rte_eth_dev *dev,
> + uint16_t rx_queue_id,
> + struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> +
>  typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
>          uint16_t queue_id, struct rte_eth_burst_mode *mode);
> 
> @@ -1247,6 +1255,8 @@ struct eth_dev_ops {
>          eth_rxq_info_get_t rxq_info_get;
>          /** Retrieve Tx queue information */
>          eth_txq_info_get_t txq_info_get;
> + /** Retrieve mbufs recycle Rx queue information */
> + eth_recycle_rxq_info_get_t recycle_rxq_info_get;
>          eth_burst_mode_get_t rx_burst_mode_get; /**< Get Rx burst mode */
>          eth_burst_mode_get_t tx_burst_mode_get; /**< Get Tx burst mode */
>          eth_fw_version_get_t fw_version_get; /**< Get firmware version */
> diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> index 14ec8c6ccf..f8ab64f195 100644
> --- a/lib/ethdev/ethdev_private.c
> +++ b/lib/ethdev/ethdev_private.c
> @@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
>          fpo->rx_queue_count = dev->rx_queue_count;
>          fpo->rx_descriptor_status = dev->rx_descriptor_status;
>          fpo->tx_descriptor_status = dev->tx_descriptor_status;
> + fpo->recycle_tx_mbufs_reuse = dev->recycle_tx_mbufs_reuse;
> + fpo->recycle_rx_descriptors_refill = dev->recycle_rx_descriptors_refill;
> 
>          fpo->rxq.data = dev->data->rx_queues;
>          fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 4d03255683..7c27dcfea4 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -5784,6 +5784,37 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
>          return 0;
>  }
> 
> +int
> +rte_eth_recycle_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
> + struct rte_eth_recycle_rxq_info *recycle_rxq_info)
> +{
> + struct rte_eth_dev *dev;
> +
> + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> + dev = &rte_eth_devices[port_id];
> +
> + if (queue_id >= dev->data->nb_rx_queues) {
> + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
> + return -EINVAL;
> + }
> +
> + if (dev->data->rx_queues == NULL ||
> + dev->data->rx_queues[queue_id] == NULL) {
> + RTE_ETHDEV_LOG(ERR,
> + "Rx queue %"PRIu16" of device with port_id=%"
> + PRIu16" has not been setup\n",
> + queue_id, port_id);
> + return -EINVAL;
> + }
> +
> + if (*dev->dev_ops->recycle_rxq_info_get == NULL)
> + return -ENOTSUP;
> +
> + dev->dev_ops->recycle_rxq_info_get(dev, queue_id, recycle_rxq_info);
> +
> + return 0;
> +}
> +
>  int
>  rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
>                            struct rte_eth_burst_mode *mode)
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 99fe9e238b..7434aa2483 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1820,6 +1820,30 @@ struct rte_eth_txq_info {
>          uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
>  } __rte_cache_min_aligned;
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this structure may change without prior notice.
> + *
> + * Ethernet device Rx queue information structure for recycling mbufs.
> + * Used to retrieve Rx queue information when Tx queue reusing mbufs and moving
> + * them into Rx mbuf ring.
> + */
> +struct rte_eth_recycle_rxq_info {
> + struct rte_mbuf **mbuf_ring; /**< mbuf ring of Rx queue. */
> + struct rte_mempool *mp; /**< mempool of Rx queue. */
> + uint16_t *refill_head; /**< head of Rx queue refilling mbufs. */
> + uint16_t *receive_tail; /**< tail of Rx queue receiving pkts. */
> + uint16_t mbuf_ring_size; /**< configured number of mbuf ring size. */
> + /**
> + * Requirement on mbuf refilling batch size of Rx mbuf ring.
> + * For some PMD drivers, the number of Rx mbuf ring refilling mbufs
> + * should be aligned with mbuf ring size, in order to simplify
> + * ring wrapping around.
> + * Value 0 means that PMD drivers have no requirement for this.
> + */
> + uint16_t refill_requirement;
> +} __rte_cache_min_aligned;
> +
>  /* Generic Burst mode flag definition, values can be ORed. */
> 
>  /**
> @@ -4809,6 +4833,31 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
>  int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
>          struct rte_eth_txq_info *qinfo);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Retrieve information about given ports's Rx queue for recycling mbufs.
> + *
> + * @param port_id
> + * The port identifier of the Ethernet device.
> + * @param queue_id
> + * The Rx queue on the Ethernet devicefor which information
> + * will be retrieved.
> + * @param recycle_rxq_info
> + * A pointer to a structure of type *rte_eth_recycle_rxq_info* to be filled.
> + *
> + * @return
> + * - 0: Success
> + * - -ENODEV: If *port_id* is invalid.
> + * - -ENOTSUP: routine is not supported by the device PMD.
> + * - -EINVAL: The queue_id is out of range.
> + */
> +__rte_experimental
> +int rte_eth_recycle_rx_queue_info_get(uint16_t port_id,
> + uint16_t queue_id,
> + struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> +
>  /**
>   * Retrieve information about the Rx packet burst mode.
>   *
> @@ -6483,6 +6532,139 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
>          return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
>  }
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Recycle used mbufs from a transmit queue of an Ethernet device, and move
> + * these mbufs into a mbuf ring for a receive queue of an Ethernet device.
> + * This can bypass mempool path to save CPU cycles.
> + *
> + * The rte_eth_recycle_mbufs() function loops, with rte_eth_rx_burst() and
> + * rte_eth_tx_burst() functions, freeing Tx used mbufs and replenishing Rx
> + * descriptors. The number of recycling mbufs depends on the request of Rx mbuf
> + * ring, with the constraint of enough used mbufs from Tx mbuf ring.
> + *
> + * For each recycling mbufs, the rte_eth_recycle_mbufs() function performs the
> + * following operations:
> + *
> + * - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf ring.
> + *
> + * - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
> + * from the Tx mbuf ring.
> + *
> + * This function spilts Rx and Tx path with different callback functions. The
> + * callback function recycle_tx_mbufs_reuse is for Tx driver. The callback
> + * function recycle_rx_descriptors_refill is for Rx driver. rte_eth_recycle_mbufs()
> + * can support the case that Rx Ethernet device is different from Tx Ethernet device.
> + *
> + * It is the responsibility of users to select the Rx/Tx queue pair to recycle
> + * mbufs. Before call this function, users must call rte_eth_recycle_rxq_info_get
> + * function to retrieve selected Rx queue information.
> + * @see rte_eth_recycle_rxq_info_get, struct rte_eth_recycle_rxq_info
> + *
> + * Currently, the rte_eth_recycle_mbufs() function can only support one-time pairing
> + * between the receive queue and transmit queue. Do not pair one receive queue with
> + * multiple transmit queues or pair one transmit queue with multiple receive queues,
> + * in order to avoid memory error rewriting.
> Probably I am missing something, but why it is not possible to do something like that:
> 
> rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=M, ...);
> ....
> rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=K, ...);
> 
> I.E. feed rx queue from 2 tx queues?
> 
> Two problems for this:
> 1. If we have 2 tx queues for rx, the thread should make the extra judgement to
> decide which one to choose in the driver layer.

Not sure why this would be at the driver layer.
In the example I gave above, the decision is made at the application layer.
Let's say the first call didn't free enough mbufs, so the app decides to use a second txq for rearm.

> On the other hand, current mechanism can support users to switch 1 txq to another timely
> in the application layer. If user want to choose another txq, he just need to change the txq_queue_id parameter
> in the API.
> 2. If you want one rxq to support two txq at the same time, this needs to add spinlock on guard variable to
> avoid multi-thread conflict. Spinlock will decrease the data-path performance greatly.  Thus, we do not consider
> 1 rxq mapping multiple txqs here.
 
I am talking about a situation where one thread controls 2 Tx queues.

> + *
> + * @param rx_port_id
> + * Port identifying the receive side.
> + * @param rx_queue_id
> + * The index of the receive queue identifying the receive side.
> + * The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + * @param tx_port_id
> + * Port identifying the transmit side.
> + * @param tx_queue_id
> + * The index of the transmit queue identifying the transmit side.
> + * The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + * @param recycle_rxq_info
> + * A pointer to a structure of type *rte_eth_recycle_rxq_info* which contains
> + * the information of the Rx queue mbuf ring.
> + * @return
> + * The number of recycling mbufs.
> + */
> +__rte_experimental
> +static inline uint16_t
> +rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_queue_id,
> + struct rte_eth_recycle_rxq_info *recycle_rxq_info)
> +{
> + struct rte_eth_fp_ops *p;
> + void *qd;
> + uint16_t nb_mbufs;
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> + if (tx_port_id >= RTE_MAX_ETHPORTS ||
> + tx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> + RTE_ETHDEV_LOG(ERR,
> + "Invalid tx_port_id=%u or tx_queue_id=%u\n",
> + tx_port_id, tx_queue_id);
> + return 0;
> + }
> +#endif
> +
> + /* fetch pointer to queue data */
> + p = &rte_eth_fp_ops[tx_port_id];
> + qd = p->txq.data[tx_queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> + RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
> +
> + if (qd == NULL) {
> + RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
> + tx_queue_id, tx_port_id);
> + return 0;
> + }
> +#endif
> + if (p->recycle_tx_mbufs_reuse == NULL)
> + return 0;
> +
> + /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
> + * into Rx mbuf ring.
> + */
> + nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
> +
> + /* If no recycling mbufs, return 0. */
> + if (nb_mbufs == 0)
> + return 0;
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> + if (rx_port_id >= RTE_MAX_ETHPORTS ||
> + rx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> + RTE_ETHDEV_LOG(ERR, "Invalid rx_port_id=%u or rx_queue_id=%u\n",
> + rx_port_id, rx_queue_id);
> + return 0;
> + }
> +#endif
> +
> + /* fetch pointer to queue data */
> + p = &rte_eth_fp_ops[rx_port_id];
> + qd = p->rxq.data[rx_queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> + RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
> +
> + if (qd == NULL) {
> + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
> + rx_queue_id, rx_port_id);
> + return 0;
> + }
> +#endif
> +
> + if (p->recycle_rx_descriptors_refill == NULL)
> + return 0;
> +
> + /* Replenish the Rx descriptors with the recycling
> + * into Rx mbuf ring.
> + */
> + p->recycle_rx_descriptors_refill(qd, nb_mbufs);
> +
> + return nb_mbufs;
> +}
> +
>  /**
>   * @warning
>   * @b EXPERIMENTAL: this API may change without prior notice
> diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> index dcf8adab92..a2e6ea6b6c 100644
> --- a/lib/ethdev/rte_ethdev_core.h
> +++ b/lib/ethdev/rte_ethdev_core.h
> @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
>  /** @internal Check the status of a Tx descriptor */
>  typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> 
> +/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
> +typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,
> + struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> +
> +/** @internal Refill Rx descriptors with the recycling mbufs */
> +typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
> +
>  /**
>   * @internal
>   * Structure used to hold opaque pointers to internal ethdev Rx/Tx
> @@ -90,9 +97,11 @@ struct rte_eth_fp_ops {
>          eth_rx_queue_count_t rx_queue_count;
>          /** Check the status of a Rx descriptor. */
>          eth_rx_descriptor_status_t rx_descriptor_status;
> + /** Refill Rx descriptors with the recycling mbufs. */
> + eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
> I am afraid we can't put new fields here without ABI breakage.
> 
> Agree
> 
> It has to be below rxq.
> Now thinking about current layout probably not the best one,
> and when introducing this struct, I should probably put rxq either
> on the top of the struct, or on the next cache line.
> But such change is not possible right now anyway.
> Same story for txq.
> 
> Thus we should rearrange the structure like below:
> struct rte_eth_fp_ops {
>     struct rte_ethdev_qdata rxq;
>          eth_rx_burst_t rx_pkt_burst;
>          eth_rx_queue_count_t rx_queue_count;
>          eth_rx_descriptor_status_t rx_descriptor_status;
>        eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
>               uintptr_t reserved1[2];
> }

Yes, I think such a layout would be better.
The only problem here is that we have to wait for 23.11 for that.
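
For completeness, the Tx half would presumably mirror it (a sketch only,
reusing the existing field names):

struct rte_eth_fp_ops {
        /* ... Rx half as above ... */

        /** Tx queues data. */
        struct rte_ethdev_qdata txq;
        eth_tx_burst_t tx_pkt_burst;
        eth_tx_prep_t tx_pkt_prepare;
        eth_tx_descriptor_status_t tx_descriptor_status;
        eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
        uintptr_t reserved2[2];
} __rte_cache_aligned;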

> 
> 
>          /** Rx queues data. */
>          struct rte_ethdev_qdata rxq;
> - uintptr_t reserved1[3];
> + uintptr_t reserved1[2];
>          /**@}*/
> 
>          /**@{*/
> @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
>          eth_tx_prep_t tx_pkt_prepare;
>          /** Check the status of a Tx descriptor. */
>          eth_tx_descriptor_status_t tx_descriptor_status;
> + /** Copy used mbufs from Tx mbuf ring into Rx. */
> + eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
>          /** Tx queues data. */
>          struct rte_ethdev_qdata txq;
> - uintptr_t reserved2[3];
> + uintptr_t reserved2[2];
>          /**@}*/
> 
>  } __rte_cache_aligned;
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 357d1a88c0..45c417f6bd 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -299,6 +299,10 @@ EXPERIMENTAL {
>          rte_flow_action_handle_query_update;
>          rte_flow_async_action_handle_query_update;
>          rte_flow_async_create_by_index;
> +
> + # added in 23.07
> + rte_eth_recycle_mbufs;
> + rte_eth_recycle_rx_queue_info_get;
>  };
> 
>  INTERNAL {
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 2/4] net/i40e: implement mbufs recycle mode
  2023-06-06  3:16       ` Feifei Wang
@ 2023-06-06  7:18         ` Konstantin Ananyev
  2023-06-06  7:58           ` Feifei Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Konstantin Ananyev @ 2023-06-06  7:18 UTC (permalink / raw)
  To: Feifei Wang,
	Константин
	Ананьев,
	Yuying Zhang, Beilei Xing
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd

> 
> Define specific function implementation for i40e driver.
> Currently, mbufs recycle mode can support 128bit
> vector path and avx2 path. And can be enabled both in
> fast free and no fast free mode.
> 
> Suggested-by: Honnappa Nagarahalli <mailto:honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <mailto:feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <mailto:ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <mailto:honnappa.nagarahalli@arm.com>
> ---
>  drivers/net/i40e/i40e_ethdev.c | 1 +
>  drivers/net/i40e/i40e_ethdev.h | 2 +
>  .../net/i40e/i40e_recycle_mbufs_vec_common.c | 140 ++++++++++++++++++
>  drivers/net/i40e/i40e_rxtx.c | 32 ++++
>  drivers/net/i40e/i40e_rxtx.h | 4 +
>  drivers/net/i40e/meson.build | 2 +
>  6 files changed, 181 insertions(+)
>  create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> 
> diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
> index f9d8f9791f..d4eecd16cf 100644
> --- a/drivers/net/i40e/i40e_ethdev.c
> +++ b/drivers/net/i40e/i40e_ethdev.c
> @@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
>          .flow_ops_get = i40e_dev_flow_ops_get,
>          .rxq_info_get = i40e_rxq_info_get,
>          .txq_info_get = i40e_txq_info_get,
> + .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
>          .rx_burst_mode_get = i40e_rx_burst_mode_get,
>          .tx_burst_mode_get = i40e_tx_burst_mode_get,
>          .timesync_enable = i40e_timesync_enable,
> diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
> index 9b806d130e..b5b2d6cf2b 100644
> --- a/drivers/net/i40e/i40e_ethdev.h
> +++ b/drivers/net/i40e/i40e_ethdev.h
> @@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
>          struct rte_eth_rxq_info *qinfo);
>  void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
>          struct rte_eth_txq_info *qinfo);
> +void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> + struct rte_eth_recycle_rxq_info *recycle_rxq_info);
>  int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
>                             struct rte_eth_burst_mode *mode);
>  int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> new file mode 100644
> index 0000000000..08d708fd7d
> --- /dev/null
> +++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> @@ -0,0 +1,140 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2023 Arm Limited.
> + */
> +
> +#include <stdint.h>
> +#include <ethdev_driver.h>
> +
> +#include "base/i40e_prototype.h"
> +#include "base/i40e_type.h"
> +#include "i40e_ethdev.h"
> +#include "i40e_rxtx.h"
> +
> +#pragma GCC diagnostic ignored "-Wcast-qual"
> +
> +void
> +i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
> +{
> + struct i40e_rx_queue *rxq = rx_queue;
> + struct i40e_rx_entry *rxep;
> + volatile union i40e_rx_desc *rxdp;
> + uint16_t rx_id;
> + uint64_t paddr;
> + uint64_t dma_addr;
> + uint16_t i;
> +
> + rxdp = rxq->rx_ring + rxq->rxrearm_start;
> + rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> + for (i = 0; i < nb_mbufs; i++) {
> + /* Initialize rxdp descs. */
> + paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
> + dma_addr = rte_cpu_to_le_64(paddr);
> + /* flush desc with pa dma_addr */
> + rxdp[i].read.hdr_addr = 0;
> + rxdp[i].read.pkt_addr = dma_addr;
> + }
> +
> + /* Update the descriptor initializer index */
> + rxq->rxrearm_start += nb_mbufs;
> + rx_id = rxq->rxrearm_start - 1;
> +
> + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> + rxq->rxrearm_start = 0;
> + rx_id = rxq->nb_rx_desc - 1;
> + }
> +
> + rxq->rxrearm_nb -= nb_mbufs;
> +
> + rte_io_wmb();
> + /* Update the tail pointer on the NIC */
> + I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
> +}
> +
> +uint16_t
> +i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
> + struct rte_eth_recycle_rxq_info *recycle_rxq_info)
> +{
> + struct i40e_tx_queue *txq = tx_queue;
> + struct i40e_tx_entry *txep;
> + struct rte_mbuf **rxep;
> + struct rte_mbuf *m[RTE_I40E_TX_MAX_FREE_BUF_SZ];
> + int i, j, n;
> + uint16_t avail = 0;
> + uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
> + uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
> + uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
> + uint16_t refill_head = *recycle_rxq_info->refill_head;
> + uint16_t receive_tail = *recycle_rxq_info->receive_tail;
> +
> + /* Get available recycling Rx buffers. */
> + avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
> +
> + /* Check Tx free thresh and Rx available space. */
> + if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
> + return 0;
> +
> + /* check DD bits on threshold descriptor */
> + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> + rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> + return 0;
> +
> + n = txq->tx_rs_thresh;
> +
> + /* Mbufs recycle mode can only support no ring buffer wrapping around.
> + * Two case for this:
> + *
> + * case 1: The refill head of Rx buffer ring needs to be aligned with
> + * mbuf ring size. In this case, the number of Tx freeing buffers
> + * should be equal to refill_requirement.
> + *
> + * case 2: The refill head of Rx ring buffer does not need to be aligned
> + * with mbuf ring size. In this case, the update of refill head can not
> + * exceed the Rx mbuf ring size.
> + */
> + if (refill_requirement != n ||
> + (!refill_requirement && (refill_head + n > mbuf_ring_size)))
> + return 0;
> +
> + /* First buffer to free from S/W ring is at index
> + * tx_next_dd - (tx_rs_thresh-1).
> + */
> + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> + rxep = recycle_rxq_info->mbuf_ring;
> + rxep += refill_head;
> +
> + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> + /* Directly put mbufs from Tx to Rx. */
> + for (i = 0; i < n; i++, rxep++, txep++)
> + *rxep = txep[0].mbuf;
> + } else {
> + for (i = 0, j = 0; i < n; i++) {
> + /* Avoid txq contains buffers from expected mempool. */
> + if (unlikely(recycle_rxq_info->mp
> + != txep[i].mbuf->pool))
> + return 0;
> I don't think that it is possible to simply return 0 here:
> we might already have some mbufs inside rxep[], so we probably need
> to return that number (j).
> 
> No, here is just pre-free, not actually put mbufs into rxeq.
 
I understand that.
What I am saying is: after you call pktmbuf_prefree_seg(mbuf),
you can’t keep it in the txq anymore.
You have to either put it into rxep[], or back into the mempool.
Also, the txq state (nb_tx_free, etc.) needs to be updated.

> After run out of the loop, we call rte_memcpy to actually copy
> mbufs into rxep.
> +
> + m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> +
> + /* In case 1, each of Tx buffers should be the
> + * last reference.
> + */
> + if (unlikely(m[j] == NULL && refill_requirement))
> + return 0;
> 
> same here, we can't simply return 0, it will introduce mbuf leakage.
> + /* In case 2, the number of valid Tx free
> + * buffers should be recorded.
> + */
> + j++;
> + }
> + rte_memcpy(rxep, m, sizeof(void *) * j);
> Wonder why do you need intermediate buffer for released mbufs?
> Why can't just:
> ...
> m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> ...
> rxep[j++] = m;
> ?
> Might save you few extra cycles.
> Sometimes ‘rte_pktmbuf_prefree_seg’ can return NULL due to
> mbuf->refcnt > 1. So we should firstly ensure all ‘m’ are valid and
> then copy them into rxep.

I understand that, but you can check whether it is NULL or not.

> + }
> +
> + /* Update counters for Tx. */
> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> + if (txq->tx_next_dd >= txq->nb_tx_desc)
> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> +
> + return n;
> +}

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-06-06  7:10         ` Konstantin Ananyev
@ 2023-06-06  7:31           ` Feifei Wang
  2023-06-06  8:34             ` Konstantin Ananyev
  0 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2023-06-06  7:31 UTC (permalink / raw)
  To: Konstantin Ananyev,
	Константин
	Ананьев,
	thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd, nd

[...]
> > Probably I am missing something, but why it is not possible to do something
> like that:
> >
> > rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
> > tx_queue_id=M, ...); ....
> > rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
> > tx_queue_id=K, ...);
> >
> > I.E. feed rx queue from 2 tx queues?
> >
> > Two problems for this:
> > 1. If we have 2 tx queues for rx, the thread should make the extra
> > judgement to decide which one to choose in the driver layer.
> 
> Not sure, why on the driver layer?
> The example I gave above - decision is made on application layer.
> Lets say first call didn't free enough mbufs, so app decided to use second txq
> for rearm.
[Feifei] I think the current mbuf recycle mode can already support this usage. For example:
n =  rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=M, ...);
if (n < planned_number)
rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=K, ...);

Thus, if users want, they can do it like this.
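
A slightly fuller application-level sketch of the same pattern (the port/queue
ids and planned_number are placeholders for this example):

	uint16_t n;

	n = rte_eth_recycle_mbufs(rx_port, rx_queue, tx_port, txq_m,
			&recycle_rxq_info);
	if (n < planned_number)
		n += rte_eth_recycle_mbufs(rx_port, rx_queue, tx_port, txq_k,
				&recycle_rxq_info);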

> 
> > On the other hand, current mechanism can support users to switch 1 txq
> > to another timely in the application layer. If user want to choose
> > another txq, he just need to change the txq_queue_id parameter in the API.
> > 2. If you want one rxq to support two txq at the same time, this needs
> > to add spinlock on guard variable to avoid multi-thread conflict.
> > Spinlock will decrease the data-path performance greatly.  Thus, we do
> > not consider
> > 1 rxq mapping multiple txqs here.
> 
> I am talking about situation when one thread controls 2 tx queues.
> 
> > + *
> > + * @param rx_port_id
> > + * Port identifying the receive side.
> > + * @param rx_queue_id
> > + * The index of the receive queue identifying the receive side.
> > + * The value must be in the range [0, nb_rx_queue - 1] previously
> > +supplied
> > + * to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + * Port identifying the transmit side.
> > + * @param tx_queue_id
> > + * The index of the transmit queue identifying the transmit side.
> > + * The value must be in the range [0, nb_tx_queue - 1] previously
> > +supplied
> > + * to rte_eth_dev_configure().
> > + * @param recycle_rxq_info
> > + * A pointer to a structure of type *rte_eth_recycle_rxq_info* which
> > +contains
> > + * the information of the Rx queue mbuf ring.
> > + * @return
> > + * The number of recycling mbufs.
> > + */
> > +__rte_experimental
> > +static inline uint16_t
> > +rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
> > +uint16_t tx_port_id, uint16_t tx_queue_id,  struct
> > +rte_eth_recycle_rxq_info *recycle_rxq_info) {  struct rte_eth_fp_ops
> > +*p;  void *qd;  uint16_t nb_mbufs;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > + if (tx_port_id >= RTE_MAX_ETHPORTS ||  tx_queue_id >=
> > +RTE_MAX_QUEUES_PER_PORT) {  RTE_ETHDEV_LOG(ERR,  "Invalid
> > +tx_port_id=%u or tx_queue_id=%u\n",  tx_port_id, tx_queue_id);
> > +return 0;  } #endif
> > +
> > + /* fetch pointer to queue data */
> > + p = &rte_eth_fp_ops[tx_port_id];
> > + qd = p->txq.data[tx_queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > + RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
> > +
> > + if (qd == NULL) {
> > + RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
> > +tx_queue_id, tx_port_id);  return 0;  } #endif  if
> > +(p->recycle_tx_mbufs_reuse == NULL)  return 0;
> > +
> > + /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
> > + * into Rx mbuf ring.
> > + */
> > + nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
> > +
> > + /* If no recycling mbufs, return 0. */ if (nb_mbufs == 0) return 0;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > + if (rx_port_id >= RTE_MAX_ETHPORTS ||  rx_queue_id >=
> > +RTE_MAX_QUEUES_PER_PORT) {  RTE_ETHDEV_LOG(ERR, "Invalid
> > +rx_port_id=%u or rx_queue_id=%u\n",  rx_port_id, rx_queue_id);
> > +return 0;  } #endif
> > +
> > + /* fetch pointer to queue data */
> > + p = &rte_eth_fp_ops[rx_port_id];
> > + qd = p->rxq.data[rx_queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > + RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
> > +
> > + if (qd == NULL) {
> > + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
> > +rx_queue_id, rx_port_id);  return 0;  } #endif
> > +
> > + if (p->recycle_rx_descriptors_refill == NULL) return 0;
> > +
> > + /* Replenish the Rx descriptors with the recycling
> > + * into Rx mbuf ring.
> > + */
> > + p->recycle_rx_descriptors_refill(qd, nb_mbufs);
> > +
> > + return nb_mbufs;
> > +}
> > +
> >  /**
> >   * @warning
> >   * @b EXPERIMENTAL: this API may change without prior notice diff
> > --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> > index dcf8adab92..a2e6ea6b6c 100644
> > --- a/lib/ethdev/rte_ethdev_core.h
> > +++ b/lib/ethdev/rte_ethdev_core.h
> > @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void
> > *rxq, uint16_t offset);
> >  /** @internal Check the status of a Tx descriptor */  typedef int
> > (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> >
> > +/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
> > +typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,  struct
> > +rte_eth_recycle_rxq_info *recycle_rxq_info);
> > +
> > +/** @internal Refill Rx descriptors with the recycling mbufs */
> > +typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq,
> > +uint16_t nb);
> > +
> >  /**
> >   * @internal
> >   * Structure used to hold opaque pointers to internal ethdev Rx/Tx @@
> > -90,9 +97,11 @@ struct rte_eth_fp_ops {
> >          eth_rx_queue_count_t rx_queue_count;
> >          /** Check the status of a Rx descriptor. */
> >          eth_rx_descriptor_status_t rx_descriptor_status;
> > + /** Refill Rx descriptors with the recycling mbufs. */
> > + eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
> > I am afraid we can't put new fields here without ABI breakage.
> >
> > Agree
> >
> > It has to be below rxq.
> > Now thinking about current layout probably not the best one, and when
> > introducing this struct, I should probably put rxq either on the top
> > of the struct, or on the next cache line.
> > But such change is not possible right now anyway.
> > Same story for txq.
> >
> > Thus we should rearrange the structure like below:
> > struct rte_eth_fp_ops {
> >     struct rte_ethdev_qdata rxq;
> >          eth_rx_burst_t rx_pkt_burst;
> >          eth_rx_queue_count_t rx_queue_count;
> >          eth_rx_descriptor_status_t rx_descriptor_status;
> >        eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
> >               uintptr_t reserved1[2];
> > }
> 
> Yes, I think such layout will be better.
> The only problem here - we have to wait for 23.11 for that.
> 
Ok, even without this change we may still need to wait, because mbufs_recycle introduces other
ABI breakage, such as the change to 'struct rte_eth_dev'.
> >
> >
> >          /** Rx queues data. */
> >          struct rte_ethdev_qdata rxq;
> > - uintptr_t reserved1[3];
> > + uintptr_t reserved1[2];
> >          /**@}*/
> >
> >          /**@{*/
> > @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
> >          eth_tx_prep_t tx_pkt_prepare;
> >          /** Check the status of a Tx descriptor. */
> >          eth_tx_descriptor_status_t tx_descriptor_status;
> > + /** Copy used mbufs from Tx mbuf ring into Rx. */
> > + eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
> >          /** Tx queues data. */
> >          struct rte_ethdev_qdata txq;
> > - uintptr_t reserved2[3];
> > + uintptr_t reserved2[2];
> >          /**@}*/
> >
> >  } __rte_cache_aligned;
> > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > 357d1a88c0..45c417f6bd 100644
> > --- a/lib/ethdev/version.map
> > +++ b/lib/ethdev/version.map
> > @@ -299,6 +299,10 @@ EXPERIMENTAL {
> >          rte_flow_action_handle_query_update;
> >          rte_flow_async_action_handle_query_update;
> >          rte_flow_async_create_by_index;
> > +
> > + # added in 23.07
> > + rte_eth_recycle_mbufs;
> > + rte_eth_recycle_rx_queue_info_get;
> >  };
> >
> >  INTERNAL {
> > --
> > 2.25.1
> >

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 2/4] net/i40e: implement mbufs recycle mode
  2023-06-06  7:18         ` Konstantin Ananyev
@ 2023-06-06  7:58           ` Feifei Wang
  2023-06-06  8:27             ` Konstantin Ananyev
  0 siblings, 1 reply; 67+ messages in thread
From: Feifei Wang @ 2023-06-06  7:58 UTC (permalink / raw)
  To: Konstantin Ananyev,
	Константин
	Ананьев,
	Yuying Zhang, Beilei Xing
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd, nd



> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Tuesday, June 6, 2023 3:18 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Константин Ананьев
> <konstantin.v.ananyev@yandex.ru>; Yuying Zhang
> <yuying.zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v6 2/4] net/i40e: implement mbufs recycle mode
> 
> >
> > Define specific function implementation for i40e driver.
> > Currently, mbufs recycle mode can support 128bit vector path and avx2
> > path. And can be enabled both in fast free and no fast free mode.
> >
> > Suggested-by: Honnappa Nagarahalli
> > <mailto:honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <mailto:feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <mailto:ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli
> > <mailto:honnappa.nagarahalli@arm.com>
> > ---
> >  drivers/net/i40e/i40e_ethdev.c | 1 +
> >  drivers/net/i40e/i40e_ethdev.h | 2 +
> >  .../net/i40e/i40e_recycle_mbufs_vec_common.c | 140
> ++++++++++++++++++
> > drivers/net/i40e/i40e_rxtx.c | 32 ++++  drivers/net/i40e/i40e_rxtx.h |
> > 4 +  drivers/net/i40e/meson.build | 2 +
> >  6 files changed, 181 insertions(+)
> >  create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> >
> > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > b/drivers/net/i40e/i40e_ethdev.c index f9d8f9791f..d4eecd16cf 100644
> > --- a/drivers/net/i40e/i40e_ethdev.c
> > +++ b/drivers/net/i40e/i40e_ethdev.c
> > @@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops =
> {
> >          .flow_ops_get = i40e_dev_flow_ops_get,
> >          .rxq_info_get = i40e_rxq_info_get,
> >          .txq_info_get = i40e_txq_info_get,
> > + .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
> >          .rx_burst_mode_get = i40e_rx_burst_mode_get,
> >          .tx_burst_mode_get = i40e_tx_burst_mode_get,
> >          .timesync_enable = i40e_timesync_enable, diff --git
> > a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
> > index 9b806d130e..b5b2d6cf2b 100644
> > --- a/drivers/net/i40e/i40e_ethdev.h
> > +++ b/drivers/net/i40e/i40e_ethdev.h
> > @@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev,
> uint16_t queue_id,
> >          struct rte_eth_rxq_info *qinfo);  void
> > i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> >          struct rte_eth_txq_info *qinfo);
> > +void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t
> > +queue_id,  struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> >  int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> >                             struct rte_eth_burst_mode *mode);  int
> > i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > new file mode 100644
> > index 0000000000..08d708fd7d
> > --- /dev/null
> > +++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > @@ -0,0 +1,140 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright (c) 2023 Arm Limited.
> > + */
> > +
> > +#include <stdint.h>
> > +#include <ethdev_driver.h>
> > +
> > +#include "base/i40e_prototype.h"
> > +#include "base/i40e_type.h"
> > +#include "i40e_ethdev.h"
> > +#include "i40e_rxtx.h"
> > +
> > +#pragma GCC diagnostic ignored "-Wcast-qual"
> > +
> > +void
> > +i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t
> > +nb_mbufs) {  struct i40e_rx_queue *rxq = rx_queue;  struct
> > +i40e_rx_entry *rxep;  volatile union i40e_rx_desc *rxdp;  uint16_t
> > +rx_id;  uint64_t paddr;  uint64_t dma_addr;  uint16_t i;
> > +
> > + rxdp = rxq->rx_ring + rxq->rxrearm_start; rxep =
> > + &rxq->sw_ring[rxq->rxrearm_start];
> > +
> > + for (i = 0; i < nb_mbufs; i++) {
> > + /* Initialize rxdp descs. */
> > + paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
> dma_addr =
> > + rte_cpu_to_le_64(paddr);
> > + /* flush desc with pa dma_addr */
> > + rxdp[i].read.hdr_addr = 0;
> > + rxdp[i].read.pkt_addr = dma_addr;
> > + }
> > +
> > + /* Update the descriptor initializer index */
> > + rxq->rxrearm_start += nb_mbufs;
> > + rx_id = rxq->rxrearm_start - 1;
> > +
> > + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> > + rxq->rxrearm_start = 0;
> > + rx_id = rxq->nb_rx_desc - 1;
> > + }
> > +
> > + rxq->rxrearm_nb -= nb_mbufs;
> > +
> > + rte_io_wmb();
> > + /* Update the tail pointer on the NIC */
> > +I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id); }
> > +
> > +uint16_t
> > +i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,  struct
> > +rte_eth_recycle_rxq_info *recycle_rxq_info) {  struct i40e_tx_queue
> > +*txq = tx_queue;  struct i40e_tx_entry *txep;  struct rte_mbuf
> > +**rxep;  struct rte_mbuf *m[RTE_I40E_TX_MAX_FREE_BUF_SZ];  int i, j,
> > +n;  uint16_t avail = 0;  uint16_t mbuf_ring_size =
> > +recycle_rxq_info->mbuf_ring_size;
> > + uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;  uint16_t
> > +refill_requirement = recycle_rxq_info->refill_requirement;
> > + uint16_t refill_head = *recycle_rxq_info->refill_head;  uint16_t
> > +receive_tail = *recycle_rxq_info->receive_tail;
> > +
> > + /* Get available recycling Rx buffers. */ avail = (mbuf_ring_size -
> > + (refill_head - receive_tail)) & mask;
> > +
> > + /* Check Tx free thresh and Rx available space. */ if
> > + (txq->nb_tx_free > txq->tx_free_thresh || avail <=
> > + txq->tx_rs_thresh) return 0;
> > +
> > + /* check DD bits on threshold descriptor */ if
> > + ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> > + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> > + rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> > + return 0;
> > +
> > + n = txq->tx_rs_thresh;
> > +
> > + /* Mbufs recycle mode can only support no ring buffer wrapping around.
> > + * Two case for this:
> > + *
> > + * case 1: The refill head of Rx buffer ring needs to be aligned with
> > + * mbuf ring size. In this case, the number of Tx freeing buffers
> > + * should be equal to refill_requirement.
> > + *
> > + * case 2: The refill head of Rx ring buffer does not need to be
> > + aligned
> > + * with mbuf ring size. In this case, the update of refill head can
> > + not
> > + * exceed the Rx mbuf ring size.
> > + */
> > + if (refill_requirement != n ||
> > + (!refill_requirement && (refill_head + n > mbuf_ring_size))) return
> > + 0;
> > +
> > + /* First buffer to free from S/W ring is at index
> > + * tx_next_dd - (tx_rs_thresh-1).
> > + */
> > + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)]; rxep =
> > + recycle_rxq_info->mbuf_ring; rxep += refill_head;
> > +
> > + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > + /* Directly put mbufs from Tx to Rx. */ for (i = 0; i < n; i++,
> > + rxep++, txep++) *rxep = txep[0].mbuf; } else { for (i = 0, j = 0; i
> > + < n; i++) {
> > + /* Avoid txq contains buffers from expected mempool. */ if
> > + (unlikely(recycle_rxq_info->mp != txep[i].mbuf->pool)) return 0;
> > I don't think that it is possible to simply return 0 here:
> > we might already have some mbufs inside rxep[], so we probably need to
> > return that number (j).
> >
> > No, here is just pre-free, not actually put mbufs into rxeq.
> 
> I understand that.
> What I am saying: after you call pktmbuf_prefree_seg(mbuf), you can’t keep it
> in the txq anymore.
> You have either to put it into rxep[], or back into mempool.
> Also txq state (nb_tx_free, etc.) need to be updated.
> 
> > After run out of the loop, we call rte_memcpy to actually copy mbufs
> > into rxep.
> > +
> > + m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > +
> > + /* In case 1, each of Tx buffers should be the
> > + * last reference.
> > + */
> > + if (unlikely(m[j] == NULL && refill_requirement)) return 0;
> >
> > same here, we can't simply return 0, it will introduce mbuf leakage.
> > + /* In case 2, the number of valid Tx free
> > + * buffers should be recorded.
> > + */
> > + j++;
> > + }
> > + rte_memcpy(rxep, m, sizeof(void *) * j);
> > Wonder why do you need intermediate buffer for released mbufs?
> > Why can't just:
> > ...
> > m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > ...
> > rxep[j++] = m;
> > ?
> > Might save you few extra cycles.
> > Sometimes ‘rte_pktmbuf_prefree_seg’ can return NULL due to
> > mbuf->refcnt > 1. So we should firstly ensure all ‘m’ are valid and
> > then copy them into rxep.
> 
> I understand that, but you can check is it NULL or not.
For the i40e rxq, it must rearm 'RTE_I40E_RXQ_REARM_THRESH' packets at a time,
due to its ring wrapping mechanism.

For the i40e txq, it must free 'txq->tx_rs_thresh' packets at a time.

So we first need to ensure that all Tx free mbufs are valid, and only then copy them into the rxq.
If there are not enough valid mbufs, the rxq's ring wrapping mechanism would be broken.

> 
> > + }
> > +
> > + /* Update counters for Tx. */
> > + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> > + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> > + if (txq->tx_next_dd >= txq->nb_tx_desc)
> > + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> > +
> > + return n;
> > +}

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 2/4] net/i40e: implement mbufs recycle mode
  2023-06-06  7:58           ` Feifei Wang
@ 2023-06-06  8:27             ` Konstantin Ananyev
  2023-06-12  3:05               ` Feifei Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Konstantin Ananyev @ 2023-06-06  8:27 UTC (permalink / raw)
  To: Feifei Wang,
	Константин
	Ананьев,
	Yuying Zhang, Beilei Xing
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd, nd



> > > Define specific function implementation for i40e driver.
> > > Currently, mbufs recycle mode can support 128bit vector path and avx2
> > > path. And can be enabled both in fast free and no fast free mode.
> > >
> > > Suggested-by: Honnappa Nagarahalli
> > > <mailto:honnappa.nagarahalli@arm.com>
> > > Signed-off-by: Feifei Wang <mailto:feifei.wang2@arm.com>
> > > Reviewed-by: Ruifeng Wang <mailto:ruifeng.wang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli
> > > <mailto:honnappa.nagarahalli@arm.com>
> > > ---
> > >  drivers/net/i40e/i40e_ethdev.c | 1 +
> > >  drivers/net/i40e/i40e_ethdev.h | 2 +
> > >  .../net/i40e/i40e_recycle_mbufs_vec_common.c | 140
> > ++++++++++++++++++
> > > drivers/net/i40e/i40e_rxtx.c | 32 ++++  drivers/net/i40e/i40e_rxtx.h |
> > > 4 +  drivers/net/i40e/meson.build | 2 +
> > >  6 files changed, 181 insertions(+)
> > >  create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > >
> > > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > > b/drivers/net/i40e/i40e_ethdev.c index f9d8f9791f..d4eecd16cf 100644
> > > --- a/drivers/net/i40e/i40e_ethdev.c
> > > +++ b/drivers/net/i40e/i40e_ethdev.c
> > > @@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops =
> > {
> > >          .flow_ops_get = i40e_dev_flow_ops_get,
> > >          .rxq_info_get = i40e_rxq_info_get,
> > >          .txq_info_get = i40e_txq_info_get,
> > > + .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
> > >          .rx_burst_mode_get = i40e_rx_burst_mode_get,
> > >          .tx_burst_mode_get = i40e_tx_burst_mode_get,
> > >          .timesync_enable = i40e_timesync_enable, diff --git
> > > a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
> > > index 9b806d130e..b5b2d6cf2b 100644
> > > --- a/drivers/net/i40e/i40e_ethdev.h
> > > +++ b/drivers/net/i40e/i40e_ethdev.h
> > > @@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev,
> > uint16_t queue_id,
> > >          struct rte_eth_rxq_info *qinfo);  void
> > > i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > >          struct rte_eth_txq_info *qinfo);
> > > +void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t
> > > +queue_id,  struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> > >  int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > >                             struct rte_eth_burst_mode *mode);  int
> > > i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > > diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > new file mode 100644
> > > index 0000000000..08d708fd7d
> > > --- /dev/null
> > > +++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > @@ -0,0 +1,140 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright (c) 2023 Arm Limited.
> > > + */
> > > +
> > > +#include <stdint.h>
> > > +#include <ethdev_driver.h>
> > > +
> > > +#include "base/i40e_prototype.h"
> > > +#include "base/i40e_type.h"
> > > +#include "i40e_ethdev.h"
> > > +#include "i40e_rxtx.h"
> > > +
> > > +#pragma GCC diagnostic ignored "-Wcast-qual"
> > > +
> > > +void
> > > +i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t
> > > +nb_mbufs) {  struct i40e_rx_queue *rxq = rx_queue;  struct
> > > +i40e_rx_entry *rxep;  volatile union i40e_rx_desc *rxdp;  uint16_t
> > > +rx_id;  uint64_t paddr;  uint64_t dma_addr;  uint16_t i;
> > > +
> > > + rxdp = rxq->rx_ring + rxq->rxrearm_start; rxep =
> > > + &rxq->sw_ring[rxq->rxrearm_start];
> > > +
> > > + for (i = 0; i < nb_mbufs; i++) {
> > > + /* Initialize rxdp descs. */
> > > + paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
> > dma_addr =
> > > + rte_cpu_to_le_64(paddr);
> > > + /* flush desc with pa dma_addr */
> > > + rxdp[i].read.hdr_addr = 0;
> > > + rxdp[i].read.pkt_addr = dma_addr;
> > > + }
> > > +
> > > + /* Update the descriptor initializer index */
> > > + rxq->rxrearm_start += nb_mbufs;
> > > + rx_id = rxq->rxrearm_start - 1;
> > > +
> > > + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> > > + rxq->rxrearm_start = 0;
> > > + rx_id = rxq->nb_rx_desc - 1;
> > > + }
> > > +
> > > + rxq->rxrearm_nb -= nb_mbufs;
> > > +
> > > + rte_io_wmb();
> > > + /* Update the tail pointer on the NIC */
> > > +I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id); }
> > > +
> > > +uint16_t
> > > +i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,  struct
> > > +rte_eth_recycle_rxq_info *recycle_rxq_info) {  struct i40e_tx_queue
> > > +*txq = tx_queue;  struct i40e_tx_entry *txep;  struct rte_mbuf
> > > +**rxep;  struct rte_mbuf *m[RTE_I40E_TX_MAX_FREE_BUF_SZ];  int i, j,
> > > +n;  uint16_t avail = 0;  uint16_t mbuf_ring_size =
> > > +recycle_rxq_info->mbuf_ring_size;
> > > + uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;  uint16_t
> > > +refill_requirement = recycle_rxq_info->refill_requirement;
> > > + uint16_t refill_head = *recycle_rxq_info->refill_head;  uint16_t
> > > +receive_tail = *recycle_rxq_info->receive_tail;
> > > +
> > > + /* Get available recycling Rx buffers. */ avail = (mbuf_ring_size -
> > > + (refill_head - receive_tail)) & mask;
> > > +
> > > + /* Check Tx free thresh and Rx available space. */ if
> > > + (txq->nb_tx_free > txq->tx_free_thresh || avail <=
> > > + txq->tx_rs_thresh) return 0;
> > > +
> > > + /* check DD bits on threshold descriptor */ if
> > > + ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> > > + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> > > + rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> > > + return 0;
> > > +
> > > + n = txq->tx_rs_thresh;
> > > +
> > > + /* Mbufs recycle mode can only support no ring buffer wrapping around.
> > > + * Two case for this:
> > > + *
> > > + * case 1: The refill head of Rx buffer ring needs to be aligned with
> > > + * mbuf ring size. In this case, the number of Tx freeing buffers
> > > + * should be equal to refill_requirement.
> > > + *
> > > + * case 2: The refill head of Rx ring buffer does not need to be
> > > + aligned
> > > + * with mbuf ring size. In this case, the update of refill head can
> > > + not
> > > + * exceed the Rx mbuf ring size.
> > > + */
> > > + if (refill_requirement != n ||
> > > + (!refill_requirement && (refill_head + n > mbuf_ring_size))) return
> > > + 0;
> > > +
> > > + /* First buffer to free from S/W ring is at index
> > > + * tx_next_dd - (tx_rs_thresh-1).
> > > + */
> > > + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)]; rxep =
> > > + recycle_rxq_info->mbuf_ring; rxep += refill_head;
> > > +
> > > + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > > + /* Directly put mbufs from Tx to Rx. */ for (i = 0; i < n; i++,
> > > + rxep++, txep++) *rxep = txep[0].mbuf; } else { for (i = 0, j = 0; i
> > > + < n; i++) {
> > > + /* Avoid txq contains buffers from expected mempool. */ if
> > > + (unlikely(recycle_rxq_info->mp != txep[i].mbuf->pool)) return 0;
> > > I don't think that it is possible to simply return 0 here:
> > > we might already have some mbufs inside rxep[], so we probably need to
> > > return that number (j).
> > >
> > > No, this is just a pre-free; the mbufs are not actually put into rxep yet.
> >
> > I understand that.
> > What I am saying: after you call pktmbuf_prefree_seg(mbuf), you can’t keep it
> > in the txq anymore.
> > You have either to put it into rxep[], or back into mempool.
> > Also txq state (nb_tx_free, etc.) need to be updated.
> >
> > > After run out of the loop, we call rte_memcpy to actually copy mbufs
> > > into rxep.
> > > +
> > > + m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > > +
> > > + /* In case 1, each of Tx buffers should be the
> > > + * last reference.
> > > + */
> > > + if (unlikely(m[j] == NULL && refill_requirement)) return 0;
> > >
> > > same here, we can't simply return 0, it will introduce mbuf leakage.
> > > + /* In case 2, the number of valid Tx free
> > > + * buffers should be recorded.
> > > + */
> > > + j++;
> > > + }
> > > + rte_memcpy(rxep, m, sizeof(void *) * j);
> > > Wonder why do you need intermediate buffer for released mbufs?
> > > Why can't just:
> > > ...
> > > m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > > ...
> > > rxep[j++] = m;
> > > ?
> > > Might save you few extra cycles.
> > > Sometimes ‘rte_pktmbuf_prefree_seg’ can return NULL due to
> > > mbuf->refcnt > 1. So we should firstly ensure all ‘m’ are valid and
> > > then copy them into rxep.
> >
> > I understand that, but you can check is it NULL or not.
> For i40e rxq, it must rearm ' RTE_I40E_RXQ_REARM_THRESH ' pkts once a time
> based on its ring wrapping mechanism.
> 
> For i40e txq, it must free ' txq->tx_rs_thresh' pkts once a time.
> 
> So we need firstly ensure all tx free mbufs are valid, and then copy these into rxq.
> If not enough valid mbufs, it will break rxq's ring wrapping mechanism.
 
I think you can still copy mbufs into rxep[]; if there are not enough mbufs,
you can still return 0 (or whatever the proper value is here), and that would mean
all these new rxep[] entries will be considered invalid.
Anyway, that's just a suggestion to avoid the extra copy.
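
Just to illustrate the idea, an untested sketch (with 'm' as a single mbuf pointer
instead of the temporary array; it deliberately leaves aside the leakage question
discussed above):

	} else {
		for (i = 0, j = 0; i < n; i++) {
			/* Bail out on an mbuf from an unexpected mempool. */
			if (unlikely(recycle_rxq_info->mp != txep[i].mbuf->pool))
				break;

			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);

			/* In case 1, every Tx buffer must be the last reference. */
			if (unlikely(m == NULL && refill_requirement))
				break;

			if (m != NULL)
				rxep[j++] = m; /* write straight into the Rx mbuf ring */
		}

		/* Could not recycle all 'n' buffers: return 0 (or whatever
		 * value is proper here), so the entries already written into
		 * rxep[] are treated as invalid.
		 */
		if (i != n)
			return 0;
	}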

> 
> >
> > > + }
> > > +
> > > + /* Update counters for Tx. */
> > > + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> > > + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> > > + if (txq->tx_next_dd >= txq->nb_tx_desc)
> > > + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> > > +
> > > + return n;
> > > +}

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-06-06  7:31           ` Feifei Wang
@ 2023-06-06  8:34             ` Konstantin Ananyev
  2023-06-07  0:00               ` Ferruh Yigit
  0 siblings, 1 reply; 67+ messages in thread
From: Konstantin Ananyev @ 2023-06-06  8:34 UTC (permalink / raw)
  To: Feifei Wang,
	Константин
	Ананьев,
	thomas, Ferruh Yigit, Andrew Rybchenko
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd, nd



> 
> [...]
> > > Probably I am missing something, but why it is not possible to do something
> > like that:
> > >
> > > rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
> > > tx_queue_id=M, ...); ....
> > > rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
> > > tx_queue_id=K, ...);
> > >
> > > I.E. feed rx queue from 2 tx queues?
> > >
> > > Two problems for this:
> > > 1. If we have 2 tx queues for rx, the thread should make the extra
> > > judgement to decide which one to choose in the driver layer.
> >
> > Not sure, why on the driver layer?
> > The example I gave above - decision is made on application layer.
> > Lets say first call didn't free enough mbufs, so app decided to use second txq
> > for rearm.
> [Feifei] I think currently mbuf recycle mode can support this usage. For examples:
> n =  rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=M, ...);
> if (n < planned_number)
> rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=K, ...);
> 
> Thus, if users want, they can do like this.

Yes, that was my thought, that's why I was surprise that in the comments we have:
" Currently, the rte_eth_recycle_mbufs() function can only support one-time pairing
* between the receive queue and transmit queue. Do not pair one receive queue with
 * multiple transmit queues or pair one transmit queue with multiple receive queues,
 * in order to avoid memory error rewriting."

> 
> >
> > > On the other hand, current mechanism can support users to switch 1 txq
> > > to another timely in the application layer. If user want to choose
> > > another txq, he just need to change the txq_queue_id parameter in the API.
> > > 2. If you want one rxq to support two txq at the same time, this needs
> > > to add spinlock on guard variable to avoid multi-thread conflict.
> > > Spinlock will decrease the data-path performance greatly.  Thus, we do
> > > not consider
> > > 1 rxq mapping multiple txqs here.
> >
> > I am talking about situation when one thread controls 2 tx queues.
> >
> > > + *
> > > + * @param rx_port_id
> > > + * Port identifying the receive side.
> > > + * @param rx_queue_id
> > > + * The index of the receive queue identifying the receive side.
> > > + * The value must be in the range [0, nb_rx_queue - 1] previously
> > > +supplied
> > > + * to rte_eth_dev_configure().
> > > + * @param tx_port_id
> > > + * Port identifying the transmit side.
> > > + * @param tx_queue_id
> > > + * The index of the transmit queue identifying the transmit side.
> > > + * The value must be in the range [0, nb_tx_queue - 1] previously
> > > +supplied
> > > + * to rte_eth_dev_configure().
> > > + * @param recycle_rxq_info
> > > + * A pointer to a structure of type *rte_eth_recycle_rxq_info* which
> > > +contains
> > > + * the information of the Rx queue mbuf ring.
> > > + * @return
> > > + * The number of recycling mbufs.
> > > + */
> > > +__rte_experimental
> > > +static inline uint16_t
> > > +rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
> > > +uint16_t tx_port_id, uint16_t tx_queue_id,  struct
> > > +rte_eth_recycle_rxq_info *recycle_rxq_info) {  struct rte_eth_fp_ops
> > > +*p;  void *qd;  uint16_t nb_mbufs;
> > > +
> > > +#ifdef RTE_ETHDEV_DEBUG_TX
> > > + if (tx_port_id >= RTE_MAX_ETHPORTS ||  tx_queue_id >=
> > > +RTE_MAX_QUEUES_PER_PORT) {  RTE_ETHDEV_LOG(ERR,  "Invalid
> > > +tx_port_id=%u or tx_queue_id=%u\n",  tx_port_id, tx_queue_id);
> > > +return 0;  } #endif
> > > +
> > > + /* fetch pointer to queue data */
> > > + p = &rte_eth_fp_ops[tx_port_id];
> > > + qd = p->txq.data[tx_queue_id];
> > > +
> > > +#ifdef RTE_ETHDEV_DEBUG_TX
> > > + RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
> > > +
> > > + if (qd == NULL) {
> > > + RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
> > > +tx_queue_id, tx_port_id);  return 0;  } #endif  if
> > > +(p->recycle_tx_mbufs_reuse == NULL)  return 0;
> > > +
> > > + /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
> > > + * into Rx mbuf ring.
> > > + */
> > > + nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
> > > +
> > > + /* If no recycling mbufs, return 0. */ if (nb_mbufs == 0) return 0;
> > > +
> > > +#ifdef RTE_ETHDEV_DEBUG_RX
> > > + if (rx_port_id >= RTE_MAX_ETHPORTS ||  rx_queue_id >=
> > > +RTE_MAX_QUEUES_PER_PORT) {  RTE_ETHDEV_LOG(ERR, "Invalid
> > > +rx_port_id=%u or rx_queue_id=%u\n",  rx_port_id, rx_queue_id);
> > > +return 0;  } #endif
> > > +
> > > + /* fetch pointer to queue data */
> > > + p = &rte_eth_fp_ops[rx_port_id];
> > > + qd = p->rxq.data[rx_queue_id];
> > > +
> > > +#ifdef RTE_ETHDEV_DEBUG_RX
> > > + RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
> > > +
> > > + if (qd == NULL) {
> > > + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
> > > +rx_queue_id, rx_port_id);  return 0;  } #endif
> > > +
> > > + if (p->recycle_rx_descriptors_refill == NULL) return 0;
> > > +
> > > + /* Replenish the Rx descriptors with the recycling
> > > + * into Rx mbuf ring.
> > > + */
> > > + p->recycle_rx_descriptors_refill(qd, nb_mbufs);
> > > +
> > > + return nb_mbufs;
> > > +}
> > > +
> > >  /**
> > >   * @warning
> > >   * @b EXPERIMENTAL: this API may change without prior notice diff
> > > --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> > > index dcf8adab92..a2e6ea6b6c 100644
> > > --- a/lib/ethdev/rte_ethdev_core.h
> > > +++ b/lib/ethdev/rte_ethdev_core.h
> > > @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void
> > > *rxq, uint16_t offset);
> > >  /** @internal Check the status of a Tx descriptor */  typedef int
> > > (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> > >
> > > +/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
> > > +typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,  struct
> > > +rte_eth_recycle_rxq_info *recycle_rxq_info);
> > > +
> > > +/** @internal Refill Rx descriptors with the recycling mbufs */
> > > +typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq,
> > > +uint16_t nb);
> > > +
> > >  /**
> > >   * @internal
> > >   * Structure used to hold opaque pointers to internal ethdev Rx/Tx @@
> > > -90,9 +97,11 @@ struct rte_eth_fp_ops {
> > >          eth_rx_queue_count_t rx_queue_count;
> > >          /** Check the status of a Rx descriptor. */
> > >          eth_rx_descriptor_status_t rx_descriptor_status;
> > > + /** Refill Rx descriptors with the recycling mbufs. */
> > > + eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
> > > I am afraid we can't put new fields here without ABI breakage.
> > >
> > > Agree
> > >
> > > It has to be below rxq.
> > > Now thinking about current layout probably not the best one, and when
> > > introducing this struct, I should probably put rxq either on the top
> > > of the struct, or on the next cache line.
> > > But such change is not possible right now anyway.
> > > Same story for txq.
> > >
> > > Thus we should rearrange the structure like below:
> > > struct rte_eth_fp_ops {
> > >     struct rte_ethdev_qdata rxq;
> > >          eth_rx_burst_t rx_pkt_burst;
> > >          eth_rx_queue_count_t rx_queue_count;
> > >          eth_rx_descriptor_status_t rx_descriptor_status;
> > >        eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
> > >               uintptr_t reserved1[2];
> > > }
> >
> > Yes, I think such layout will be better.
> > The only problem here - we have to wait for 23.11 for that.
> >
> Ok, if not this change, maybe we still need to wait. Because mbufs_recycle have other
> ABI breakage. Such as the change for 'struct rte_eth_dev'.

Ok by me.

> > >
> > >
> > >          /** Rx queues data. */
> > >          struct rte_ethdev_qdata rxq;
> > > - uintptr_t reserved1[3];
> > > + uintptr_t reserved1[2];
> > >          /**@}*/
> > >
> > >          /**@{*/
> > > @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
> > >          eth_tx_prep_t tx_pkt_prepare;
> > >          /** Check the status of a Tx descriptor. */
> > >          eth_tx_descriptor_status_t tx_descriptor_status;
> > > + /** Copy used mbufs from Tx mbuf ring into Rx. */
> > > + eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
> > >          /** Tx queues data. */
> > >          struct rte_ethdev_qdata txq;
> > > - uintptr_t reserved2[3];
> > > + uintptr_t reserved2[2];
> > >          /**@}*/
> > >
> > >  } __rte_cache_aligned;
> > > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > > 357d1a88c0..45c417f6bd 100644
> > > --- a/lib/ethdev/version.map
> > > +++ b/lib/ethdev/version.map
> > > @@ -299,6 +299,10 @@ EXPERIMENTAL {
> > >          rte_flow_action_handle_query_update;
> > >          rte_flow_async_action_handle_query_update;
> > >          rte_flow_async_create_by_index;
> > > +
> > > + # added in 23.07
> > > + rte_eth_recycle_mbufs;
> > > + rte_eth_recycle_rx_queue_info_get;
> > >  };
> > >
> > >  INTERNAL {
> > > --
> > > 2.25.1
> > >

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-06-06  8:34             ` Konstantin Ananyev
@ 2023-06-07  0:00               ` Ferruh Yigit
  2023-06-12  3:25                 ` Feifei Wang
  0 siblings, 1 reply; 67+ messages in thread
From: Ferruh Yigit @ 2023-06-07  0:00 UTC (permalink / raw)
  To: Konstantin Ananyev, Feifei Wang,
	Константин
	Ананьев,
	thomas, Andrew Rybchenko
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang

On 6/6/2023 9:34 AM, Konstantin Ananyev wrote:
> 
> 
>>
>> [...]
>>>> Probably I am missing something, but why it is not possible to do something
>>> like that:
>>>>
>>>> rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
>>>> tx_queue_id=M, ...); ....
>>>> rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
>>>> tx_queue_id=K, ...);
>>>>
>>>> I.E. feed rx queue from 2 tx queues?
>>>>
>>>> Two problems for this:
>>>> 1. If we have 2 tx queues for rx, the thread should make the extra
>>>> judgement to decide which one to choose in the driver layer.
>>>
>>> Not sure, why on the driver layer?
>>> The example I gave above - decision is made on application layer.
>>> Lets say first call didn't free enough mbufs, so app decided to use second txq
>>> for rearm.
>> [Feifei] I think currently mbuf recycle mode can support this usage. For examples:
>> n =  rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=M, ...);
>> if (n < planned_number)
>> rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N, tx_queue_id=K, ...);
>>
>> Thus, if users want, they can do like this.
> 
> Yes, that was my thought, that's why I was surprise that in the comments we have:
> " Currently, the rte_eth_recycle_mbufs() function can only support one-time pairing
> * between the receive queue and transmit queue. Do not pair one receive queue with
>  * multiple transmit queues or pair one transmit queue with multiple receive queues,
>  * in order to avoid memory error rewriting."
> 

I guess that is from previous versions of the set; it would be good to
revisit the limitations/restrictions with the latest version.


>>
>>>
>>>> On the other hand, current mechanism can support users to switch 1 txq
>>>> to another timely in the application layer. If user want to choose
>>>> another txq, he just need to change the txq_queue_id parameter in the API.
>>>> 2. If you want one rxq to support two txq at the same time, this needs
>>>> to add spinlock on guard variable to avoid multi-thread conflict.
>>>> Spinlock will decrease the data-path performance greatly.  Thus, we do
>>>> not consider
>>>> 1 rxq mapping multiple txqs here.
>>>
>>> I am talking about situation when one thread controls 2 tx queues.
>>>
>>>> + *
>>>> + * @param rx_port_id
>>>> + * Port identifying the receive side.
>>>> + * @param rx_queue_id
>>>> + * The index of the receive queue identifying the receive side.
>>>> + * The value must be in the range [0, nb_rx_queue - 1] previously
>>>> +supplied
>>>> + * to rte_eth_dev_configure().
>>>> + * @param tx_port_id
>>>> + * Port identifying the transmit side.
>>>> + * @param tx_queue_id
>>>> + * The index of the transmit queue identifying the transmit side.
>>>> + * The value must be in the range [0, nb_tx_queue - 1] previously
>>>> +supplied
>>>> + * to rte_eth_dev_configure().
>>>> + * @param recycle_rxq_info
>>>> + * A pointer to a structure of type *rte_eth_recycle_rxq_info* which
>>>> +contains
>>>> + * the information of the Rx queue mbuf ring.
>>>> + * @return
>>>> + * The number of recycling mbufs.
>>>> + */
>>>> +__rte_experimental
>>>> +static inline uint16_t
>>>> +rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
>>>> +uint16_t tx_port_id, uint16_t tx_queue_id,  struct
>>>> +rte_eth_recycle_rxq_info *recycle_rxq_info) {  struct rte_eth_fp_ops
>>>> +*p;  void *qd;  uint16_t nb_mbufs;
>>>> +
>>>> +#ifdef RTE_ETHDEV_DEBUG_TX
>>>> + if (tx_port_id >= RTE_MAX_ETHPORTS ||  tx_queue_id >=
>>>> +RTE_MAX_QUEUES_PER_PORT) {  RTE_ETHDEV_LOG(ERR,  "Invalid
>>>> +tx_port_id=%u or tx_queue_id=%u\n",  tx_port_id, tx_queue_id);
>>>> +return 0;  } #endif
>>>> +
>>>> + /* fetch pointer to queue data */
>>>> + p = &rte_eth_fp_ops[tx_port_id];
>>>> + qd = p->txq.data[tx_queue_id];
>>>> +
>>>> +#ifdef RTE_ETHDEV_DEBUG_TX
>>>> + RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
>>>> +
>>>> + if (qd == NULL) {
>>>> + RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
>>>> +tx_queue_id, tx_port_id);  return 0;  } #endif  if
>>>> +(p->recycle_tx_mbufs_reuse == NULL)  return 0;
>>>> +
>>>> + /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
>>>> + * into Rx mbuf ring.
>>>> + */
>>>> + nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
>>>> +
>>>> + /* If no recycling mbufs, return 0. */ if (nb_mbufs == 0) return 0;
>>>> +
>>>> +#ifdef RTE_ETHDEV_DEBUG_RX
>>>> + if (rx_port_id >= RTE_MAX_ETHPORTS ||  rx_queue_id >=
>>>> +RTE_MAX_QUEUES_PER_PORT) {  RTE_ETHDEV_LOG(ERR, "Invalid
>>>> +rx_port_id=%u or rx_queue_id=%u\n",  rx_port_id, rx_queue_id);
>>>> +return 0;  } #endif
>>>> +
>>>> + /* fetch pointer to queue data */
>>>> + p = &rte_eth_fp_ops[rx_port_id];
>>>> + qd = p->rxq.data[rx_queue_id];
>>>> +
>>>> +#ifdef RTE_ETHDEV_DEBUG_RX
>>>> + RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
>>>> +
>>>> + if (qd == NULL) {
>>>> + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
>>>> +rx_queue_id, rx_port_id);  return 0;  } #endif
>>>> +
>>>> + if (p->recycle_rx_descriptors_refill == NULL) return 0;
>>>> +
>>>> + /* Replenish the Rx descriptors with the recycling
>>>> + * into Rx mbuf ring.
>>>> + */
>>>> + p->recycle_rx_descriptors_refill(qd, nb_mbufs);
>>>> +
>>>> + return nb_mbufs;
>>>> +}
>>>> +
>>>>  /**
>>>>   * @warning
>>>>   * @b EXPERIMENTAL: this API may change without prior notice diff
>>>> --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
>>>> index dcf8adab92..a2e6ea6b6c 100644
>>>> --- a/lib/ethdev/rte_ethdev_core.h
>>>> +++ b/lib/ethdev/rte_ethdev_core.h
>>>> @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void
>>>> *rxq, uint16_t offset);
>>>>  /** @internal Check the status of a Tx descriptor */  typedef int
>>>> (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
>>>>
>>>> +/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
>>>> +typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,  struct
>>>> +rte_eth_recycle_rxq_info *recycle_rxq_info);
>>>> +
>>>> +/** @internal Refill Rx descriptors with the recycling mbufs */
>>>> +typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq,
>>>> +uint16_t nb);
>>>> +
>>>>  /**
>>>>   * @internal
>>>>   * Structure used to hold opaque pointers to internal ethdev Rx/Tx @@
>>>> -90,9 +97,11 @@ struct rte_eth_fp_ops {
>>>>          eth_rx_queue_count_t rx_queue_count;
>>>>          /** Check the status of a Rx descriptor. */
>>>>          eth_rx_descriptor_status_t rx_descriptor_status;
>>>> + /** Refill Rx descriptors with the recycling mbufs. */
>>>> + eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
>>>> I am afraid we can't put new fields here without ABI breakage.
>>>>
>>>> Agree
>>>>
>>>> It has to be below rxq.
>>>> Now thinking about current layout probably not the best one, and when
>>>> introducing this struct, I should probably put rxq either on the top
>>>> of the struct, or on the next cache line.
>>>> But such change is not possible right now anyway.
>>>> Same story for txq.
>>>>
>>>> Thus we should rearrange the structure like below:
>>>> struct rte_eth_fp_ops {
>>>>     struct rte_ethdev_qdata rxq;
>>>>          eth_rx_burst_t rx_pkt_burst;
>>>>          eth_rx_queue_count_t rx_queue_count;
>>>>          eth_rx_descriptor_status_t rx_descriptor_status;
>>>>        eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
>>>>               uintptr_t reserved1[2];
>>>> }
>>>
>>> Yes, I think such layout will be better.
>>> The only problem here - we have to wait for 23.11 for that.
>>>
>> Ok, if not this change, maybe we still need to wait. Because mbufs_recycle have other
>> ABI breakage. Such as the change for 'struct rte_eth_dev'.
> 
> Ok by me.
> 
>>>>
>>>>
>>>>          /** Rx queues data. */
>>>>          struct rte_ethdev_qdata rxq;
>>>> - uintptr_t reserved1[3];
>>>> + uintptr_t reserved1[2];
>>>>          /**@}*/
>>>>
>>>>          /**@{*/
>>>> @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
>>>>          eth_tx_prep_t tx_pkt_prepare;
>>>>          /** Check the status of a Tx descriptor. */
>>>>          eth_tx_descriptor_status_t tx_descriptor_status;
>>>> + /** Copy used mbufs from Tx mbuf ring into Rx. */
>>>> + eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
>>>>          /** Tx queues data. */
>>>>          struct rte_ethdev_qdata txq;
>>>> - uintptr_t reserved2[3];
>>>> + uintptr_t reserved2[2];
>>>>          /**@}*/
>>>>
>>>>  } __rte_cache_aligned;
>>>> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
>>>> 357d1a88c0..45c417f6bd 100644
>>>> --- a/lib/ethdev/version.map
>>>> +++ b/lib/ethdev/version.map
>>>> @@ -299,6 +299,10 @@ EXPERIMENTAL {
>>>>          rte_flow_action_handle_query_update;
>>>>          rte_flow_async_action_handle_query_update;
>>>>          rte_flow_async_create_by_index;
>>>> +
>>>> + # added in 23.07
>>>> + rte_eth_recycle_mbufs;
>>>> + rte_eth_recycle_rx_queue_info_get;
>>>>  };
>>>>
>>>>  INTERNAL {
>>>> --
>>>> 2.25.1
>>>>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 2/4] net/i40e: implement mbufs recycle mode
  2023-06-06  8:27             ` Konstantin Ananyev
@ 2023-06-12  3:05               ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-06-12  3:05 UTC (permalink / raw)
  To: Konstantin Ananyev,
	Константин
	Ананьев,
	Yuying Zhang, Beilei Xing
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd, nd, nd



> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Tuesday, June 6, 2023 4:27 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Константин Ананьев
> <konstantin.v.ananyev@yandex.ru>; Yuying Zhang
> <yuying.zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v6 2/4] net/i40e: implement mbufs recycle mode
> 
> 
> 
> > > > Define specific function implementation for i40e driver.
> > > > Currently, mbufs recycle mode can support 128bit vector path and
> > > > avx2 path. And can be enabled both in fast free and no fast free mode.
> > > >
> > > > Suggested-by: Honnappa Nagarahalli
> > > > <mailto:honnappa.nagarahalli@arm.com>
> > > > Signed-off-by: Feifei Wang <mailto:feifei.wang2@arm.com>
> > > > Reviewed-by: Ruifeng Wang <mailto:ruifeng.wang@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli
> > > > <mailto:honnappa.nagarahalli@arm.com>
> > > > ---
> > > >  drivers/net/i40e/i40e_ethdev.c | 1 +
> > > > drivers/net/i40e/i40e_ethdev.h | 2 +
> > > > .../net/i40e/i40e_recycle_mbufs_vec_common.c | 140
> > > ++++++++++++++++++
> > > > drivers/net/i40e/i40e_rxtx.c | 32 ++++
> > > > drivers/net/i40e/i40e_rxtx.h |
> > > > 4 +  drivers/net/i40e/meson.build | 2 +
> > > >  6 files changed, 181 insertions(+)  create mode 100644
> > > > drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > >
> > > > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > > > b/drivers/net/i40e/i40e_ethdev.c index f9d8f9791f..d4eecd16cf
> > > > 100644
> > > > --- a/drivers/net/i40e/i40e_ethdev.c
> > > > +++ b/drivers/net/i40e/i40e_ethdev.c
> > > > @@ -496,6 +496,7 @@ static const struct eth_dev_ops
> > > > i40e_eth_dev_ops =
> > > {
> > > >          .flow_ops_get = i40e_dev_flow_ops_get,
> > > >          .rxq_info_get = i40e_rxq_info_get,
> > > >          .txq_info_get = i40e_txq_info_get,
> > > > + .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
> > > >          .rx_burst_mode_get = i40e_rx_burst_mode_get,
> > > >          .tx_burst_mode_get = i40e_tx_burst_mode_get,
> > > >          .timesync_enable = i40e_timesync_enable, diff --git
> > > > a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
> > > > index 9b806d130e..b5b2d6cf2b 100644
> > > > --- a/drivers/net/i40e/i40e_ethdev.h
> > > > +++ b/drivers/net/i40e/i40e_ethdev.h
> > > > @@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev
> > > > *dev,
> > > uint16_t queue_id,
> > > >          struct rte_eth_rxq_info *qinfo);  void
> > > > i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > > >          struct rte_eth_txq_info *qinfo);
> > > > +void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t
> > > > +queue_id,  struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> > > >  int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t
> queue_id,
> > > >                             struct rte_eth_burst_mode *mode);  int
> > > > i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > > > diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > new file mode 100644
> > > > index 0000000000..08d708fd7d
> > > > --- /dev/null
> > > > +++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > @@ -0,0 +1,140 @@
> > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > + * Copyright (c) 2023 Arm Limited.
> > > > + */
> > > > +
> > > > +#include <stdint.h>
> > > > +#include <ethdev_driver.h>
> > > > +
> > > > +#include "base/i40e_prototype.h"
> > > > +#include "base/i40e_type.h"
> > > > +#include "i40e_ethdev.h"
> > > > +#include "i40e_rxtx.h"
> > > > +
> > > > +#pragma GCC diagnostic ignored "-Wcast-qual"
> > > > +
> > > > +void
> > > > +i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t
> > > > +nb_mbufs) {  struct i40e_rx_queue *rxq = rx_queue;  struct
> > > > +i40e_rx_entry *rxep;  volatile union i40e_rx_desc *rxdp;
> > > > +uint16_t rx_id;  uint64_t paddr;  uint64_t dma_addr;  uint16_t i;
> > > > +
> > > > + rxdp = rxq->rx_ring + rxq->rxrearm_start; rxep =
> > > > + &rxq->sw_ring[rxq->rxrearm_start];
> > > > +
> > > > + for (i = 0; i < nb_mbufs; i++) {
> > > > + /* Initialize rxdp descs. */
> > > > + paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
> > > dma_addr =
> > > > + rte_cpu_to_le_64(paddr);
> > > > + /* flush desc with pa dma_addr */ rxdp[i].read.hdr_addr = 0;
> > > > + rxdp[i].read.pkt_addr = dma_addr; }
> > > > +
> > > > + /* Update the descriptor initializer index */
> > > > + rxq->rxrearm_start += nb_mbufs;
> > > > + rx_id = rxq->rxrearm_start - 1;
> > > > +
> > > > + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> > > > + rxq->rxrearm_start = 0;
> > > > + rx_id = rxq->nb_rx_desc - 1;
> > > > + }
> > > > +
> > > > + rxq->rxrearm_nb -= nb_mbufs;
> > > > +
> > > > + rte_io_wmb();
> > > > + /* Update the tail pointer on the NIC */
> > > > +I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id); }
> > > > +
> > > > +uint16_t
> > > > +i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,  struct
> > > > +rte_eth_recycle_rxq_info *recycle_rxq_info) {  struct
> > > > +i40e_tx_queue *txq = tx_queue;  struct i40e_tx_entry *txep;
> > > > +struct rte_mbuf **rxep;  struct rte_mbuf
> > > > +*m[RTE_I40E_TX_MAX_FREE_BUF_SZ];  int i, j, n;  uint16_t avail =
> > > > +0;  uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
> > > > +uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;  uint16_t
> > > > +refill_requirement = recycle_rxq_info->refill_requirement;
> > > > + uint16_t refill_head = *recycle_rxq_info->refill_head;  uint16_t
> > > > +receive_tail = *recycle_rxq_info->receive_tail;
> > > > +
> > > > + /* Get available recycling Rx buffers. */ avail =
> > > > + (mbuf_ring_size - (refill_head - receive_tail)) & mask;
> > > > +
> > > > + /* Check Tx free thresh and Rx available space. */ if
> > > > + (txq->nb_tx_free > txq->tx_free_thresh || avail <=
> > > > + txq->tx_rs_thresh) return 0;
> > > > +
> > > > + /* check DD bits on threshold descriptor */ if
> > > > + ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> > > > + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> > > > + rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> > > > + return 0;
> > > > +
> > > > + n = txq->tx_rs_thresh;
> > > > +
> > > > + /* Mbufs recycle mode can only support no ring buffer wrapping
> around.
> > > > + * Two case for this:
> > > > + *
> > > > + * case 1: The refill head of Rx buffer ring needs to be aligned
> > > > + with
> > > > + * mbuf ring size. In this case, the number of Tx freeing buffers
> > > > + * should be equal to refill_requirement.
> > > > + *
> > > > + * case 2: The refill head of Rx ring buffer does not need to be
> > > > + aligned
> > > > + * with mbuf ring size. In this case, the update of refill head
> > > > + can not
> > > > + * exceed the Rx mbuf ring size.
> > > > + */
> > > > + if (refill_requirement != n ||
> > > > + (!refill_requirement && (refill_head + n > mbuf_ring_size)))
> > > > + return 0;
> > > > +
> > > > + /* First buffer to free from S/W ring is at index
> > > > + * tx_next_dd - (tx_rs_thresh-1).
> > > > + */
> > > > + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)]; rxep =
> > > > + recycle_rxq_info->mbuf_ring; rxep += refill_head;
> > > > +
> > > > + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > > > + /* Directly put mbufs from Tx to Rx. */ for (i = 0; i < n; i++,
> > > > + rxep++, txep++) *rxep = txep[0].mbuf; } else { for (i = 0, j = 0;
> > > > + i < n; i++) {
> > > > + /* Avoid txq contains buffers from expected mempool. */ if
> > > > + (unlikely(recycle_rxq_info->mp != txep[i].mbuf->pool)) return 0;
> > > > I don't think that it is possible to simply return 0 here:
> > > > we might already have some mbufs inside rxep[], so we probably
> > > > need to return that number (j).
> > > >
> > > > No, this is just a pre-free; the mbufs are not actually put into rxep yet.
> > >
> > > I understand that.
> > > What I am saying: after you call pktmbuf_prefree_seg(mbuf), you
> > > can’t keep it in the txq anymore.
> > > You have either to put it into rxep[], or back into mempool.
> > > Also txq state (nb_tx_free, etc.) need to be updated.
> > >
> > > > After run out of the loop, we call rte_memcpy to actually copy
> > > > mbufs into rxep.
> > > > +
> > > > + m[j] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > > > +
> > > > + /* In case 1, each of Tx buffers should be the
> > > > + * last reference.
> > > > + */
> > > > + if (unlikely(m[j] == NULL && refill_requirement)) return 0;
> > > >
> > > > same here, we can't simply return 0, it will introduce mbuf leakage.
> > > > + /* In case 2, the number of valid Tx free
> > > > + * buffers should be recorded.
> > > > + */
> > > > + j++;
> > > > + }
> > > > + rte_memcpy(rxep, m, sizeof(void *) * j);
> > > > Wonder why do you need intermediate buffer for released mbufs?
> > > > Why can't just:
> > > > ...
> > > > m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > > > ...
> > > > rxep[j++] = m;
> > > > ?
> > > > Might save you few extra cycles.
> > > > Sometimes ‘rte_pktmbuf_prefree_seg’ can return NULL due to
> > > > mbuf->refcnt > 1. So we should firstly ensure all ‘m’ are valid
> > > > and then copy them into rxep.
> > >
> > > I understand that, but you can check is it NULL or not.
> > For i40e rxq, it must rearm ' RTE_I40E_RXQ_REARM_THRESH ' pkts once a
> > time based on its ring wrapping mechanism.
> >
> > For i40e txq, it must free ' txq->tx_rs_thresh' pkts once a time.
> >
> > So we need firstly ensure all tx free mbufs are valid, and then copy these into
> rxq.
> > If not enough valid mbufs, it will break rxq's ring wrapping mechanism.
> 
> I think you can still copy mbufs into rxep[], if there are not enough mbufs, you
> can still return 0 (or whatever is a proper value here), and that would mean all
> these new rxep[] entries will be considered as invalid.
> Anyway that's just a suggestion to avoid extra copy.

If I understand correctly, you mean we can first copy the mbufs into rxep.
If there are invalid buffers, the previously copied buffers are also considered invalid.

Thus, this can save CPU cycles in most of the correct cases, and only in a few cases
do we need to give up the already copied rxep[] buffers.

That's a good comment, I agree that we can do it like this.
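
For reference, the rte_eth_recycle_mbufs() flow quoted earlier already makes this safe:
when the Tx reuse callback returns 0, the Rx descriptor refill step is skipped, so the
rxep[] entries written before giving up are never committed to the hardware (trimmed
snippet, just to show the relevant part of the flow):

	nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);

	/* If no mbufs were recycled, the refill below is never reached, so
	 * the partially written rxep[] entries are not exposed to the NIC
	 * and will simply be overwritten by the next successful call.
	 */
	if (nb_mbufs == 0)
		return 0;

	...

	p->recycle_rx_descriptors_refill(qd, nb_mbufs);
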
> 
> >
> > >
> > > > + }
> > > > +
> > > > + /* Update counters for Tx. */
> > > > + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> > > > + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> > > > + if (txq->tx_next_dd >= txq->nb_tx_desc)
> > > > + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> > > > +
> > > > + return n;
> > > > +}

^ permalink raw reply	[flat|nested] 67+ messages in thread

* RE: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
  2023-06-07  0:00               ` Ferruh Yigit
@ 2023-06-12  3:25                 ` Feifei Wang
  0 siblings, 0 replies; 67+ messages in thread
From: Feifei Wang @ 2023-06-12  3:25 UTC (permalink / raw)
  To: Ferruh Yigit, Konstantin Ananyev,
	Константин
	Ананьев,
	thomas, Andrew Rybchenko
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd



> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Wednesday, June 7, 2023 8:01 AM
> To: Konstantin Ananyev <konstantin.ananyev@huawei.com>; Feifei Wang
> <Feifei.Wang2@arm.com>; Константин Ананьев
> <konstantin.v.ananyev@yandex.ru>; thomas@monjalon.net; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v6 1/4] ethdev: add API for mbufs recycle mode
> 
> On 6/6/2023 9:34 AM, Konstantin Ananyev wrote:
> >
> >
> >>
> >> [...]
> >>>> Probably I am missing something, but why it is not possible to do
> >>>> something
> >>> like that:
> >>>>
> >>>> rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
> >>>> tx_queue_id=M, ...); ....
> >>>> rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
> >>>> tx_queue_id=K, ...);
> >>>>
> >>>> I.E. feed rx queue from 2 tx queues?
> >>>>
> >>>> Two problems for this:
> >>>> 1. If we have 2 tx queues for rx, the thread should make the extra
> >>>> judgement to decide which one to choose in the driver layer.
> >>>
> >>> Not sure, why on the driver layer?
> >>> The example I gave above - decision is made on application layer.
> >>> Lets say first call didn't free enough mbufs, so app decided to use
> >>> second txq for rearm.
> >> [Feifei] I think currently mbuf recycle mode can support this usage. For
> examples:
> >> n =  rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
> >> tx_queue_id=M, ...); if (n < planned_number)
> >> rte_eth_recycle_mbufs(rx_port_id=X, rx_queue_id=Y, tx_port_id=N,
> >> tx_queue_id=K, ...);
> >>
> >> Thus, if users want, they can do like this.
> >
> > Yes, that was my thought, that's why I was surprise that in the comments we
> have:
> > " Currently, the rte_eth_recycle_mbufs() function can only support
> > one-time pairing
> > * between the receive queue and transmit queue. Do not pair one
> > receive queue with
> >  * multiple transmit queues or pair one transmit queue with multiple
> > receive queues,
> >  * in order to avoid memory error rewriting."
> >
> 
> I guess that is from previous versions of the set, it can be good to address
> limitations/restrictions again with latest version.

[Feifei] Sorry, I think this is due to my ambiguous wording in the function description.
I wanted to express that 'mbufs_recycle' cannot support multiple threads.

I will change the description and add extra wording to tell users that they
can switch the configuration from one txq to another within a single thread
(see the sketch below).
Thanks for the comments.
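
Something like this in the application (single thread; the names are just placeholders):

	/* Feed Rx queue (rx_port, rx_queue) first from Tx queue M, and fall
	 * back to Tx queue K if not enough mbufs were recycled. No locking
	 * is needed because only this thread touches the involved queues.
	 */
	nb = rte_eth_recycle_mbufs(rx_port, rx_queue,
			tx_port, tx_queue_m, &recycle_rxq_info);
	if (nb < nb_planned)
		rte_eth_recycle_mbufs(rx_port, rx_queue,
				tx_port, tx_queue_k, &recycle_rxq_info);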

> 
> 
> >>
> >>>
> >>>> On the other hand, current mechanism can support users to switch 1
> >>>> txq to another timely in the application layer. If user want to
> >>>> choose another txq, he just need to change the txq_queue_id parameter
> in the API.
> >>>> 2. If you want one rxq to support two txq at the same time, this
> >>>> needs to add spinlock on guard variable to avoid multi-thread conflict.
> >>>> Spinlock will decrease the data-path performance greatly.  Thus, we
> >>>> do not consider
> >>>> 1 rxq mapping multiple txqs here.
> >>>
> >>> I am talking about situation when one thread controls 2 tx queues.
> >>>
> >>>> + *
> >>>> + * @param rx_port_id
> >>>> + * Port identifying the receive side.
> >>>> + * @param rx_queue_id
> >>>> + * The index of the receive queue identifying the receive side.
> >>>> + * The value must be in the range [0, nb_rx_queue - 1] previously
> >>>> +supplied
> >>>> + * to rte_eth_dev_configure().
> >>>> + * @param tx_port_id
> >>>> + * Port identifying the transmit side.
> >>>> + * @param tx_queue_id
> >>>> + * The index of the transmit queue identifying the transmit side.
> >>>> + * The value must be in the range [0, nb_tx_queue - 1] previously
> >>>> +supplied
> >>>> + * to rte_eth_dev_configure().
> >>>> + * @param recycle_rxq_info
> >>>> + * A pointer to a structure of type *rte_eth_recycle_rxq_info*
> >>>> +which contains
> >>>> + * the information of the Rx queue mbuf ring.
> >>>> + * @return
> >>>> + * The number of recycling mbufs.
> >>>> + */
> >>>> +__rte_experimental
> >>>> +static inline uint16_t
> >>>> +rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
> >>>> +uint16_t tx_port_id, uint16_t tx_queue_id,  struct
> >>>> +rte_eth_recycle_rxq_info *recycle_rxq_info) {  struct
> >>>> +rte_eth_fp_ops *p;  void *qd;  uint16_t nb_mbufs;
> >>>> +
> >>>> +#ifdef RTE_ETHDEV_DEBUG_TX
> >>>> + if (tx_port_id >= RTE_MAX_ETHPORTS ||  tx_queue_id >=
> >>>> +RTE_MAX_QUEUES_PER_PORT) {  RTE_ETHDEV_LOG(ERR,  "Invalid
> >>>> +tx_port_id=%u or tx_queue_id=%u\n",  tx_port_id, tx_queue_id);
> >>>> +return 0;  } #endif
> >>>> +
> >>>> + /* fetch pointer to queue data */ p =
> >>>> + &rte_eth_fp_ops[tx_port_id]; qd = p->txq.data[tx_queue_id];
> >>>> +
> >>>> +#ifdef RTE_ETHDEV_DEBUG_TX
> >>>> + RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
> >>>> +
> >>>> + if (qd == NULL) {
> >>>> + RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
> >>>> +tx_queue_id, tx_port_id);  return 0;  } #endif  if
> >>>> +(p->recycle_tx_mbufs_reuse == NULL)  return 0;
> >>>> +
> >>>> + /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
> >>>> + * into Rx mbuf ring.
> >>>> + */
> >>>> + nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
> >>>> +
> >>>> + /* If no recycling mbufs, return 0. */ if (nb_mbufs == 0) return
> >>>> + 0;
> >>>> +
> >>>> +#ifdef RTE_ETHDEV_DEBUG_RX
> >>>> + if (rx_port_id >= RTE_MAX_ETHPORTS ||  rx_queue_id >=
> >>>> +RTE_MAX_QUEUES_PER_PORT) {  RTE_ETHDEV_LOG(ERR, "Invalid
> >>>> +rx_port_id=%u or rx_queue_id=%u\n",  rx_port_id, rx_queue_id);
> >>>> +return 0;  } #endif
> >>>> +
> >>>> + /* fetch pointer to queue data */ p =
> >>>> + &rte_eth_fp_ops[rx_port_id]; qd = p->rxq.data[rx_queue_id];
> >>>> +
> >>>> +#ifdef RTE_ETHDEV_DEBUG_RX
> >>>> + RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
> >>>> +
> >>>> + if (qd == NULL) {
> >>>> + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
> >>>> +rx_queue_id, rx_port_id);  return 0;  } #endif
> >>>> +
> >>>> + if (p->recycle_rx_descriptors_refill == NULL) return 0;
> >>>> +
> >>>> + /* Replenish the Rx descriptors with the recycling
> >>>> + * into Rx mbuf ring.
> >>>> + */
> >>>> + p->recycle_rx_descriptors_refill(qd, nb_mbufs);
> >>>> +
> >>>> + return nb_mbufs;
> >>>> +}
> >>>> +
> >>>>  /**
> >>>>   * @warning
> >>>>   * @b EXPERIMENTAL: this API may change without prior notice diff
> >>>> --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> >>>> index dcf8adab92..a2e6ea6b6c 100644
> >>>> --- a/lib/ethdev/rte_ethdev_core.h
> >>>> +++ b/lib/ethdev/rte_ethdev_core.h
> >>>> @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void
> >>>> *rxq, uint16_t offset);
> >>>>  /** @internal Check the status of a Tx descriptor */  typedef int
> >>>> (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
> >>>>
> >>>> +/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring
> >>>> +*/ typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,
> >>>> +struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> >>>> +
> >>>> +/** @internal Refill Rx descriptors with the recycling mbufs */
> >>>> +typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq,
> >>>> +uint16_t nb);
> >>>> +
> >>>>  /**
> >>>>   * @internal
> >>>>   * Structure used to hold opaque pointers to internal ethdev Rx/Tx
> >>>> @@
> >>>> -90,9 +97,11 @@ struct rte_eth_fp_ops {
> >>>>          eth_rx_queue_count_t rx_queue_count;
> >>>>          /** Check the status of a Rx descriptor. */
> >>>>          eth_rx_descriptor_status_t rx_descriptor_status;
> >>>> + /** Refill Rx descriptors with the recycling mbufs. */
> >>>> + eth_recycle_rx_descriptors_refill_t
> >>>> + recycle_rx_descriptors_refill;
> >>>> I am afraid we can't put new fields here without ABI breakage.
> >>>>
> >>>> Agree
> >>>>
> >>>> It has to be below rxq.
> >>>> Now thinking about current layout probably not the best one, and
> >>>> when introducing this struct, I should probably put rxq either on
> >>>> the top of the struct, or on the next cache line.
> >>>> But such change is not possible right now anyway.
> >>>> Same story for txq.
> >>>>
> >>>> Thus we should rearrange the structure like below:
> >>>> struct rte_eth_fp_ops {
> >>>>     struct rte_ethdev_qdata rxq;
> >>>>          eth_rx_burst_t rx_pkt_burst;
> >>>>          eth_rx_queue_count_t rx_queue_count;
> >>>>          eth_rx_descriptor_status_t rx_descriptor_status;
> >>>>        eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
> >>>>               uintptr_t reserved1[2]; }
> >>>
> >>> Yes, I think such layout will be better.
> >>> The only problem here - we have to wait for 23.11 for that.
> >>>
> >> Ok, if not this change, maybe we still need to wait. Because
> >> mbufs_recycle have other ABI breakage. Such as the change for 'struct
> rte_eth_dev'.
> >
> > Ok by me.
> >
> >>>>
> >>>>
> >>>>          /** Rx queues data. */
> >>>>          struct rte_ethdev_qdata rxq;
> >>>> - uintptr_t reserved1[3];
> >>>> + uintptr_t reserved1[2];
> >>>>          /**@}*/
> >>>>
> >>>>          /**@{*/
> >>>> @@ -106,9 +115,11 @@ struct rte_eth_fp_ops {
> >>>>          eth_tx_prep_t tx_pkt_prepare;
> >>>>          /** Check the status of a Tx descriptor. */
> >>>>          eth_tx_descriptor_status_t tx_descriptor_status;
> >>>> + /** Copy used mbufs from Tx mbuf ring into Rx. */
> >>>> + eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
> >>>>          /** Tx queues data. */
> >>>>          struct rte_ethdev_qdata txq;
> >>>> - uintptr_t reserved2[3];
> >>>> + uintptr_t reserved2[2];
> >>>>          /**@}*/
> >>>>
> >>>>  } __rte_cache_aligned;
> >>>> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> >>>> 357d1a88c0..45c417f6bd 100644
> >>>> --- a/lib/ethdev/version.map
> >>>> +++ b/lib/ethdev/version.map
> >>>> @@ -299,6 +299,10 @@ EXPERIMENTAL {
> >>>>          rte_flow_action_handle_query_update;
> >>>>          rte_flow_async_action_handle_query_update;
> >>>>          rte_flow_async_create_by_index;
> >>>> +
> >>>> + # added in 23.07
> >>>> + rte_eth_recycle_mbufs;
> >>>> + rte_eth_recycle_rx_queue_info_get;
> >>>>  };
> >>>>
> >>>>  INTERNAL {
> >>>> --
> >>>> 2.25.1
> >>>>


^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2023-06-12  3:25 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-24 16:46 [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Feifei Wang
2021-12-24 16:46 ` [RFC PATCH v1 1/4] net/i40e: enable direct re-arm mode Feifei Wang
2021-12-24 16:46 ` [RFC PATCH v1 2/4] ethdev: add API for " Feifei Wang
2021-12-24 19:38   ` Stephen Hemminger
2021-12-26  9:49     ` 回复: " Feifei Wang
2021-12-26 10:31       ` Morten Brørup
2021-12-24 16:46 ` [RFC PATCH v1 3/4] net/i40e: add direct re-arm mode internal API Feifei Wang
2021-12-24 16:46 ` [RFC PATCH v1 4/4] examples/l3fwd: give an example for direct rearm mode Feifei Wang
2021-12-26 10:25 ` [RFC PATCH v1 0/4] Direct re-arming of buffers on receive side Morten Brørup
2021-12-28  6:55   ` 回复: " Feifei Wang
2022-01-18 15:51     ` Ferruh Yigit
2022-01-18 16:53       ` Thomas Monjalon
2022-01-18 17:27         ` Morten Brørup
2022-01-27  5:24           ` Honnappa Nagarahalli
2022-01-27 16:45             ` Ananyev, Konstantin
2022-02-02 19:46               ` Honnappa Nagarahalli
2022-01-27  5:16         ` Honnappa Nagarahalli
2023-02-28  6:43       ` 回复: " Feifei Wang
2023-02-28  6:52         ` Feifei Wang
2022-01-27  4:06   ` Honnappa Nagarahalli
2022-01-27 17:13     ` Morten Brørup
2022-01-28 11:29     ` Morten Brørup
2023-03-23 10:43 ` [PATCH v4 0/3] Recycle buffers from Tx to Rx Feifei Wang
2023-03-23 10:43   ` [PATCH v4 1/3] ethdev: add API for buffer recycle mode Feifei Wang
2023-03-23 11:41     ` Morten Brørup
2023-03-29  2:16       ` Feifei Wang
2023-03-23 10:43   ` [PATCH v4 2/3] net/i40e: implement recycle buffer mode Feifei Wang
2023-03-23 10:43   ` [PATCH v4 3/3] net/ixgbe: " Feifei Wang
2023-03-30  6:29 ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Feifei Wang
2023-03-30  6:29   ` [PATCH v5 1/3] ethdev: add API for buffer recycle mode Feifei Wang
2023-03-30  7:19     ` Morten Brørup
2023-03-30  9:31       ` Feifei Wang
2023-03-30 15:15         ` Morten Brørup
2023-03-30 15:58         ` Morten Brørup
2023-04-26  6:59           ` Feifei Wang
2023-04-19 14:46     ` Ferruh Yigit
2023-04-26  7:29       ` Feifei Wang
2023-03-30  6:29   ` [PATCH v5 2/3] net/i40e: implement recycle buffer mode Feifei Wang
2023-03-30  6:29   ` [PATCH v5 3/3] net/ixgbe: " Feifei Wang
2023-04-19 14:46     ` Ferruh Yigit
2023-04-26  7:36       ` Feifei Wang
2023-03-30 15:04   ` [PATCH v5 0/3] Recycle buffers from Tx to Rx Stephen Hemminger
2023-04-03  2:48     ` Feifei Wang
2023-04-19 14:56   ` Ferruh Yigit
2023-04-25  7:57     ` Feifei Wang
2023-05-25  9:45 ` [PATCH v6 0/4] Recycle mbufs from Tx queue to Rx queue Feifei Wang
2023-05-25  9:45   ` [PATCH v6 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
2023-05-25 15:08     ` Morten Brørup
2023-05-31  6:10       ` Feifei Wang
2023-06-05 12:53     ` Константин Ананьев
2023-06-06  2:55       ` Feifei Wang
2023-06-06  7:10         ` Konstantin Ananyev
2023-06-06  7:31           ` Feifei Wang
2023-06-06  8:34             ` Konstantin Ananyev
2023-06-07  0:00               ` Ferruh Yigit
2023-06-12  3:25                 ` Feifei Wang
2023-05-25  9:45   ` [PATCH v6 2/4] net/i40e: implement " Feifei Wang
2023-06-05 13:02     ` Константин Ананьев
2023-06-06  3:16       ` Feifei Wang
2023-06-06  7:18         ` Konstantin Ananyev
2023-06-06  7:58           ` Feifei Wang
2023-06-06  8:27             ` Konstantin Ananyev
2023-06-12  3:05               ` Feifei Wang
2023-05-25  9:45   ` [PATCH v6 3/4] net/ixgbe: " Feifei Wang
2023-05-25  9:45   ` [PATCH v6 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
2023-06-05 13:08     ` Константин Ананьев
2023-06-06  6:32       ` Feifei Wang
