DPDK patches and discussions
* [PATCH v1 0/5] Direct re-arming of buffers on receive side
@ 2022-04-20  8:16 Feifei Wang
  2022-04-20  8:16 ` [PATCH v1 1/5] net/i40e: remove redundant Dtype initialization Feifei Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Feifei Wang @ 2022-04-20  8:16 UTC (permalink / raw)
  Cc: dev, nd, Feifei Wang

Currently, the transmit side frees buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers, resulting in 32*8=256B of stores to
the lcore cache. The receive side allocates 32 buffers and stores them
in the receive side software ring, resulting in 32*8=256B of stores
and 256B of loads from the lcore cache.

This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache, i.e. the receive side frees the buffers from the
transmit side directly into its software ring. This avoids the 256B of
loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache.

However, this solution poses several constraints:

1)The receive queue needs to know which transmit queue it should take
the buffers from. The application logic decides which transmit port to
use to send out the packets. In many use cases the NIC might have a
single port ([1], [2], [3]), in which case a given transmit queue is
always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
is easy to configure.

If the NIC has 2 ports (several such NICs exist), then we will have a
1:2 (RX queue: TX queue) mapping which is still easy to configure.
However, if this is generalized to 'N' ports, the configuration can
become lengthy. Moreover, the PMD would have to scan a list of
transmit queues to pull the buffers from.

2)The other factor that needs to be considered is 'run-to-completion' vs
'pipeline' models. In the run-to-completion model, the receive side and
the transmit side run serially on the same lcore. In the pipeline model,
the receive side and the transmit side might run in parallel on
different lcores. This would require locking, which is not supported at
this point.

3)Tx and Rx buffers must come from the same mempool, and the number of
buffers freed on the Tx side must equal the number re-armed on the Rx
side:
(txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
This ensures 'tx_next_dd' can be updated correctly in direct-rearm
mode, since tx_next_dd is the variable used to compute the Tx sw-ring
free location; its value is one batch (tx_rs_thresh) ahead of the
position where the next free starts. A queue-setup sketch that
satisfies this constraint is shown below.
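
The following is a minimal queue-setup sketch (not part of this series)
showing one way to satisfy constraint 3. The helper name and the
hard-coded value 32 (the default RTE_I40E_RXQ_REARM_THRESH in the i40e
vector path) are assumptions; a real application would normally start
from the defaults reported by rte_eth_dev_info_get().

#include <rte_ethdev.h>
#include <rte_mempool.h>

/* Example only: configure a Tx/Rx queue pair so that the Tx free batch
 * size matches the Rx re-arm threshold and both sides use the same
 * mempool. Fields left at zero fall back to the driver defaults.
 */
static int
setup_queues_for_direct_rearm(uint16_t port_id, uint16_t queue_id,
		uint16_t nb_desc, unsigned int socket_id,
		struct rte_mempool *mp)
{
	struct rte_eth_txconf txconf = {0};
	int ret;

	/* Tx frees exactly as many mbufs per batch as Rx re-arms (32) */
	txconf.tx_rs_thresh = 32;
	/* Optionally request RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE (where
	 * supported) for the cheapest free path: single mempool and
	 * mbuf refcnt == 1.
	 */

	ret = rte_eth_tx_queue_setup(port_id, queue_id, nb_desc,
			socket_id, &txconf);
	if (ret < 0)
		return ret;

	/* Rx buffers come from the same mempool that Tx frees into */
	return rte_eth_rx_queue_setup(port_id, queue_id, nb_desc,
			socket_id, NULL, mp);
}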

Current status in this RFC:
1)An API is added to allow for mapping a TX queue to an RX queue.
  Currently it supports 1:1 mapping (see the usage sketch below).
2)The i40e driver is changed to do the direct re-arm of the receive
  side.
3)The L3fwd application is modified to do the direct rearm mapping
automatically without user configuration. Each thread maps a TX queue
to an RX queue based on the destination port of the first received
packet.
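
As an illustration, below is a minimal sketch of how an application
could set up the mapping with the API added in patch 3/5. The port and
queue numbers are placeholders and error handling is omitted; this code
is not part of the series itself.

#include <rte_ethdev.h>

/* Example only: map Rx queue 0 of port 0 so that it re-arms with
 * buffers freed by Tx queue 0 of port 1. The call both enables direct
 * re-arm on the Rx queue and records which Tx port/queue to take the
 * buffers from.
 */
static void
map_direct_rearm_example(void)
{
	uint16_t rx_port = 0, rx_queue = 0;
	uint16_t tx_port = 1, tx_queue = 0;

	rte_eth_direct_rxrearm_map(rx_port, rx_queue, tx_port, tx_queue);
}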

Testing status:
1.The testing results for L3fwd are as follows:
-------------------------------------------------------------------
enabled direct rearm
-------------------------------------------------------------------
Arm:
N1SDP(neon path):
without fast-free mode		with fast-free mode
	+14.1%				+7.0%

Ampere Altra(neon path):
without fast-free mode		with fast-free mode
	+17.1%				+14.0%

X86:
Dell-8268(limit frequency):
sse path:
without fast-free mode		with fast-free mode
	+6.96%				+2.02%
avx2 path:
without fast-free mode		with fast-free mode
	+9.04%				+7.75%
avx512 path:
without fast-free mode		with fast-free mode
	+5.43%				+1.57%
-------------------------------------------------------------------
This patch does not affect the base performance of the normal mode.
Furthermore, the reason for limiting the CPU frequency is that the
Dell-8268 hits an i40e NIC bottleneck at maximum frequency.

2.The testing results for VPP-L3fwd are as follows:
-------------------------------------------------------------------
Arm:
N1SDP(neon path):
with direct re-arm mode enabled
	+7.0%
-------------------------------------------------------------------
For Ampere Altra and X86, the VPP-L3fwd test has not been done.

Reference:
[1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
[2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
[3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g

Feifei Wang (5):
  net/i40e: remove redundant Dtype initialization
  net/i40e: enable direct rearm mode
  ethdev: add API for direct rearm mode
  net/i40e: add direct rearm mode internal API
  examples/l3fwd: enable direct rearm mode

 drivers/net/i40e/i40e_ethdev.c          |  34 +++
 drivers/net/i40e/i40e_rxtx.c            |   4 -
 drivers/net/i40e/i40e_rxtx.h            |   4 +
 drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx_vec_avx2.c   |  14 +-
 drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_neon.c   | 141 ++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_sse.c    | 170 ++++++++++++++-
 examples/l3fwd/l3fwd_lpm.c              |  16 +-
 lib/ethdev/ethdev_driver.h              |  15 ++
 lib/ethdev/rte_ethdev.c                 |  14 ++
 lib/ethdev/rte_ethdev.h                 |  31 +++
 lib/ethdev/version.map                  |   1 +
 13 files changed, 949 insertions(+), 13 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v1 1/5] net/i40e: remove redundant Dtype initialization
  2022-04-20  8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
@ 2022-04-20  8:16 ` Feifei Wang
  2022-04-20  8:16 ` [PATCH v1 2/5] net/i40e: enable direct rearm mode Feifei Wang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Feifei Wang @ 2022-04-20  8:16 UTC (permalink / raw)
  To: Beilei Xing; +Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

The Dtype field is set to 0xf by the NIC to indicate DMA completion,
but only after the CPU requests to be informed by setting the RS bit.
Hence, it is not required to set Dtype to 0xf during initialization.

Not setting the Dtype field to 0xf makes it possible to tell that a
given descriptor has not yet been handed to the NIC after
initialization.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/i40e/i40e_rxtx.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 25a28ecea2..745734d5e4 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -2767,10 +2767,6 @@ i40e_reset_tx_queue(struct i40e_tx_queue *txq)
 
 	prev = (uint16_t)(txq->nb_tx_desc - 1);
 	for (i = 0; i < txq->nb_tx_desc; i++) {
-		volatile struct i40e_tx_desc *txd = &txq->tx_ring[i];
-
-		txd->cmd_type_offset_bsz =
-			rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE);
 		txe[i].mbuf =  NULL;
 		txe[i].last_id = i;
 		txe[prev].next_id = i;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v1 2/5] net/i40e: enable direct rearm mode
  2022-04-20  8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
  2022-04-20  8:16 ` [PATCH v1 1/5] net/i40e: remove redundant Dtype initialization Feifei Wang
@ 2022-04-20  8:16 ` Feifei Wang
  2022-05-11 22:28   ` Konstantin Ananyev
  2022-04-20  8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Feifei Wang @ 2022-04-20  8:16 UTC (permalink / raw)
  To: Beilei Xing, Bruce Richardson, Konstantin Ananyev, Ruifeng Wang
  Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli

For the i40e driver, enable direct re-arm mode. This patch supports the
case where the mapped Rx and Tx queues are polled from the same single
lcore.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/i40e/i40e_rxtx.h            |   4 +
 drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx_vec_avx2.c   |  14 +-
 drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_neon.c   | 141 ++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_sse.c    | 170 ++++++++++++++-
 6 files changed, 839 insertions(+), 8 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 5e6eecc501..1fdf4305f4 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -102,6 +102,8 @@ struct i40e_rx_queue {
 
 	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
 	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
+	uint16_t direct_rxrearm_port; /**< device TX port ID for direct re-arm mode */
+	uint16_t direct_rxrearm_queue; /**< TX queue index for direct re-arm mode */
 	uint64_t mbuf_initializer; /**< value to init mbufs */
 
 	uint16_t port_id; /**< device port ID */
@@ -121,6 +123,8 @@ struct i40e_rx_queue {
 	uint16_t rx_using_sse; /**<flag indicate the usage of vPMD for rx */
 	uint8_t dcb_tc;         /**< Traffic class of rx queue */
 	uint64_t offloads; /**< Rx offload flags of RTE_ETH_RX_OFFLOAD_* */
+	/**<  0 if direct re-arm mode disabled, 1 when enabled */
+	bool direct_rxrearm_enable;
 	const struct rte_memzone *mz;
 };
 
diff --git a/drivers/net/i40e/i40e_rxtx_common_avx.h b/drivers/net/i40e/i40e_rxtx_common_avx.h
index cfc1e63173..a742723e07 100644
--- a/drivers/net/i40e/i40e_rxtx_common_avx.h
+++ b/drivers/net/i40e/i40e_rxtx_common_avx.h
@@ -209,6 +209,275 @@ i40e_rxq_rearm_common(struct i40e_rx_queue *rxq, __rte_unused bool avx512)
 	/* Update the tail pointer on the NIC */
 	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
 }
+
+static __rte_always_inline void
+i40e_rxq_direct_rearm_common(struct i40e_rx_queue *rxq, __rte_unused bool avx512)
+{
+	struct rte_eth_dev *dev;
+	struct i40e_tx_queue *txq;
+	volatile union i40e_rx_desc *rxdp;
+	struct i40e_tx_entry *txep;
+	struct i40e_rx_entry *rxep;
+	struct rte_mbuf *m[RTE_I40E_RXQ_REARM_THRESH];
+	uint16_t tx_port_id, tx_queue_id;
+	uint16_t rx_id;
+	uint16_t i, n;
+	uint16_t nb_rearm = 0;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	tx_port_id = rxq->direct_rxrearm_port;
+	tx_queue_id = rxq->direct_rxrearm_queue;
+	dev = &rte_eth_devices[tx_port_id];
+	txq = dev->data->tx_queues[tx_queue_id];
+
+	/* check Rx queue is able to take in the whole
+	 * batch of free mbufs from Tx queue
+	 */
+	if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
+		/* check DD bits on threshold descriptor */
+		if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+				rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
+			goto mempool_bulk;
+		}
+
+		if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
+			goto mempool_bulk;
+
+		n = txq->tx_rs_thresh;
+
+		/* first buffer to free from S/W ring is at index
+		 * tx_next_dd - (tx_rs_thresh-1)
+		 */
+		txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+
+		if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+			/* directly put mbufs from Tx to Rx,
+			 * and initialize the mbufs in vector
+			 */
+			for (i = 0; i < n; i++)
+				rxep[i].mbuf = txep[i].mbuf;
+		} else {
+			for (i = 0; i < n; i++) {
+				m[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+				/* ensure each Tx freed buffer is valid */
+				if (m[i] != NULL)
+					nb_rearm++;
+			}
+
+			if (nb_rearm != n) {
+				txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+				txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+				if (txq->tx_next_dd >= txq->nb_tx_desc)
+					txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+				goto mempool_bulk;
+			} else {
+				for (i = 0; i < n; i++)
+					rxep[i].mbuf = m[i];
+			}
+		}
+
+		/* update counters for Tx */
+		txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+		txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+		if (txq->tx_next_dd >= txq->nb_tx_desc)
+			txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+	} else {
+mempool_bulk:
+		/* if TX did not free bufs into Rx sw-ring,
+		 * get new bufs from mempool
+		 */
+		n = RTE_I40E_RXQ_REARM_THRESH;
+
+		/* Pull 'n' more MBUFs into the software ring */
+		if (rte_mempool_get_bulk(rxq->mp,
+					(void *)rxep,
+					RTE_I40E_RXQ_REARM_THRESH) < 0) {
+			if (rxq->rxrearm_nb + RTE_I40E_RXQ_REARM_THRESH >=
+				rxq->nb_rx_desc) {
+				__m128i dma_addr0;
+				dma_addr0 = _mm_setzero_si128();
+				for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
+					rxep[i].mbuf = &rxq->fake_mbuf;
+					_mm_store_si128((__m128i *)&rxdp[i].read,
+							dma_addr0);
+				}
+			}
+			rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+				RTE_I40E_RXQ_REARM_THRESH;
+			return;
+		}
+	}
+
+#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
+	struct rte_mbuf *mb0, *mb1;
+	__m128i dma_addr0, dma_addr1;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+	for (i = 0; i < n; i += 2, rxep += 2) {
+		__m128i vaddr0, vaddr1;
+
+		mb0 = rxep[0].mbuf;
+		mb1 = rxep[1].mbuf;
+
+		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+				offsetof(struct rte_mbuf, buf_addr) + 8);
+		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+		/* convert pa to dma_addr hdr/data */
+		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+		/* flush desc with pa dma_addr */
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+	}
+#else
+#ifdef __AVX512VL__
+	if (avx512) {
+		struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+		struct rte_mbuf *mb4, *mb5, *mb6, *mb7;
+		__m512i dma_addr0_3, dma_addr4_7;
+		__m512i hdr_room = _mm512_set1_epi64(RTE_PKTMBUF_HEADROOM);
+		/* Initialize the mbufs in vector, process 8 mbufs in one loop */
+		for (i = 0; i < n; i += 8, rxep += 8, rxdp += 8) {
+			__m128i vaddr0, vaddr1, vaddr2, vaddr3;
+			__m128i vaddr4, vaddr5, vaddr6, vaddr7;
+			__m256i vaddr0_1, vaddr2_3;
+			__m256i vaddr4_5, vaddr6_7;
+			__m512i vaddr0_3, vaddr4_7;
+
+			mb0 = rxep[0].mbuf;
+			mb1 = rxep[1].mbuf;
+			mb2 = rxep[2].mbuf;
+			mb3 = rxep[3].mbuf;
+			mb4 = rxep[4].mbuf;
+			mb5 = rxep[5].mbuf;
+			mb6 = rxep[6].mbuf;
+			mb7 = rxep[7].mbuf;
+
+			/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+			RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+					offsetof(struct rte_mbuf, buf_addr) + 8);
+			vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+			vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+			vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+			vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+			vaddr4 = _mm_loadu_si128((__m128i *)&mb4->buf_addr);
+			vaddr5 = _mm_loadu_si128((__m128i *)&mb5->buf_addr);
+			vaddr6 = _mm_loadu_si128((__m128i *)&mb6->buf_addr);
+			vaddr7 = _mm_loadu_si128((__m128i *)&mb7->buf_addr);
+
+			/**
+			 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+			 * into the high lanes. Similarly for 2 & 3, and so on.
+			 */
+			vaddr0_1 =
+				_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr0),
+							vaddr1, 1);
+			vaddr2_3 =
+				_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr2),
+							vaddr3, 1);
+			vaddr4_5 =
+				_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr4),
+							vaddr5, 1);
+			vaddr6_7 =
+				_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr6),
+							vaddr7, 1);
+			vaddr0_3 =
+				_mm512_inserti64x4(_mm512_castsi256_si512(vaddr0_1),
+						   vaddr2_3, 1);
+			vaddr4_7 =
+				_mm512_inserti64x4(_mm512_castsi256_si512(vaddr4_5),
+						   vaddr6_7, 1);
+
+			/* convert pa to dma_addr hdr/data */
+			dma_addr0_3 = _mm512_unpackhi_epi64(vaddr0_3, vaddr0_3);
+			dma_addr4_7 = _mm512_unpackhi_epi64(vaddr4_7, vaddr4_7);
+
+			/* add headroom to pa values */
+			dma_addr0_3 = _mm512_add_epi64(dma_addr0_3, hdr_room);
+			dma_addr4_7 = _mm512_add_epi64(dma_addr4_7, hdr_room);
+
+			/* flush desc with pa dma_addr */
+			_mm512_store_si512((__m512i *)&rxdp->read, dma_addr0_3);
+			_mm512_store_si512((__m512i *)&(rxdp + 4)->read, dma_addr4_7);
+		}
+	} else {
+#endif /* __AVX512VL__*/
+		struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+		__m256i dma_addr0_1, dma_addr2_3;
+		__m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
+		/* Initialize the mbufs in vector, process 4 mbufs in one loop */
+		for (i = 0; i < n; i += 4, rxep += 4, rxdp += 4) {
+			__m128i vaddr0, vaddr1, vaddr2, vaddr3;
+			__m256i vaddr0_1, vaddr2_3;
+
+			mb0 = rxep[0].mbuf;
+			mb1 = rxep[1].mbuf;
+			mb2 = rxep[2].mbuf;
+			mb3 = rxep[3].mbuf;
+
+			/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+			RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+					offsetof(struct rte_mbuf, buf_addr) + 8);
+			vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+			vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+			vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+			vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+
+			/**
+			 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+			 * into the high lanes. Similarly for 2 & 3
+			 */
+			vaddr0_1 = _mm256_inserti128_si256
+				(_mm256_castsi128_si256(vaddr0), vaddr1, 1);
+			vaddr2_3 = _mm256_inserti128_si256
+				(_mm256_castsi128_si256(vaddr2), vaddr3, 1);
+
+			/* convert pa to dma_addr hdr/data */
+			dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
+			dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
+
+			/* add headroom to pa values */
+			dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
+			dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
+
+			/* flush desc with pa dma_addr */
+			_mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
+			_mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
+		}
+	}
+
+#endif
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += n;
+	rx_id = rxq->rxrearm_start - 1;
+
+	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+		rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+		if (!rxq->rxrearm_start)
+			rx_id = rxq->nb_rx_desc - 1;
+		else
+			rx_id = rxq->rxrearm_start - 1;
+	}
+
+	rxq->rxrearm_nb -= n;
+
+	/* Update the tail pointer on the NIC */
+	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
+}
 #endif /* __AVX2__*/
 
 #endif /*_I40E_RXTX_COMMON_AVX_H_*/
diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx2.c b/drivers/net/i40e/i40e_rxtx_vec_avx2.c
index c73b2a321b..fcb7ba0273 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_avx2.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_avx2.c
@@ -25,6 +25,12 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
 	return i40e_rxq_rearm_common(rxq, false);
 }
 
+static __rte_always_inline void
+i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
+{
+	return i40e_rxq_direct_rearm_common(rxq, false);
+}
+
 #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
 /* Handles 32B descriptor FDIR ID processing:
  * rxdp: receive descriptor ring, required to load 2nd 16B half of each desc
@@ -128,8 +134,12 @@ _recv_raw_pkts_vec_avx2(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 	/* See if we need to rearm the RX queue - gives the prefetch a bit
 	 * of time to act
 	 */
-	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
-		i40e_rxq_rearm(rxq);
+	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
+		if (rxq->direct_rxrearm_enable)
+			i40e_rxq_direct_rearm(rxq);
+		else
+			i40e_rxq_rearm(rxq);
+	}
 
 	/* Before we start moving massive data around, check to see if
 	 * there is actually a packet available
diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx512.c b/drivers/net/i40e/i40e_rxtx_vec_avx512.c
index 2e8a3f0df6..d967095edc 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_avx512.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_avx512.c
@@ -21,6 +21,12 @@
 
 #define RTE_I40E_DESCS_PER_LOOP_AVX 8
 
+enum i40e_direct_rearm_type_value {
+	I40E_DIRECT_REARM_TYPE_NORMAL		= 0x0,
+	I40E_DIRECT_REARM_TYPE_FAST_FREE	= 0x1,
+	I40E_DIRECT_REARM_TYPE_PRE_FREE		= 0x2,
+};
+
 static __rte_always_inline void
 i40e_rxq_rearm(struct i40e_rx_queue *rxq)
 {
@@ -150,6 +156,241 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
 	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
 }
 
+static __rte_always_inline void
+i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
+{
+	struct rte_eth_dev *dev;
+	struct i40e_tx_queue *txq;
+	volatile union i40e_rx_desc *rxdp;
+	struct i40e_vec_tx_entry *txep;
+	struct i40e_rx_entry *rxep;
+	struct rte_mbuf *m[RTE_I40E_RXQ_REARM_THRESH];
+	uint16_t tx_port_id, tx_queue_id;
+	uint16_t rx_id;
+	uint16_t i, n;
+	uint16_t j = 0;
+	uint16_t nb_rearm = 0;
+	enum i40e_direct_rearm_type_value type;
+	struct rte_mempool_cache *cache = NULL;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	tx_port_id = rxq->direct_rxrearm_port;
+	tx_queue_id = rxq->direct_rxrearm_queue;
+	dev = &rte_eth_devices[tx_port_id];
+	txq = dev->data->tx_queues[tx_queue_id];
+
+	/* check Rx queue is able to take in the whole
+	 * batch of free mbufs from Tx queue
+	 */
+	if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
+		/* check DD bits on threshold descriptor */
+		if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+			rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+			rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
+			goto mempool_bulk;
+		}
+
+		if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
+			goto mempool_bulk;
+
+		n = txq->tx_rs_thresh;
+
+		/* first buffer to free from S/W ring is at index
+		 * tx_next_dd - (tx_rs_thresh-1)
+		 */
+		txep = (void *)txq->sw_ring;
+		txep += txq->tx_next_dd - (n - 1);
+
+		if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+			/* directly put mbufs from Tx to Rx */
+			uint32_t copied = 0;
+			/* n is multiple of 32 */
+			while (copied < n) {
+				const __m512i a = _mm512_load_si512(&txep[copied]);
+				const __m512i b = _mm512_load_si512(&txep[copied + 8]);
+				const __m512i c = _mm512_load_si512(&txep[copied + 16]);
+				const __m512i d = _mm512_load_si512(&txep[copied + 24]);
+
+				_mm512_storeu_si512(&rxep[copied], a);
+				_mm512_storeu_si512(&rxep[copied + 8], b);
+				_mm512_storeu_si512(&rxep[copied + 16], c);
+				_mm512_storeu_si512(&rxep[copied + 24], d);
+				copied += 32;
+			}
+			type = I40E_DIRECT_REARM_TYPE_FAST_FREE;
+		} else {
+			for (i = 0; i < n; i++) {
+				m[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+				/* ensure each Tx freed buffer is valid */
+				if (m[i] != NULL)
+					nb_rearm++;
+			}
+
+			if (nb_rearm != n) {
+				txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+				txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+				if (txq->tx_next_dd >= txq->nb_tx_desc)
+					txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+				goto mempool_bulk;
+			} else {
+				type = I40E_DIRECT_REARM_TYPE_PRE_FREE;
+			}
+		}
+
+	/* update counters for Tx */
+	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+	if (txq->tx_next_dd >= txq->nb_tx_desc)
+		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+	} else {
+mempool_bulk:
+		cache = rte_mempool_default_cache(rxq->mp, rte_lcore_id());
+
+		if (unlikely(!cache))
+			return i40e_rxq_rearm_common(rxq, true);
+
+		n = RTE_I40E_RXQ_REARM_THRESH;
+
+		/* We need to pull 'n' more MBUFs into the software ring from mempool
+		 * We inline the mempool function here, so we can vectorize the copy
+		 * from the cache into the shadow ring.
+		 */
+
+		if (cache->len < RTE_I40E_RXQ_REARM_THRESH) {
+			/* No. Backfill the cache first, and then fill from it */
+			uint32_t req = RTE_I40E_RXQ_REARM_THRESH + (cache->size -
+					cache->len);
+
+			/* How many do we require
+			 * i.e. number to fill the cache + the request
+			 */
+			int ret = rte_mempool_ops_dequeue_bulk(rxq->mp,
+					&cache->objs[cache->len], req);
+			if (ret == 0) {
+				cache->len += req;
+			} else {
+				if (rxq->rxrearm_nb + RTE_I40E_RXQ_REARM_THRESH >=
+						rxq->nb_rx_desc) {
+					__m128i dma_addr0;
+
+					dma_addr0 = _mm_setzero_si128();
+					for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
+						rxep[i].mbuf = &rxq->fake_mbuf;
+						_mm_store_si128
+							((__m128i *)&rxdp[i].read,
+								dma_addr0);
+					}
+				}
+				rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+						RTE_I40E_RXQ_REARM_THRESH;
+				return;
+			}
+		}
+
+		type = I40E_DIRECT_REARM_TYPE_NORMAL;
+	}
+
+	const __m512i iova_offsets =  _mm512_set1_epi64
+		(offsetof(struct rte_mbuf, buf_iova));
+	const __m512i headroom = _mm512_set1_epi64(RTE_PKTMBUF_HEADROOM);
+
+#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
+	/* to shuffle the addresses to correct slots. Values 4-7 will contain
+	 * zeros, so use 7 for a zero-value.
+	 */
+	const __m512i permute_idx = _mm512_set_epi64(7, 7, 3, 1, 7, 7, 2, 0);
+#else
+	const __m512i permute_idx = _mm512_set_epi64(7, 3, 6, 2, 5, 1, 4, 0);
+#endif
+
+	__m512i mbuf_ptrs;
+
+	/* Initialize the mbufs in vector, process 8 mbufs in one loop, taking
+	 * from mempool cache and populating both shadow and HW rings
+	 */
+	for (i = 0; i < RTE_I40E_RXQ_REARM_THRESH / 8; i++) {
+		switch (type) {
+		case I40E_DIRECT_REARM_TYPE_FAST_FREE:
+			mbuf_ptrs = _mm512_loadu_si512(rxep);
+			break;
+		case I40E_DIRECT_REARM_TYPE_PRE_FREE:
+			mbuf_ptrs = _mm512_loadu_si512(&m[j]);
+			_mm512_store_si512(rxep, mbuf_ptrs);
+			j += 8;
+			break;
+		case I40E_DIRECT_REARM_TYPE_NORMAL:
+			mbuf_ptrs = _mm512_loadu_si512
+				(&cache->objs[cache->len - 8]);
+			_mm512_store_si512(rxep, mbuf_ptrs);
+			cache->len -= 8;
+			break;
+		}
+
+		/* gather iova of mbuf0-7 into one zmm reg */
+		const __m512i iova_base_addrs = _mm512_i64gather_epi64
+			(_mm512_add_epi64(mbuf_ptrs, iova_offsets),
+				0, /* base */
+				1 /* scale */);
+		const __m512i iova_addrs = _mm512_add_epi64(iova_base_addrs,
+				headroom);
+#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
+		const __m512i iovas0 = _mm512_castsi256_si512
+			(_mm512_extracti64x4_epi64(iova_addrs, 0));
+		const __m512i iovas1 = _mm512_castsi256_si512
+			(_mm512_extracti64x4_epi64(iova_addrs, 1));
+
+		/* permute leaves desc 2-3 addresses in header address slots 0-1
+		 * but these are ignored by driver since header split not
+		 * enabled. Similarly for desc 4 & 5.
+		 */
+		const __m512i desc_rd_0_1 = _mm512_permutexvar_epi64
+			(permute_idx, iovas0);
+		const __m512i desc_rd_2_3 = _mm512_bsrli_epi128(desc_rd_0_1, 8);
+
+		const __m512i desc_rd_4_5 = _mm512_permutexvar_epi64
+			(permute_idx, iovas1);
+		const __m512i desc_rd_6_7 = _mm512_bsrli_epi128(desc_rd_4_5, 8);
+
+		_mm512_store_si512((void *)rxdp, desc_rd_0_1);
+		_mm512_store_si512((void *)(rxdp + 2), desc_rd_2_3);
+		_mm512_store_si512((void *)(rxdp + 4), desc_rd_4_5);
+		_mm512_store_si512((void *)(rxdp + 6), desc_rd_6_7);
+#else
+		/* permute leaves desc 4-7 addresses in header address slots 0-3
+		 * but these are ignored by driver since header split not
+		 * enabled.
+		 */
+		const __m512i desc_rd_0_3 = _mm512_permutexvar_epi64
+			(permute_idx, iova_addrs);
+		const __m512i desc_rd_4_7 = _mm512_bsrli_epi128(desc_rd_0_3, 8);
+
+		_mm512_store_si512((void *)rxdp, desc_rd_0_3);
+		_mm512_store_si512((void *)(rxdp + 4), desc_rd_4_7);
+#endif
+		rxdp += 8, rxep += 8;
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += n;
+	rx_id = rxq->rxrearm_start - 1;
+
+	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+		rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+		if (!rxq->rxrearm_start)
+			rx_id = rxq->nb_rx_desc - 1;
+		else
+			rx_id = rxq->rxrearm_start - 1;
+	}
+
+	rxq->rxrearm_nb -= n;
+
+	/* Update the tail pointer on the NIC */
+	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
+}
+
 #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
 /* Handles 32B descriptor FDIR ID processing:
  * rxdp: receive descriptor ring, required to load 2nd 16B half of each desc
@@ -252,8 +493,12 @@ _recv_raw_pkts_vec_avx512(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 	/* See if we need to rearm the RX queue - gives the prefetch a bit
 	 * of time to act
 	 */
-	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
-		i40e_rxq_rearm(rxq);
+	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
+		if (rxq->direct_rxrearm_enable)
+			i40e_rxq_direct_rearm(rxq);
+		else
+			i40e_rxq_rearm(rxq);
+	}
 
 	/* Before we start moving massive data around, check to see if
 	 * there is actually a packet available
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index fa9e6582c5..dc78e3c90b 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -77,6 +77,139 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
 	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
 }
 
+static inline void
+i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
+{
+	struct rte_eth_dev *dev;
+	struct i40e_tx_queue *txq;
+	volatile union i40e_rx_desc *rxdp;
+	struct i40e_tx_entry *txep;
+	struct i40e_rx_entry *rxep;
+	uint16_t tx_port_id, tx_queue_id;
+	uint16_t rx_id;
+	struct rte_mbuf *mb0, *mb1, *m;
+	uint64x2_t dma_addr0, dma_addr1;
+	uint64x2_t zero = vdupq_n_u64(0);
+	uint64_t paddr;
+	uint16_t i, n;
+	uint16_t nb_rearm = 0;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	tx_port_id = rxq->direct_rxrearm_port;
+	tx_queue_id = rxq->direct_rxrearm_queue;
+	dev = &rte_eth_devices[tx_port_id];
+	txq = dev->data->tx_queues[tx_queue_id];
+
+	/* check Rx queue is able to take in the whole
+	 * batch of free mbufs from Tx queue
+	 */
+	if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
+		/* check DD bits on threshold descriptor */
+		if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+				rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
+			goto mempool_bulk;
+		}
+
+		n = txq->tx_rs_thresh;
+
+		/* first buffer to free from S/W ring is at index
+		 * tx_next_dd - (tx_rs_thresh-1)
+		 */
+		txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+
+		if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+			/* directly put mbufs from Tx to Rx,
+			 * and initialize the mbufs in vector
+			 */
+			for (i = 0; i < n; i++, rxep++, txep++) {
+				rxep[0].mbuf = txep[0].mbuf;
+
+				/* Initialize rxdp descs */
+				mb0 = txep[0].mbuf;
+
+				paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
+				dma_addr0 = vdupq_n_u64(paddr);
+				/* flush desc with pa dma_addr */
+				vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
+			}
+		} else {
+			for (i = 0; i < n; i++) {
+				m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+				if (m != NULL) {
+					rxep[i].mbuf = m;
+
+					/* Initialize rxdp descs */
+					paddr = m->buf_iova + RTE_PKTMBUF_HEADROOM;
+					dma_addr0 = vdupq_n_u64(paddr);
+					/* flush desc with pa dma_addr */
+					vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
+					nb_rearm++;
+				}
+			}
+			n = nb_rearm;
+		}
+
+		/* update counters for Tx */
+		txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+		txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+		if (txq->tx_next_dd >= txq->nb_tx_desc)
+			txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+	} else {
+mempool_bulk:
+		/* if TX did not free bufs into Rx sw-ring,
+		 * get new bufs from mempool
+		 */
+		n = RTE_I40E_RXQ_REARM_THRESH;
+		if (unlikely(rte_mempool_get_bulk(rxq->mp, (void *)rxep, n) < 0)) {
+			if (rxq->rxrearm_nb + n >= rxq->nb_rx_desc) {
+				for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
+					rxep[i].mbuf = &rxq->fake_mbuf;
+					vst1q_u64((uint64_t *)&rxdp[i].read, zero);
+				}
+			}
+			rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed += n;
+			return;
+		}
+
+		/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+		for (i = 0; i < n; i += 2, rxep += 2) {
+			mb0 = rxep[0].mbuf;
+			mb1 = rxep[1].mbuf;
+
+			paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
+			dma_addr0 = vdupq_n_u64(paddr);
+			/* flush desc with pa dma_addr */
+			vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
+
+			paddr = mb1->buf_iova + RTE_PKTMBUF_HEADROOM;
+			dma_addr1 = vdupq_n_u64(paddr);
+			/* flush desc with pa dma_addr */
+			vst1q_u64((uint64_t *)&rxdp++->read, dma_addr1);
+		}
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += n;
+	rx_id = rxq->rxrearm_start - 1;
+
+	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+		rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+		if (!rxq->rxrearm_start)
+			rx_id = rxq->nb_rx_desc - 1;
+		else
+			rx_id = rxq->rxrearm_start - 1;
+	}
+
+	rxq->rxrearm_nb -= n;
+
+	rte_io_wmb();
+	/* Update the tail pointer on the NIC */
+	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
 #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
 /* NEON version of FDIR mark extraction for 4 32B descriptors at a time */
 static inline uint32x4_t
@@ -381,8 +514,12 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 	/* See if we need to rearm the RX queue - gives the prefetch a bit
 	 * of time to act
 	 */
-	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
-		i40e_rxq_rearm(rxq);
+	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
+		if (rxq->direct_rxrearm_enable)
+			i40e_rxq_direct_rearm(rxq);
+		else
+			i40e_rxq_rearm(rxq);
+	}
 
 	/* Before we start moving massive data around, check to see if
 	 * there is actually a packet available
diff --git a/drivers/net/i40e/i40e_rxtx_vec_sse.c b/drivers/net/i40e/i40e_rxtx_vec_sse.c
index 3782e8052f..b2f1ab2c8d 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_sse.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_sse.c
@@ -89,6 +89,168 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
 	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
 }
 
+static inline void
+i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
+{
+	struct rte_eth_dev *dev;
+	struct i40e_tx_queue *txq;
+	volatile union i40e_rx_desc *rxdp;
+	struct i40e_tx_entry *txep;
+	struct i40e_rx_entry *rxep;
+	uint16_t tx_port_id, tx_queue_id;
+	uint16_t rx_id;
+	struct rte_mbuf *mb0, *mb1, *m;
+	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+			RTE_PKTMBUF_HEADROOM);
+	__m128i dma_addr0, dma_addr1;
+	__m128i vaddr0, vaddr1;
+	uint16_t i, n;
+	uint16_t nb_rearm = 0;
+
+	rxdp = rxq->rx_ring + rxq->rxrearm_start;
+	rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+	tx_port_id = rxq->direct_rxrearm_port;
+	tx_queue_id = rxq->direct_rxrearm_queue;
+	dev = &rte_eth_devices[tx_port_id];
+	txq = dev->data->tx_queues[tx_queue_id];
+
+	/* check Rx queue is able to take in the whole
+	 * batch of free mbufs from Tx queue
+	 */
+	if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
+		/* check DD bits on threshold descriptor */
+		if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+				rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
+			goto mempool_bulk;
+		}
+
+		n = txq->tx_rs_thresh;
+
+		/* first buffer to free from S/W ring is at index
+		 * tx_next_dd - (tx_rs_thresh-1)
+		 */
+		txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+
+		if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+			/* directly put mbufs from Tx to Rx,
+			 * and initialize the mbufs in vector
+			 */
+			for (i = 0; i < n; i++, rxep++, txep++) {
+				rxep[0].mbuf = txep[0].mbuf;
+
+				/* Initialize rxdp descs */
+				mb0 = txep[0].mbuf;
+
+				/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+				RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+						offsetof(struct rte_mbuf, buf_addr) + 8);
+				vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+
+				/* convert pa to dma_addr hdr/data */
+				dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+
+				/* add headroom to pa values */
+				dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+
+				/* flush desc with pa dma_addr */
+				_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+			}
+		} else {
+			for (i = 0; i < n; i++) {
+				m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+				if (m != NULL) {
+					rxep[i].mbuf = m;
+
+					/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+					RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+							offsetof(struct rte_mbuf, buf_addr) + 8);
+					vaddr0 = _mm_loadu_si128((__m128i *)&m->buf_addr);
+
+					/* convert pa to dma_addr hdr/data */
+					dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+
+					/* add headroom to pa values */
+					dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+
+					/* flush desc with pa dma_addr */
+					_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+					nb_rearm++;
+				}
+			}
+			n = nb_rearm;
+		}
+
+		/* update counters for Tx */
+		txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+		txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+		if (txq->tx_next_dd >= txq->nb_tx_desc)
+			txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+	} else {
+mempool_bulk:
+		/* if TX did not free bufs into Rx sw-ring,
+		 * get new bufs from mempool
+		 */
+		n = RTE_I40E_RXQ_REARM_THRESH;
+		/* Pull 'n' more MBUFs into the software ring */
+		if (rte_mempool_get_bulk(rxq->mp, (void *)rxep, n) < 0) {
+			if (rxq->rxrearm_nb + n >= rxq->nb_rx_desc) {
+				dma_addr0 = _mm_setzero_si128();
+				for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
+					rxep[i].mbuf = &rxq->fake_mbuf;
+					_mm_store_si128((__m128i *)&rxdp[i].read,
+							dma_addr0);
+				}
+			}
+			rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+				RTE_I40E_RXQ_REARM_THRESH;
+			return;
+		}
+
+		/* Initialize the mbufs in vector, process 2 mbufs in one loop */
+		for (i = 0; i < RTE_I40E_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+			mb0 = rxep[0].mbuf;
+			mb1 = rxep[1].mbuf;
+
+			/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+			RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+					offsetof(struct rte_mbuf, buf_addr) + 8);
+			vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+			vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+			/* convert pa to dma_addr hdr/data */
+			dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+			dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+			/* add headroom to pa values */
+			dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+			dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+			/* flush desc with pa dma_addr */
+			_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+			_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+		}
+	}
+
+	/* Update the descriptor initializer index */
+	rxq->rxrearm_start += n;
+	rx_id = rxq->rxrearm_start - 1;
+
+	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+		rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+		if (!rxq->rxrearm_start)
+			rx_id = rxq->nb_rx_desc - 1;
+		else
+			rx_id = rxq->rxrearm_start - 1;
+	}
+
+	rxq->rxrearm_nb -= n;
+
+	/* Update the tail pointer on the NIC */
+	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
 #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
 /* SSE version of FDIR mark extraction for 4 32B descriptors at a time */
 static inline __m128i
@@ -394,8 +556,12 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 	/* See if we need to rearm the RX queue - gives the prefetch a bit
 	 * of time to act
 	 */
-	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
-		i40e_rxq_rearm(rxq);
+	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
+		if (rxq->direct_rxrearm_enable)
+			i40e_rxq_direct_rearm(rxq);
+		else
+			i40e_rxq_rearm(rxq);
+	}
 
 	/* Before we start moving massive data around, check to see if
 	 * there is actually a packet available
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-20  8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
  2022-04-20  8:16 ` [PATCH v1 1/5] net/i40e: remove redundant Dtype initialization Feifei Wang
  2022-04-20  8:16 ` [PATCH v1 2/5] net/i40e: enable direct rearm mode Feifei Wang
@ 2022-04-20  8:16 ` Feifei Wang
  2022-04-20  9:59   ` Morten Brørup
                     ` (3 more replies)
  2022-04-20  8:16 ` [PATCH v1 4/5] net/i40e: add direct rearm mode internal API Feifei Wang
                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 27+ messages in thread
From: Feifei Wang @ 2022-04-20  8:16 UTC (permalink / raw)
  To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella
  Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

Add API for enabling direct rearm mode and for mapping RX and TX
queues. Currently, the API supports 1:1(txq : rxq) mapping.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
 lib/ethdev/rte_ethdev.c    | 14 ++++++++++++++
 lib/ethdev/rte_ethdev.h    | 31 +++++++++++++++++++++++++++++++
 lib/ethdev/version.map     |  1 +
 4 files changed, 61 insertions(+)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc21d8..22022f6da9 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct rte_eth_dev *dev,
 typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
 				    uint16_t rx_queue_id);
 
+/** @internal Enable direct rearm of a receive queue of an Ethernet device. */
+typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
+						uint16_t queue_id);
+
+/**< @internal map Rx/Tx queue of direct rearm mode */
+typedef int (*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
+					uint16_t rx_queue_id,
+					uint16_t tx_port_id,
+					uint16_t tx_queue_id);
+
 /** @internal Release memory resources allocated by given Rx/Tx queue. */
 typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
 				    uint16_t queue_id);
@@ -1152,6 +1162,11 @@ struct eth_dev_ops {
 	/** Disable Rx queue interrupt */
 	eth_rx_disable_intr_t      rx_queue_intr_disable;
 
+	/** Enable Rx queue direct rearm mode */
+	eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
+	/** Map Rx/Tx queue for direct rearm mode */
+	eth_rx_direct_rearm_map_t  rx_queue_direct_rearm_map;
+
 	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device Tx queue */
 	eth_queue_release_t        tx_queue_release; /**< Release Tx queue */
 	eth_tx_done_cleanup_t      tx_done_cleanup;/**< Free Tx ring mbufs */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 29a3d80466..8e6f0284f4 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t port_id, uint16_t tx_queue_id,
 	return eth_err(port_id, ret);
 }
 
+int
+rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
+		uint16_t tx_port_id, uint16_t tx_queue_id)
+{
+	struct rte_eth_dev *dev;
+
+	dev = &rte_eth_devices[rx_port_id];
+	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
+	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
+			tx_port_id, tx_queue_id);
+
+	return 0;
+}
+
 int
 rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
 {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04cff8ee10..4a431fcbed 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5190,6 +5190,37 @@ __rte_experimental
 int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
 				       struct rte_eth_hairpin_cap *cap);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable direct re-arm mode. In this mode the RX queue will be re-armed using
+ * buffers that have completed transmission on the transmit side.
+ *
+ * @note
+ *   It is assumed that the buffers have completed transmission belong to the
+ *   mempool used at the receive side, and have refcnt = 1.
+ *
+ * @param rx_port_id
+ *   Port identifying the receive side.
+ * @param rx_queue_id
+ *   The index of the receive queue identifying the receive side.
+ *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ * @param tx_port_id
+ *   Port identifying the transmit side.
+ * @param tx_queue_id
+ *   The index of the transmit queue identifying the transmit side.
+ *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ *   to rte_eth_dev_configure().
+ *
+ * @return
+ *   - (0) if successful.
+ */
+__rte_experimental
+int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
+			       uint16_t tx_port_id, uint16_t tx_queue_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this structure may change without prior notice.
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 20391ab29e..68d664498c 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -279,6 +279,7 @@ EXPERIMENTAL {
 	rte_flow_async_action_handle_create;
 	rte_flow_async_action_handle_destroy;
 	rte_flow_async_action_handle_update;
+	rte_eth_direct_rxrearm_map;
 };
 
 INTERNAL {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v1 4/5] net/i40e: add direct rearm mode internal API
  2022-04-20  8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
                   ` (2 preceding siblings ...)
  2022-04-20  8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
@ 2022-04-20  8:16 ` Feifei Wang
  2022-05-11 22:31   ` Konstantin Ananyev
  2022-04-20  8:16 ` [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode Feifei Wang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Feifei Wang @ 2022-04-20  8:16 UTC (permalink / raw)
  To: Beilei Xing; +Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

For direct rearm mode, add two internal functions.

One is to enable direct rearm mode on an Rx queue.

The other is to map a Tx queue to an Rx queue, so that the Rx queue
takes buffers from that specific Tx queue.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/i40e/i40e_ethdev.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 755786dc10..9e1a523bcc 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -369,6 +369,13 @@ static int i40e_dev_rx_queue_intr_enable(struct rte_eth_dev *dev,
 static int i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev,
 					  uint16_t queue_id);
 
+static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
+						uint16_t queue_id);
+static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
+						uint16_t rx_queue_id,
+						uint16_t tx_port_id,
+						uint16_t tx_queue_id);
+
 static int i40e_get_regs(struct rte_eth_dev *dev,
 			 struct rte_dev_reg_info *regs);
 
@@ -477,6 +484,8 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
 	.rx_queue_setup               = i40e_dev_rx_queue_setup,
 	.rx_queue_intr_enable         = i40e_dev_rx_queue_intr_enable,
 	.rx_queue_intr_disable        = i40e_dev_rx_queue_intr_disable,
+	.rx_queue_direct_rearm_enable = i40e_dev_rx_queue_direct_rearm_enable,
+	.rx_queue_direct_rearm_map    = i40e_dev_rx_queue_direct_rearm_map,
 	.rx_queue_release             = i40e_dev_rx_queue_release,
 	.tx_queue_setup               = i40e_dev_tx_queue_setup,
 	.tx_queue_release             = i40e_dev_tx_queue_release,
@@ -11108,6 +11117,31 @@ i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id)
 	return 0;
 }
 
+static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
+			uint16_t queue_id)
+{
+	struct i40e_rx_queue *rxq;
+
+	rxq = dev->data->rx_queues[queue_id];
+	rxq->direct_rxrearm_enable = 1;
+
+	return 0;
+}
+
+static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
+				uint16_t rx_queue_id, uint16_t tx_port_id,
+				uint16_t tx_queue_id)
+{
+	struct i40e_rx_queue *rxq;
+
+	rxq = dev->data->rx_queues[rx_queue_id];
+
+	rxq->direct_rxrearm_port = tx_port_id;
+	rxq->direct_rxrearm_queue = tx_queue_id;
+
+	return 0;
+}
+
 /**
  * This function is used to check if the register is valid.
  * Below is the valid registers list for X722 only:
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
  2022-04-20  8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
                   ` (3 preceding siblings ...)
  2022-04-20  8:16 ` [PATCH v1 4/5] net/i40e: add direct rearm mode internal API Feifei Wang
@ 2022-04-20  8:16 ` Feifei Wang
  2022-04-20 10:10   ` Morten Brørup
  2022-05-11 22:33   ` Konstantin Ananyev
  2022-05-11 23:00 ` [PATCH v1 0/5] Direct re-arming of buffers on receive side Konstantin Ananyev
       [not found] ` <20220516061012.618787-1-feifei.wang2@arm.com>
  6 siblings, 2 replies; 27+ messages in thread
From: Feifei Wang @ 2022-04-20  8:16 UTC (permalink / raw)
  Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang

Enable direct rearm mode. The mapping is decided in the data plane based
on the first packet received.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
index bec22c44cd..38ffdf4636 100644
--- a/examples/l3fwd/l3fwd_lpm.c
+++ b/examples/l3fwd/l3fwd_lpm.c
@@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
 	unsigned lcore_id;
 	uint64_t prev_tsc, diff_tsc, cur_tsc;
 	int i, nb_rx;
-	uint16_t portid;
+	uint16_t portid, tx_portid;
 	uint8_t queueid;
 	struct lcore_conf *qconf;
 	const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
@@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
 
 	const uint16_t n_rx_q = qconf->n_rx_queue;
 	const uint16_t n_tx_p = qconf->n_tx_port;
+	int direct_rearm_map[n_rx_q];
+
 	if (n_rx_q == 0) {
 		RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
 		return 0;
@@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
 
 		portid = qconf->rx_queue_list[i].port_id;
 		queueid = qconf->rx_queue_list[i].queue_id;
+		direct_rearm_map[i] = 0;
 		RTE_LOG(INFO, L3FWD,
 			" -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
 			lcore_id, portid, queueid);
@@ -209,6 +212,17 @@ lpm_main_loop(__rte_unused void *dummy)
 			if (nb_rx == 0)
 				continue;
 
+			/* Determine the direct rearm mapping based on the first
+			 * packet received on the rx queue
+			 */
+			if (direct_rearm_map[i] == 0) {
+				tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
+							portid);
+				rte_eth_direct_rxrearm_map(portid, queueid,
+								tx_portid, queueid);
+				direct_rearm_map[i] = 1;
+			}
+
 #if defined RTE_ARCH_X86 || defined __ARM_NEON \
 			 || defined RTE_ARCH_PPC_64
 			l3fwd_lpm_send_packets(nb_rx, pkts_burst,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-20  8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
@ 2022-04-20  9:59   ` Morten Brørup
  2022-04-29  2:42     ` Re: " Feifei Wang
  2022-04-20 10:41   ` Andrew Rybchenko
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Morten Brørup @ 2022-04-20  9:59 UTC (permalink / raw)
  To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Ray Kinsella
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang

> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Wednesday, 20 April 2022 10.17
> 
> Add API for enabling direct rearm mode and for mapping RX and TX
> queues. Currently, the API supports 1:1(txq : rxq) mapping.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
>  lib/ethdev/rte_ethdev.c    | 14 ++++++++++++++
>  lib/ethdev/rte_ethdev.h    | 31 +++++++++++++++++++++++++++++++
>  lib/ethdev/version.map     |  1 +
>  4 files changed, 61 insertions(+)
> 
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 69d9dc21d8..22022f6da9 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> rte_eth_dev *dev,
>  typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
>  				    uint16_t rx_queue_id);
> 
> +/** @internal Enable direct rearm of a receive queue of an Ethernet
> device. */
> +typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
> +						uint16_t queue_id);
> +
> +/**< @internal map Rx/Tx queue of direct rearm mode */
> +typedef int (*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> +					uint16_t rx_queue_id,
> +					uint16_t tx_port_id,
> +					uint16_t tx_queue_id);
> +
>  /** @internal Release memory resources allocated by given Rx/Tx queue.
> */
>  typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
>  				    uint16_t queue_id);
> @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
>  	/** Disable Rx queue interrupt */
>  	eth_rx_disable_intr_t      rx_queue_intr_disable;
> 
> +	/** Enable Rx queue direct rearm mode */
> +	eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;

A disable function seems to be missing.

> +	/** Map Rx/Tx queue for direct rearm mode */
> +	eth_rx_direct_rearm_map_t  rx_queue_direct_rearm_map;
> +
>  	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device Tx
> queue */
>  	eth_queue_release_t        tx_queue_release; /**< Release Tx
> queue */
>  	eth_tx_done_cleanup_t      tx_done_cleanup;/**< Free Tx ring
> mbufs */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 29a3d80466..8e6f0284f4 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t port_id,
> uint16_t tx_queue_id,
>  	return eth_err(port_id, ret);
>  }
> 
> +int
> +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> +		uint16_t tx_port_id, uint16_t tx_queue_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	dev = &rte_eth_devices[rx_port_id];
> +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
> +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> +			tx_port_id, tx_queue_id);

Here you enable the mapping before you configure it. It could cause the driver to use an uninitialized map, if it processes packets between these two function calls.

Error handling is missing. Not all drivers support this feature, and the parameters should be validated.

Regarding driver support, the driver should also expose a capability flag to the application, similar to the RTE_ETH_DEV_CAPA_RXQ_SHARE or RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE flags. The documentation for this flag could include the description of all the restrictions to using it.

> +
> +	return 0;
> +}
> +
>  int
>  rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
>  {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8ee10..4a431fcbed 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5190,6 +5190,37 @@ __rte_experimental
>  int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
>  				       struct rte_eth_hairpin_cap *cap);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> notice
> + *
> + * Enable direct re-arm mode. In this mode the RX queue will be re-
> armed using
> + * buffers that have completed transmission on the transmit side.
> + *
> + * @note
> + *   It is assumed that the buffers have completed transmission belong
> to the
> + *   mempool used at the receive side, and have refcnt = 1.
> + *
> + * @param rx_port_id
> + *   Port identifying the receive side.
> + * @param rx_queue_id
> + *   The index of the receive queue identifying the receive side.
> + *   The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_port_id
> + *   Port identifying the transmit side.
> + * @param tx_queue_id
> + *   The index of the transmit queue identifying the transmit side.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> + *   to rte_eth_dev_configure().
> + *
> + * @return
> + *   - (0) if successful.
> + */
> +__rte_experimental
> +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> +			       uint16_t tx_port_id, uint16_t tx_queue_id);
> +

I agree with the parameters to your proposed API here. Since the relevant use case only needs 1:1 mapping, exposing an API function to take some sort of array with N:M mappings would be premature, and probably not ever come into play anyway.

How do you remove, disable and/or change a mapping?

>  /**
>   * @warning
>   * @b EXPERIMENTAL: this structure may change without prior notice.
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 20391ab29e..68d664498c 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -279,6 +279,7 @@ EXPERIMENTAL {
>  	rte_flow_async_action_handle_create;
>  	rte_flow_async_action_handle_destroy;
>  	rte_flow_async_action_handle_update;
> +	rte_eth_direct_rxrearm_map;
>  };
> 
>  INTERNAL {
> --
> 2.25.1
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
  2022-04-20  8:16 ` [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode Feifei Wang
@ 2022-04-20 10:10   ` Morten Brørup
  2022-04-21  2:35     ` Honnappa Nagarahalli
  2022-05-11 22:33   ` Konstantin Ananyev
  1 sibling, 1 reply; 27+ messages in thread
From: Morten Brørup @ 2022-04-20 10:10 UTC (permalink / raw)
  To: Feifei Wang; +Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang

> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Wednesday, 20 April 2022 10.17
> 
> Enable direct rearm mode. The mapping is decided in the data plane
> based
> on the first packet received.

I usually don't care much about l3fwd, but putting configuration changes in the fast path is just wrong!

Also, l3fwd is often used for benchmarking, and this small piece of code in the fast path will affect benchmark results (although only very little).

Please move it out of the fast path.
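
For example, with a statically known 1:1 pairing the mapping could be done once per lcore during setup, before the main loop starts. A rough sketch, reusing l3fwd's lcore_conf layout ('paired_tx_port()' is a placeholder for whatever policy the application uses to pick the Tx port):

	struct lcore_conf *qconf = &lcore_conf[lcore_id];
	uint16_t i;
	int ret;

	for (i = 0; i < qconf->n_rx_queue; i++) {
		uint16_t rx_port = qconf->rx_queue_list[i].port_id;
		uint16_t rx_queue = qconf->rx_queue_list[i].queue_id;
		uint16_t tx_port = paired_tx_port(rx_port);
		uint16_t tx_queue = qconf->tx_queue_id[tx_port];

		ret = rte_eth_direct_rxrearm_map(rx_port, rx_queue,
				tx_port, tx_queue);
		if (ret != 0)
			rte_exit(EXIT_FAILURE,
				"Direct rearm mapping failed\n");
	}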


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-20  8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
  2022-04-20  9:59   ` Morten Brørup
@ 2022-04-20 10:41   ` Andrew Rybchenko
  2022-04-29  6:28     ` Re: " Feifei Wang
  2022-04-20 10:50   ` Jerin Jacob
  2022-04-21 14:57   ` Stephen Hemminger
  3 siblings, 1 reply; 27+ messages in thread
From: Andrew Rybchenko @ 2022-04-20 10:41 UTC (permalink / raw)
  To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Ray Kinsella
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang

On 4/20/22 11:16, Feifei Wang wrote:
> Add API for enabling direct rearm mode and for mapping RX and TX
> queues. Currently, the API supports 1:1(txq : rxq) mapping.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>   lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
>   lib/ethdev/rte_ethdev.c    | 14 ++++++++++++++
>   lib/ethdev/rte_ethdev.h    | 31 +++++++++++++++++++++++++++++++
>   lib/ethdev/version.map     |  1 +
>   4 files changed, 61 insertions(+)
> 
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 69d9dc21d8..22022f6da9 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct rte_eth_dev *dev,
>   typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
>   				    uint16_t rx_queue_id);
>   
> +/** @internal Enable direct rearm of a receive queue of an Ethernet device. */
> +typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
> +						uint16_t queue_id);
> +
> +/**< @internal map Rx/Tx queue of direct rearm mode */
> +typedef int (*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> +					uint16_t rx_queue_id,
> +					uint16_t tx_port_id,
> +					uint16_t tx_queue_id);
> +
>   /** @internal Release memory resources allocated by given Rx/Tx queue. */
>   typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
>   				    uint16_t queue_id);
> @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
>   	/** Disable Rx queue interrupt */
>   	eth_rx_disable_intr_t      rx_queue_intr_disable;
>   
> +	/** Enable Rx queue direct rearm mode */
> +	eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> +	/** Map Rx/Tx queue for direct rearm mode */
> +	eth_rx_direct_rearm_map_t  rx_queue_direct_rearm_map;
> +
>   	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device Tx queue */
>   	eth_queue_release_t        tx_queue_release; /**< Release Tx queue */
>   	eth_tx_done_cleanup_t      tx_done_cleanup;/**< Free Tx ring mbufs */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 29a3d80466..8e6f0284f4 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t port_id, uint16_t tx_queue_id,
>   	return eth_err(port_id, ret);
>   }
>   
> +int
> +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> +		uint16_t tx_port_id, uint16_t tx_queue_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	dev = &rte_eth_devices[rx_port_id];

I think it is rather a control path operation. So:
We need standard checks that rx_port_id is valid.
tx_port_id must be checked as well.
rx_queue_id and tx_queue_id must be checked to be in range.

> +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
> +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> +			tx_port_id, tx_queue_id);

We must check that the function pointers are not NULL, as usual.
Return values must be checked.
Isn't it safer to set up the map first and then enable it?
Otherwise we definitely need a disable operation.
Also, what should happen on Tx port unplug? How do we continue if
we still have the Rx port up and running?
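
Just to illustrate the ordering and the checks, something along these lines (the exact error codes are open for discussion):

int
rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
		uint16_t tx_port_id, uint16_t tx_queue_id)
{
	struct rte_eth_dev *dev;
	int ret;

	RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, -ENODEV);
	RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, -ENODEV);

	dev = &rte_eth_devices[rx_port_id];
	if (rx_queue_id >= dev->data->nb_rx_queues)
		return -EINVAL;
	if (tx_queue_id >= rte_eth_devices[tx_port_id].data->nb_tx_queues)
		return -EINVAL;

	if (dev->dev_ops->rx_queue_direct_rearm_map == NULL ||
	    dev->dev_ops->rx_queue_direct_rearm_enable == NULL)
		return -ENOTSUP;

	/* Configure the mapping first, then enable it. */
	ret = (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
			tx_port_id, tx_queue_id);
	if (ret != 0)
		return ret;

	return (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
			rx_queue_id);
}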

> +
> +	return 0;
> +}
> +
>   int
>   rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
>   {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8ee10..4a431fcbed 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5190,6 +5190,37 @@ __rte_experimental
>   int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
>   				       struct rte_eth_hairpin_cap *cap);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable direct re-arm mode. In this mode the RX queue will be re-armed using
> + * buffers that have completed transmission on the transmit side.
> + *
> + * @note
> + *   It is assumed that the buffers have completed transmission belong to the
> + *   mempool used at the receive side, and have refcnt = 1.

I think it is possible to avoid such limitations, but the
implementation will be less optimized - more checks.

> + *
> + * @param rx_port_id
> + *   Port identifying the receive side.
> + * @param rx_queue_id
> + *   The index of the receive queue identifying the receive side.
> + *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_port_id
> + *   Port identifying the transmit side.

I guess there is an assumption that the Rx and Tx ports are
serviced by the same driver. If so, and if it is an API
limitation, the ethdev layer must check it.

> + * @param tx_queue_id
> + *   The index of the transmit queue identifying the transmit side.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + *
> + * @return
> + *   - (0) if successful.
> + */
> +__rte_experimental
> +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> +			       uint16_t tx_port_id, uint16_t tx_queue_id);
> +
>   /**
>    * @warning
>    * @b EXPERIMENTAL: this structure may change without prior notice.
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 20391ab29e..68d664498c 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -279,6 +279,7 @@ EXPERIMENTAL {
>   	rte_flow_async_action_handle_create;
>   	rte_flow_async_action_handle_destroy;
>   	rte_flow_async_action_handle_update;
> +	rte_eth_direct_rxrearm_map;
>   };
>   
>   INTERNAL {


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-20  8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
  2022-04-20  9:59   ` Morten Brørup
  2022-04-20 10:41   ` Andrew Rybchenko
@ 2022-04-20 10:50   ` Jerin Jacob
  2022-05-02  3:09     ` Re: " Feifei Wang
  2022-04-21 14:57   ` Stephen Hemminger
  3 siblings, 1 reply; 27+ messages in thread
From: Jerin Jacob @ 2022-04-20 10:50 UTC (permalink / raw)
  To: Feifei Wang
  Cc: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella,
	dpdk-dev, nd, Honnappa Nagarahalli, Ruifeng Wang

On Wed, Apr 20, 2022 at 1:47 PM Feifei Wang <feifei.wang2@arm.com> wrote:
>
> Add API for enabling direct rearm mode and for mapping RX and TX
> queues. Currently, the API supports 1:1(txq : rxq) mapping.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---

> + *
> + * @return
> + *   - (0) if successful.
> + */
> +__rte_experimental
> +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> +                              uint16_t tx_port_id, uint16_t tx_queue_id);

Won't existing rte_eth_hairpin_* APIs work to achieve the same?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
  2022-04-20 10:10   ` Morten Brørup
@ 2022-04-21  2:35     ` Honnappa Nagarahalli
  2022-04-21  6:40       ` Morten Brørup
  0 siblings, 1 reply; 27+ messages in thread
From: Honnappa Nagarahalli @ 2022-04-21  2:35 UTC (permalink / raw)
  To: Morten Brørup, Feifei Wang; +Cc: dev, nd, Ruifeng Wang, nd

<snip>

> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Wednesday, 20 April 2022 10.17
> >
> > Enable direct rearm mode. The mapping is decided in the data plane
> > based on the first packet received.
> 
> I usually don't care much about l3fwd, but putting configuration changes in the
> fast path is just wrong!
I would say it depends. In this case the cycles consumed by the API are very few, and the configuration data is very small and already in the cache, as the PMD has accessed the same data structure.

If the configuration needs more cycles than typical data plane packet processing (depending on the application), or brings an enormous amount of data into the cache, it should not be done on the data plane.

> 
> Also, l3fwd is often used for benchmarking, and this small piece of code in the
> fast path will affect benchmark results (although only very little).
We do not see any impact on the performance numbers. The reason for putting it in the data plane was that it covers a wider use case in this L3fwd application. If the app were simpler, the configuration could be done from the control plane. Unfortunately, the performance of the L3fwd application matters.

> 
> Please move it out of the fast path.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
  2022-04-21  2:35     ` Honnappa Nagarahalli
@ 2022-04-21  6:40       ` Morten Brørup
  2022-05-10 22:01         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 27+ messages in thread
From: Morten Brørup @ 2022-04-21  6:40 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Feifei Wang; +Cc: dev, nd, Ruifeng Wang, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, 21 April 2022 04.35
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Wednesday, 20 April 2022 10.17
> > >
> > > Enable direct rearm mode. The mapping is decided in the data plane
> > > based on the first packet received.
> >
> > I usually don't care much about l3fwd, but putting configuration
> changes in the
> > fast path is just wrong!
> I would say it depends. In this case the cycles consumed by the API are
> very less and configuration data is very small and is already in the
> cache as PMD has accessed the same data structure.
> 
> If the configuration needs more cycles than a typical (depending on the
> application) data plane packet processing needs or brings in enormous
> amount of data in to the cache, it should not be done on the data
> plane.
> 

As a matter of principle, configuration changes should be done outside the fast path.

If we allow an exception for this feature, it will set a bad precedent about where to put configuration code.

> >
> > Also, l3fwd is often used for benchmarking, and this small piece of
> code in the
> > fast path will affect benchmark results (although only very little).
> We do not see any impact on the performance numbers. The reason for
> putting in the data plane was it covers wider use case in this L3fwd
> application. If the app were to be simple, the configuration could be
> done from the control plane. Unfortunately, the performance of L3fwd
> application matters.
> 

Let's proceed down that path for the sake of discussion... Then the fast path is missing runtime verification that all preconditions for using remapping are present at all times.

> >
> > Please move it out of the fast path.

BTW, this patch does not call rte_eth_direct_rxrearm_enable() to enable the feature.

And finally, this feature should be disabled by default, and only enabled by a command line parameter or similar. Otherwise, future l3fwd NIC performance reports will provide misleading performance results, if the feature is utilized. Application developers, when comparing NIC performance results, don't care about the performance for this unique use case; they care about the performance for the generic use case.
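
E.g. an opt-in command line option along these lines ("--direct-rearm" is just an example name, not something defined by this patch set):

#include <getopt.h>
#include <stdbool.h>

static bool direct_rearm_enabled;	/* off by default */

enum { OPT_DIRECT_REARM = 256 };

static const struct option lgopts[] = {
	{"direct-rearm", no_argument, NULL, OPT_DIRECT_REARM},
	{NULL, 0, NULL, 0},
};

static void
parse_direct_rearm_option(int argc, char **argv)
{
	int opt;

	while ((opt = getopt_long(argc, argv, "", lgopts, NULL)) != -1) {
		if (opt == OPT_DIRECT_REARM)
			direct_rearm_enabled = true;
	}
}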



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-20  8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
                     ` (2 preceding siblings ...)
  2022-04-20 10:50   ` Jerin Jacob
@ 2022-04-21 14:57   ` Stephen Hemminger
  2022-04-29  6:35     ` Re: " Feifei Wang
  3 siblings, 1 reply; 27+ messages in thread
From: Stephen Hemminger @ 2022-04-21 14:57 UTC (permalink / raw)
  To: Feifei Wang
  Cc: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella,
	dev, nd, Honnappa Nagarahalli, Ruifeng Wang

On Wed, 20 Apr 2022 16:16:48 +0800
Feifei Wang <feifei.wang2@arm.com> wrote:

> Add API for enabling direct rearm mode and for mapping RX and TX
> queues. Currently, the API supports 1:1(txq : rxq) mapping.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
>  lib/ethdev/rte_ethdev.c    | 14 ++++++++++++++
>  lib/ethdev/rte_ethdev.h    | 31 +++++++++++++++++++++++++++++++
>  lib/ethdev/version.map     |  1 +
>  4 files changed, 61 insertions(+)
> 
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 69d9dc21d8..22022f6da9 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct rte_eth_dev *dev,
>  typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
>  				    uint16_t rx_queue_id);
>  
> +/** @internal Enable direct rearm of a receive queue of an Ethernet device. */
> +typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
> +						uint16_t queue_id);
> +
> +/**< @internal map Rx/Tx queue of direct rearm mode */
> +typedef int (*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> +					uint16_t rx_queue_id,
> +					uint16_t tx_port_id,
> +					uint16_t tx_queue_id);
> +
>  /** @internal Release memory resources allocated by given Rx/Tx queue. */
>  typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
>  				    uint16_t queue_id);
> @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
>  	/** Disable Rx queue interrupt */
>  	eth_rx_disable_intr_t      rx_queue_intr_disable;
>  
> +	/** Enable Rx queue direct rearm mode */
> +	eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> +	/** Map Rx/Tx queue for direct rearm mode */
> +	eth_rx_direct_rearm_map_t  rx_queue_direct_rearm_map;
> +
>  	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device Tx queue */
>  	eth_queue_release_t        tx_queue_release; /**< Release Tx queue */
>  	eth_tx_done_cleanup_t      tx_done_cleanup;/**< Free Tx ring mbufs */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 29a3d80466..8e6f0284f4 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t port_id, uint16_t tx_queue_id,
>  	return eth_err(port_id, ret);
>  }
>  
> +int
> +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> +		uint16_t tx_port_id, uint16_t tx_queue_id)
> +{
> +	struct rte_eth_dev *dev;
> +
> +	dev = &rte_eth_devices[rx_port_id];
> +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
> +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> +			tx_port_id, tx_queue_id);
> +
> +	return 0;
> +}
> +
>  int
>  rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
>  {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8ee10..4a431fcbed 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5190,6 +5190,37 @@ __rte_experimental
>  int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
>  				       struct rte_eth_hairpin_cap *cap);
>  
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable direct re-arm mode. In this mode the RX queue will be re-armed using
> + * buffers that have completed transmission on the transmit side.
> + *
> + * @note
> + *   It is assumed that the buffers have completed transmission belong to the
> + *   mempool used at the receive side, and have refcnt = 1.
> + *
> + * @param rx_port_id
> + *   Port identifying the receive side.
> + * @param rx_queue_id
> + *   The index of the receive queue identifying the receive side.
> + *   The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_port_id
> + *   Port identifying the transmit side.
> + * @param tx_queue_id
> + *   The index of the transmit queue identifying the transmit side.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + *
> + * @return
> + *   - (0) if successful.
> + */
> +__rte_experimental
> +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> +			       uint16_t tx_port_id, uint16_t tx_queue_id);

Just looking at this.

Why is this done via an API call and not a flag as part of the receive config?
All the other offloads and configuration happen via the dev config.
Doing it this way doesn't follow the existing ethdev model.
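
For comparison, following the offload model it would look roughly like this at queue setup time (RTE_ETH_RX_OFFLOAD_DIRECT_REARM is a hypothetical flag, and nb_rxd/mbuf_pool come from the usual setup code):

	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;
	int ret;

	ret = rte_eth_dev_info_get(rx_port_id, &dev_info);
	if (ret != 0)
		rte_exit(EXIT_FAILURE, "Cannot get device info\n");

	rxconf = dev_info.default_rxconf;
	/* Request the mode like any other per-queue Rx offload. */
	rxconf.offloads |= RTE_ETH_RX_OFFLOAD_DIRECT_REARM;

	ret = rte_eth_rx_queue_setup(rx_port_id, rx_queue_id, nb_rxd,
			rte_eth_dev_socket_id(rx_port_id), &rxconf, mbuf_pool);
	if (ret < 0)
		rte_exit(EXIT_FAILURE, "Rx queue setup failed\n");

	/* The Rx->Tx queue pairing itself would still need a call such as
	 * rte_eth_direct_rxrearm_map() from this patch.
	 */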

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-20  9:59   ` Morten Brørup
@ 2022-04-29  2:42     ` Feifei Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Feifei Wang @ 2022-04-29  2:42 UTC (permalink / raw)
  To: Morten Brørup, thomas, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, April 20, 2022 5:59 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@intel.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella <mdr@ashroe.eu>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Subject: RE: [PATCH v1 3/5] ethdev: add API for direct rearm mode
> 
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Wednesday, 20 April 2022 10.17
> >
> > Add API for enabling direct rearm mode and for mapping RX and TX
> > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >  lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> >  lib/ethdev/rte_ethdev.c    | 14 ++++++++++++++
> >  lib/ethdev/rte_ethdev.h    | 31 +++++++++++++++++++++++++++++++
> >  lib/ethdev/version.map     |  1 +
> >  4 files changed, 61 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 69d9dc21d8..22022f6da9 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> > rte_eth_dev *dev,  typedef int (*eth_rx_disable_intr_t)(struct
> > rte_eth_dev *dev,
> >  				    uint16_t rx_queue_id);
> >
> > +/** @internal Enable direct rearm of a receive queue of an Ethernet
> > device. */
> > +typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
> > +						uint16_t queue_id);
> > +
> > +/**< @internal map Rx/Tx queue of direct rearm mode */ typedef int
> > +(*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> > +					uint16_t rx_queue_id,
> > +					uint16_t tx_port_id,
> > +					uint16_t tx_queue_id);
> > +
> >  /** @internal Release memory resources allocated by given Rx/Tx queue.
> > */
> >  typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> >  				    uint16_t queue_id);
> > @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> >  	/** Disable Rx queue interrupt */
> >  	eth_rx_disable_intr_t      rx_queue_intr_disable;
> >
> > +	/** Enable Rx queue direct rearm mode */
> > +	eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> 
> A disable function seems to be missing.
[Feifei] I will try to use offload bits to enable direct-rearm mode, so this enable function will be
removed and a disable function will be unnecessary.

> 
> > +	/** Map Rx/Tx queue for direct rearm mode */
> > +	eth_rx_direct_rearm_map_t  rx_queue_direct_rearm_map;
> > +
> >  	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device Tx
> > queue */
> >  	eth_queue_release_t        tx_queue_release; /**< Release Tx
> > queue */
> >  	eth_tx_done_cleanup_t      tx_done_cleanup;/**< Free Tx ring
> > mbufs */
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 29a3d80466..8e6f0284f4 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t
> > port_id, uint16_t tx_queue_id,
> >  	return eth_err(port_id, ret);
> >  }
> >
> > +int
> > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> > +		uint16_t tx_port_id, uint16_t tx_queue_id) {
> > +	struct rte_eth_dev *dev;
> > +
> > +	dev = &rte_eth_devices[rx_port_id];
> > +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> rx_queue_id);
> > +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > +			tx_port_id, tx_queue_id);
> 
> Here you enable the mapping before you configure it. It could cause the
> driver to use an uninitialized map, if it processes packets between these two
> function calls.
[Feifei] I agree with this and will change the code.

> 
> Error handling is missing. Not all drivers support this feature, and the
> parameters should be validated.
[Feifei] You are right. I think after we use the 'rxq->offload' bits, we can use the offload-bits API
to check whether the driver supports this.
For the parameters, I will add some checks.

> 
> Regarding driver support, the driver should also expose a capability flag to the
> application, similar to the RTE_ETH_DEV_CAPA_RXQ_SHARE or
> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE flags. The documentation for this
> flag could include the description of all the restrictions to using it.
[Feifei] I will do this via the 'rxq->offload' bits, and add a description to the documentation.

> 
> > +
> > +	return 0;
> > +}
> > +
> >  int
> >  rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)  { diff
> > --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..4a431fcbed 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -5190,6 +5190,37 @@ __rte_experimental  int
> > rte_eth_dev_hairpin_capability_get(uint16_t port_id,
> >  				       struct rte_eth_hairpin_cap *cap);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > notice
> > + *
> > + * Enable direct re-arm mode. In this mode the RX queue will be re-
> > armed using
> > + * buffers that have completed transmission on the transmit side.
> > + *
> > + * @note
> > + *   It is assumed that the buffers have completed transmission belong
> > to the
> > + *   mempool used at the receive side, and have refcnt = 1.
> > + *
> > + * @param rx_port_id
> > + *   Port identifying the receive side.
> > + * @param rx_queue_id
> > + *   The index of the receive queue identifying the receive side.
> > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> > supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + *   Port identifying the transmit side.
> > + * @param tx_queue_id
> > + *   The index of the transmit queue identifying the transmit side.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> > supplied
> > + *   to rte_eth_dev_configure().
> > + *
> > + * @return
> > + *   - (0) if successful.
> > + */
> > +__rte_experimental
> > +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> > rx_queue_id,
> > +			       uint16_t tx_port_id, uint16_t tx_queue_id);
> > +
> 
> I agree with the parameters to your proposed API here. Since the relevant
> use case only needs 1:1 mapping, exposing an API function to take some sort
> of array with N:M mappings would be premature, and probably not ever come
> into play anyway.
> 
> How do you remove, disable and/or change a mapping?
[Feifei] It is not recommended that users change the mapping while packets are being sent and received,
as this may introduce error risks. If the user wants to change the mapping, they need to stop the device and call
the 'rte_eth_direct_rxrearm_map' API to rewrite the mapping.
Furthermore, 'rxq->offload' needs to be set before the device starts. If the user wants to change it, the device needs to be restarted.
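
In other words, a remap would roughly follow this sequence (a sketch only; whether stopping just the queues instead of the whole port is enough is a separate question):

	/* Stop the ports involved before changing the mapping. */
	ret = rte_eth_dev_stop(rx_port_id);
	if (ret != 0)
		return ret;
	ret = rte_eth_dev_stop(tx_port_id);
	if (ret != 0)
		return ret;

	/* Rewrite the Rx->Tx queue mapping. */
	ret = rte_eth_direct_rxrearm_map(rx_port_id, rx_queue_id,
			new_tx_port_id, new_tx_queue_id);
	if (ret != 0)
		return ret;

	ret = rte_eth_dev_start(tx_port_id);
	if (ret != 0)
		return ret;
	ret = rte_eth_dev_start(rx_port_id);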
> 
> >  /**
> >   * @warning
> >   * @b EXPERIMENTAL: this structure may change without prior notice.
> > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > 20391ab29e..68d664498c 100644
> > --- a/lib/ethdev/version.map
> > +++ b/lib/ethdev/version.map
> > @@ -279,6 +279,7 @@ EXPERIMENTAL {
> >  	rte_flow_async_action_handle_create;
> >  	rte_flow_async_action_handle_destroy;
> >  	rte_flow_async_action_handle_update;
> > +	rte_eth_direct_rxrearm_map;
> >  };
> >
> >  INTERNAL {
> > --
> > 2.25.1
> >


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-20 10:41   ` Andrew Rybchenko
@ 2022-04-29  6:28     ` Feifei Wang
  2022-05-10 22:49       ` Honnappa Nagarahalli
  0 siblings, 1 reply; 27+ messages in thread
From: Feifei Wang @ 2022-04-29  6:28 UTC (permalink / raw)
  To: Andrew Rybchenko, thomas, Ferruh Yigit, Ray Kinsella
  Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd



> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Wednesday, April 20, 2022 6:41 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@intel.com>; Ray Kinsella <mdr@ashroe.eu>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
> 
> On 4/20/22 11:16, Feifei Wang wrote:
> > Add API for enabling direct rearm mode and for mapping RX and TX
> > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >   lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> >   lib/ethdev/rte_ethdev.c    | 14 ++++++++++++++
> >   lib/ethdev/rte_ethdev.h    | 31 +++++++++++++++++++++++++++++++
> >   lib/ethdev/version.map     |  1 +
> >   4 files changed, 61 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 69d9dc21d8..22022f6da9 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> rte_eth_dev *dev,
> >   typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
> >   				    uint16_t rx_queue_id);
> >
> > +/** @internal Enable direct rearm of a receive queue of an Ethernet
> > +device. */ typedef int (*eth_rx_direct_rearm_enable_t)(struct
> rte_eth_dev *dev,
> > +						uint16_t queue_id);
> > +
> > +/**< @internal map Rx/Tx queue of direct rearm mode */ typedef int
> > +(*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> > +					uint16_t rx_queue_id,
> > +					uint16_t tx_port_id,
> > +					uint16_t tx_queue_id);
> > +
> >   /** @internal Release memory resources allocated by given Rx/Tx queue.
> */
> >   typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> >   				    uint16_t queue_id);
> > @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> >   	/** Disable Rx queue interrupt */
> >   	eth_rx_disable_intr_t      rx_queue_intr_disable;
> >
> > +	/** Enable Rx queue direct rearm mode */
> > +	eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> > +	/** Map Rx/Tx queue for direct rearm mode */
> > +	eth_rx_direct_rearm_map_t  rx_queue_direct_rearm_map;
> > +
> >   	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device Tx
> queue */
> >   	eth_queue_release_t        tx_queue_release; /**< Release Tx queue
> */
> >   	eth_tx_done_cleanup_t      tx_done_cleanup;/**< Free Tx ring mbufs
> */
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 29a3d80466..8e6f0284f4 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t
> port_id, uint16_t tx_queue_id,
> >   	return eth_err(port_id, ret);
> >   }
> >
> > +int
> > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> > +		uint16_t tx_port_id, uint16_t tx_queue_id) {
> > +	struct rte_eth_dev *dev;
> > +
> > +	dev = &rte_eth_devices[rx_port_id];
> 
> I think it is rather control path. So:
> We need standard checks that rx_port_id is valid.
> tx_port_id must be checked as well.
> rx_queue_id and tx_queue_id must be checked to be in the rate.
[Feifei] You are right, I will add checks for these.

> 
> > +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> rx_queue_id);
> > +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > +			tx_port_id, tx_queue_id);
> 
> We must check that function pointers are not NULL as usual.
> Return values must be checked.
[Feifei] I agree with this. The checks for the function pointers and return values will be added.

> Isn't is safe to setup map and than enable.
> Otherwise we definitely need disable.
[Feifei] I will change the code to do the mapping first and then set 'rxq->offload' to enable direct-rearm mode.

> Also, what should happen on Tx port unplug? How to continue if we still have
> Rx port up and running?
[Feifei] For direct rearm mode, if the Tx port is unplugged, it means there are no buffers from Tx.
In that case, Rx will take buffers from the mempool as usual for rearm.
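
Conceptually the rearm path then looks like this (a simplified sketch, not the actual i40e code; 'take_done_mbufs_from_tx' and 'paired_txq' are placeholders for the driver-internal mechanics):

	struct rte_mbuf *rearm_bufs[RTE_I40E_RXQ_REARM_THRESH];
	uint16_t nb_rearm = RTE_I40E_RXQ_REARM_THRESH;
	uint16_t nb_taken;

	/* Try to take the used mbufs directly from the mapped Tx queue. */
	nb_taken = take_done_mbufs_from_tx(rxq->paired_txq, rearm_bufs,
			nb_rearm);

	if (nb_taken < nb_rearm) {
		/* The Tx queue had nothing usable (e.g. stopped/unplugged):
		 * fall back to the mempool as in the normal rearm path.
		 */
		if (rte_mempool_get_bulk(rxq->mp,
				(void **)&rearm_bufs[nb_taken],
				nb_rearm - nb_taken) < 0)
			return;	/* retry on the next rearm attempt */
	}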

> 
> > +
> > +	return 0;
> > +}
> > +
> >   int
> >   rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
> >   {
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..4a431fcbed 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -5190,6 +5190,37 @@ __rte_experimental
> >   int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
> >   				       struct rte_eth_hairpin_cap *cap);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Enable direct re-arm mode. In this mode the RX queue will be
> > +re-armed using
> > + * buffers that have completed transmission on the transmit side.
> > + *
> > + * @note
> > + *   It is assumed that the buffers have completed transmission belong to
> the
> > + *   mempool used at the receive side, and have refcnt = 1.
> 
> I think it is possible to avoid such limitations, but implementation will be less
> optimized - more checks.
[Feifei] For the first limitation: Rx and Tx buffers should be from the same mempool.
If we want to check this, we would have to add a check for each packet in the data plane, which would
significantly reduce performance. So, it is better to document this for users rather than adding
check code.
For the second limitation (refcnt = 1): we have now added code to support 'refcnt = 1' in direct-rearm
mode, so this note can be removed in the next version.

> 
> > + *
> > + * @param rx_port_id
> > + *   Port identifying the receive side.
> > + * @param rx_queue_id
> > + *   The index of the receive queue identifying the receive side.
> > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + *   Port identifying the transmit side.
> 
> I guess there is an assumption that Rx and Tx ports are serviced by the same
> driver. If so and if it is an API limitation, ethdev layer must check it.
[Feifei] I agree with this. For the requirement that the Rx and Tx ports are serviced by the same driver,
I will add a check.

> 
> > + * @param tx_queue_id
> > + *   The index of the transmit queue identifying the transmit side.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + *
> > + * @return
> > + *   - (0) if successful.
> > + */
> > +__rte_experimental
> > +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> > +			       uint16_t tx_port_id, uint16_t tx_queue_id);
> > +
> >   /**
> >    * @warning
> >    * @b EXPERIMENTAL: this structure may change without prior notice.
> > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > 20391ab29e..68d664498c 100644
> > --- a/lib/ethdev/version.map
> > +++ b/lib/ethdev/version.map
> > @@ -279,6 +279,7 @@ EXPERIMENTAL {
> >   	rte_flow_async_action_handle_create;
> >   	rte_flow_async_action_handle_destroy;
> >   	rte_flow_async_action_handle_update;
> > +	rte_eth_direct_rxrearm_map;
> >   };
> >
> >   INTERNAL {


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-21 14:57   ` Stephen Hemminger
@ 2022-04-29  6:35     ` Feifei Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Feifei Wang @ 2022-04-29  6:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: thomas, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, dev, nd,
	Honnappa Nagarahalli, Ruifeng Wang, nd



> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, April 21, 2022 10:58 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: thomas@monjalon.net; Ferruh Yigit <ferruh.yigit@intel.com>; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> <mdr@ashroe.eu>; dev@dpdk.org; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
> 
> On Wed, 20 Apr 2022 16:16:48 +0800
> Feifei Wang <feifei.wang2@arm.com> wrote:
> 
> > Add API for enabling direct rearm mode and for mapping RX and TX
> > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >  lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> >  lib/ethdev/rte_ethdev.c    | 14 ++++++++++++++
> >  lib/ethdev/rte_ethdev.h    | 31 +++++++++++++++++++++++++++++++
> >  lib/ethdev/version.map     |  1 +
> >  4 files changed, 61 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 69d9dc21d8..22022f6da9 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> > rte_eth_dev *dev,  typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev
> *dev,
> >  				    uint16_t rx_queue_id);
> >
> > +/** @internal Enable direct rearm of a receive queue of an Ethernet
> > +device. */ typedef int (*eth_rx_direct_rearm_enable_t)(struct
> rte_eth_dev *dev,
> > +						uint16_t queue_id);
> > +
> > +/**< @internal map Rx/Tx queue of direct rearm mode */ typedef int
> > +(*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> > +					uint16_t rx_queue_id,
> > +					uint16_t tx_port_id,
> > +					uint16_t tx_queue_id);
> > +
> >  /** @internal Release memory resources allocated by given Rx/Tx
> > queue. */  typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> >  				    uint16_t queue_id);
> > @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> >  	/** Disable Rx queue interrupt */
> >  	eth_rx_disable_intr_t      rx_queue_intr_disable;
> >
> > +	/** Enable Rx queue direct rearm mode */
> > +	eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> > +	/** Map Rx/Tx queue for direct rearm mode */
> > +	eth_rx_direct_rearm_map_t  rx_queue_direct_rearm_map;
> > +
> >  	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device Tx
> queue */
> >  	eth_queue_release_t        tx_queue_release; /**< Release Tx queue
> */
> >  	eth_tx_done_cleanup_t      tx_done_cleanup;/**< Free Tx ring mbufs
> */
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 29a3d80466..8e6f0284f4 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t
> port_id, uint16_t tx_queue_id,
> >  	return eth_err(port_id, ret);
> >  }
> >
> > +int
> > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> > +		uint16_t tx_port_id, uint16_t tx_queue_id) {
> > +	struct rte_eth_dev *dev;
> > +
> > +	dev = &rte_eth_devices[rx_port_id];
> > +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> rx_queue_id);
> > +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > +			tx_port_id, tx_queue_id);
> > +
> > +	return 0;
> > +}
> > +
> >  int
> >  rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)  { diff
> > --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..4a431fcbed 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -5190,6 +5190,37 @@ __rte_experimental  int
> > rte_eth_dev_hairpin_capability_get(uint16_t port_id,
> >  				       struct rte_eth_hairpin_cap *cap);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Enable direct re-arm mode. In this mode the RX queue will be
> > +re-armed using
> > + * buffers that have completed transmission on the transmit side.
> > + *
> > + * @note
> > + *   It is assumed that the buffers have completed transmission belong to
> the
> > + *   mempool used at the receive side, and have refcnt = 1.
> > + *
> > + * @param rx_port_id
> > + *   Port identifying the receive side.
> > + * @param rx_queue_id
> > + *   The index of the receive queue identifying the receive side.
> > + *   The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + *   Port identifying the transmit side.
> > + * @param tx_queue_id
> > + *   The index of the transmit queue identifying the transmit side.
> > + *   The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + *   to rte_eth_dev_configure().
> > + *
> > + * @return
> > + *   - (0) if successful.
> > + */
> > +__rte_experimental
> > +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> > +			       uint16_t tx_port_id, uint16_t tx_queue_id);
> 
> Just looking at this.
> 
> Why is this done via API call and not a flag as part of the receive config?
> All the other offload and configuration happens via dev config.
> Doing it this way doesn't follow the existing ethdev model.
[Feifei] Agree with this. I will remove the direct-rearm enable function and
use the "rxq->offload" bit to enable it.
For rte_eth_direct_rxrearm_map, I think it is still necessary for users to call it to map the Rx/Tx queues.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-20 10:50   ` Jerin Jacob
@ 2022-05-02  3:09     ` Feifei Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Feifei Wang @ 2022-05-02  3:09 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: thomas, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, dpdk-dev,
	nd, Honnappa Nagarahalli, Ruifeng Wang, nd



> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, April 20, 2022 6:50 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: thomas@monjalon.net; Ferruh Yigit <ferruh.yigit@intel.com>; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> <mdr@ashroe.eu>; dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
> 
> On Wed, Apr 20, 2022 at 1:47 PM Feifei Wang <feifei.wang2@arm.com>
> wrote:
> >
> > Add API for enabling direct rearm mode and for mapping RX and TX
> > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> 
> > + *
> > + * @return
> > + *   - (0) if successful.
> > + */
> > +__rte_experimental
> > +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> > +                              uint16_t tx_port_id, uint16_t
> > +tx_queue_id);
> 
> Won't existing rte_eth_hairpin_* APIs work to achieve the same?
[Feifei] Thanks for the comment. Looking at the hairpin feature, which is enabled in the MLX5 driver:

I think the most important difference is that hairpin just redirects packets from an Rx queue
to a Tx queue on the same port, and the Rx/Tx queues only record the peer queue id.
Direct rearm can map an Rx queue to a Tx queue on a different port, and this requires the
Rx queue to record the paired port id and queue id.

Furthermore, hairpin needs to set up a new hairpin queue before it can bind an Rx queue to a Tx queue,
while direct rearm can use normal queues for the mapping. This is because direct rearm only needs the
used buffers and does not care about the packets.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
  2022-04-21  6:40       ` Morten Brørup
@ 2022-05-10 22:01         ` Honnappa Nagarahalli
  2022-05-11  7:17           ` Morten Brørup
  0 siblings, 1 reply; 27+ messages in thread
From: Honnappa Nagarahalli @ 2022-05-10 22:01 UTC (permalink / raw)
  To: Morten Brørup, Feifei Wang; +Cc: dev, nd, Ruifeng Wang, nd

(apologies for the late response, this one slipped my mind)

I would appreciate it if others could weigh in with their opinions.

<snip>
> 
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Thursday, 21 April 2022 04.35
> > >
> > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > Sent: Wednesday, 20 April 2022 10.17
> > > >
> > > > Enable direct rearm mode. The mapping is decided in the data plane
> > > > based on the first packet received.
> > >
> > > I usually don't care much about l3fwd, but putting configuration
> > changes in the
> > > fast path is just wrong!
> > I would say it depends. In this case the cycles consumed by the API
> > are very less and configuration data is very small and is already in
> > the cache as PMD has accessed the same data structure.
> >
> > If the configuration needs more cycles than a typical (depending on
> > the
> > application) data plane packet processing needs or brings in enormous
> > amount of data in to the cache, it should not be done on the data
> > plane.
> >
> 
> As a matter of principle, configuration changes should be done outside the fast
> path.
> 
> If we allow an exception for this feature, it will set a bad precedent about
> where to put configuration code.
I think there are other examples, though not exactly the same. For example, with the seqlock we cannot have the writer scheduled out while holding the lock. But it was mentioned that this can be overcome easily by running the writer on an isolated core (which to me breaks some principles).

> 
> > >
> > > Also, l3fwd is often used for benchmarking, and this small piece of
> > code in the
> > > fast path will affect benchmark results (although only very little).
> > We do not see any impact on the performance numbers. The reason for
> > putting in the data plane was it covers wider use case in this L3fwd
> > application. If the app were to be simple, the configuration could be
> > done from the control plane. Unfortunately, the performance of L3fwd
> > application matters.
> >
> 
> Let's proceed down that path for the sake of discussion... Then the fast path is
> missing runtime verification that all preconditions for using remapping are
> present at any time.
Agreed, a few checks (ensuring that the Tx and Rx buffers are from the same pool, and that tx_rs_thresh is the same as the Rx rearm threshold) are missing.
We will add these; it is possible to add these checks outside the packet processing loop.
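
For example, the checks could be done once per mapping before entering the loop, along these lines (rx_pool/tx_pool and RX_REARM_THRESH are only illustrative; the rearm threshold is driver internal, e.g. RTE_I40E_RXQ_REARM_THRESH for i40e):

	struct rte_eth_txq_info txq_info;

	/* 1. Rx and Tx queues must share the same mempool. */
	if (rx_pool != tx_pool)
		rte_exit(EXIT_FAILURE,
			"direct rearm needs a shared mempool\n");

	/* 2. Tx free threshold must match the Rx rearm threshold. */
	if (rte_eth_tx_queue_info_get(tx_port_id, tx_queue_id,
			&txq_info) != 0 ||
	    txq_info.conf.tx_rs_thresh != RX_REARM_THRESH)
		rte_exit(EXIT_FAILURE,
			"direct rearm threshold mismatch\n");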

> 
> > >
> > > Please move it out of the fast path.
> 
> BTW, this patch does not call the rte_eth_direct_rxrearm_enable() to enable
> the feature.
> 
> And finally, this feature should be disabled by default, and only enabled by a
> command line parameter or similar. Otherwise, future l3fwd NIC performance
> reports will provide misleading performance results, if the feature is utilized.
> Application developers, when comparing NIC performance results, don't care
> about the performance for this unique use case; they care about the
> performance for the generic use case.
> 
I think this feature is similar to the fast free feature (RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE), as you have mentioned in the other thread. It should be handled similarly to how the fast free feature is handled.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v1 3/5] ethdev: add API for direct rearm mode
  2022-04-29  6:28     ` Re: " Feifei Wang
@ 2022-05-10 22:49       ` Honnappa Nagarahalli
  0 siblings, 0 replies; 27+ messages in thread
From: Honnappa Nagarahalli @ 2022-05-10 22:49 UTC (permalink / raw)
  To: Feifei Wang, Andrew Rybchenko, thomas, Ferruh Yigit, Ray Kinsella
  Cc: dev, nd, Ruifeng Wang, nd

<snip>

> >
> > On 4/20/22 11:16, Feifei Wang wrote:
> > > Add API for enabling direct rearm mode and for mapping RX and TX
> > > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> > >
> > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > ---
> > >   lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> > >   lib/ethdev/rte_ethdev.c    | 14 ++++++++++++++
> > >   lib/ethdev/rte_ethdev.h    | 31 +++++++++++++++++++++++++++++++
> > >   lib/ethdev/version.map     |  1 +
> > >   4 files changed, 61 insertions(+)
> > >
> > > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > > index 69d9dc21d8..22022f6da9 100644
> > > --- a/lib/ethdev/ethdev_driver.h
> > > +++ b/lib/ethdev/ethdev_driver.h
> > > @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> > rte_eth_dev *dev,
> > >   typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
> > >   				    uint16_t rx_queue_id);
> > >
> > > +/** @internal Enable direct rearm of a receive queue of an Ethernet
> > > +device. */ typedef int (*eth_rx_direct_rearm_enable_t)(struct
> > rte_eth_dev *dev,
> > > +						uint16_t queue_id);
> > > +
> > > +/**< @internal map Rx/Tx queue of direct rearm mode */ typedef int
> > > +(*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> > > +					uint16_t rx_queue_id,
> > > +					uint16_t tx_port_id,
> > > +					uint16_t tx_queue_id);
> > > +
> > >   /** @internal Release memory resources allocated by given Rx/Tx queue.
> > */
> > >   typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> > >   				    uint16_t queue_id);
> > > @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> > >   	/** Disable Rx queue interrupt */
> > >   	eth_rx_disable_intr_t      rx_queue_intr_disable;
> > >
> > > +	/** Enable Rx queue direct rearm mode */
> > > +	eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> > > +	/** Map Rx/Tx queue for direct rearm mode */
> > > +	eth_rx_direct_rearm_map_t  rx_queue_direct_rearm_map;
> > > +
> > >   	eth_tx_queue_setup_t       tx_queue_setup;/**< Set up device Tx
> > queue */
> > >   	eth_queue_release_t        tx_queue_release; /**< Release Tx queue
> > */
> > >   	eth_tx_done_cleanup_t      tx_done_cleanup;/**< Free Tx ring mbufs
> > */
> > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > > 29a3d80466..8e6f0284f4 100644
> > > --- a/lib/ethdev/rte_ethdev.c
> > > +++ b/lib/ethdev/rte_ethdev.c
> > > @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t
> > port_id, uint16_t tx_queue_id,
> > >   	return eth_err(port_id, ret);
> > >   }
> > >
> > > +int
> > > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> > > +		uint16_t tx_port_id, uint16_t tx_queue_id) {
> > > +	struct rte_eth_dev *dev;
> > > +
> > > +	dev = &rte_eth_devices[rx_port_id];
> >
> > I think it is rather control path. So:
> > We need standard checks that rx_port_id is valid.
> > tx_port_id must be checked as well.
> > rx_queue_id and tx_queue_id must be checked to be in the rate.
> [Feifei] You are right, I will add check for these.
> 
> >
> > > +	(*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> > rx_queue_id);
> > > +	(*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > > +			tx_port_id, tx_queue_id);
> >
> > We must check that function pointers are not NULL as usual.
> > Return values must be checked.
> [Feifei] I agree with this, The check for pointer and return value will be added
> 
> > Isn't is safe to setup map and than enable.
> > Otherwise we definitely need disable.
> [Feifei] I will change code that map first and then set 'rxq->offload' to enable
> direct-rearm mode.
> 
> > Also, what should happen on Tx port unplug? How to continue if we
> > still have Rx port up and running?
> [Feifei] For direct rearm mode, if Tx port unplug, it means there is no buffer
> from Tx.
> And then, Rx will put buffer from mempool as usual for rearm.
Andrew, when you say 'Tx port unplug', do you mean that 'rte_eth_dev_tx_queue_stop' is called? Is calling 'rte_eth_dev_tx_queue_stop' allowed while the device is running?

> 
> >
> > > +
> > > +	return 0;
> > > +}
> > > +
<snip>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
  2022-05-10 22:01         ` Honnappa Nagarahalli
@ 2022-05-11  7:17           ` Morten Brørup
  0 siblings, 0 replies; 27+ messages in thread
From: Morten Brørup @ 2022-05-11  7:17 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Feifei Wang; +Cc: dev, nd, Ruifeng Wang, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 11 May 2022 00.02
> 
> (apologies for the late response, this one slipped my mind)
> 
> Appreciate if others could weigh their opinions.
> 
> <snip>
> >
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Thursday, 21 April 2022 04.35
> > > >
> > > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > > Sent: Wednesday, 20 April 2022 10.17
> > > > >
> > > > > Enable direct rearm mode. The mapping is decided in the data
> plane
> > > > > based on the first packet received.
> > > >
> > > > I usually don't care much about l3fwd, but putting configuration
> > > changes in the
> > > > fast path is just wrong!
> > > I would say it depends. In this case the cycles consumed by the API
> > > are very less and configuration data is very small and is already
> in
> > > the cache as PMD has accessed the same data structure.
> > >
> > > If the configuration needs more cycles than a typical (depending on
> > > the
> > > application) data plane packet processing needs or brings in
> enormous
> > > amount of data in to the cache, it should not be done on the data
> > > plane.
> > >
> >
> > As a matter of principle, configuration changes should be done
> outside the fast
> > path.
> >
> > If we allow an exception for this feature, it will set a bad
> precedent about
> > where to put configuration code.
> I think there are other examples though not exactly the same. For ex:
> the seqlock, we cannot have a scheduled out writer while holding the
> lock. But, it was mentioned that this can be over come easily by
> running the writer on an isolated core (which to me breaks some
> principles).

Referring to a bad example (which breaks some principles) does not change my opinion. ;-)

> 
> >
> > > >
> > > > Also, l3fwd is often used for benchmarking, and this small piece
> of
> > > code in the
> > > > fast path will affect benchmark results (although only very
> little).
> > > We do not see any impact on the performance numbers. The reason for
> > > putting in the data plane was it covers wider use case in this
> L3fwd
> > > application. If the app were to be simple, the configuration could
> be
> > > done from the control plane. Unfortunately, the performance of
> L3fwd
> > > application matters.
> > >
> >
> > Let's proceed down that path for the sake of discussion... Then the
> fast path is
> > missing runtime verification that all preconditions for using
> remapping are
> > present at any time.
> Agree, few checks (ensuring that TX and RX buffers are from the same
> pool, ensuring tx_rs_thresh is same as RX rearm threshold) are missing.
> We will add these, it is possible to add these checks outside the
> packet processing loop.
> 
> >
> > > >
> > > > Please move it out of the fast path.
> >
> > BTW, this patch does not call the rte_eth_direct_rxrearm_enable() to
> > enable the feature.
> >
> > And finally, this feature should be disabled by default, and only
> > enabled by a command line parameter or similar. Otherwise, future l3fwd
> > NIC performance reports will provide misleading performance results, if
> > the feature is utilized.
> > Application developers, when comparing NIC performance results, don't
> > care about the performance for this unique use case; they care about
> > the performance for the generic use case.
> >
> I think this feature is similar to the fast free feature
> (RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE), as you have mentioned in the other
> thread. It should be handled similarly to how the fast free feature is
> handled.

I agree with this comparison.
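
To make the comparison concrete, here is how I would expect an application to opt in - a rough, untested sketch, assuming the rte_eth_direct_rxrearm_* calls proposed in patch 3/5 return 0 on success:

#include <rte_ethdev.h>

/* rough sketch: both "fast free" and "direct rearm" as explicit,
 * configuration-time opt-ins; nothing is decided in the fast path
 */
static int
opt_in_recycling(uint16_t rx_port, uint16_t rx_q,
		uint16_t tx_port, uint16_t tx_q,
		struct rte_eth_txconf *txconf,
		const struct rte_eth_dev_info *dev_info)
{
	int ret;

	/* existing opt-in: fast free, only if the PMD offers it */
	if (dev_info->tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
		txconf->offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;

	/* proposed opt-in (patch 3/5): map the queues, then enable */
	ret = rte_eth_direct_rxrearm_map(rx_port, rx_q, tx_port, tx_q);
	if (ret != 0)
		return ret;
	return rte_eth_direct_rxrearm_enable(rx_port, rx_q);
}

That way both optimizations stay visible in the application's configuration, rather than being switched on behind the user's back.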

Quickly skimming l3fwd/main.c reveals that RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is used without checking preconditions, and thus might be buggy. E.g. what happens when the NICs support RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE, and l3fwd is run with the "per-port-pool" command line option? Obviously, the "direct rearm" patch should not be punished because of bugs in similar features ("fast free"). But it is not a valid reason to allow similar bugs. You mentioned above that precondition checking will be added, so pardon me for ranting a bit here. ;-)

Furthermore, if using l3fwd for NIC performance reports, I find the results misleading if application-specific optimizations are used without mentioning it in the report. This applies to both the "fast free" and "direct rearm" optimizations - they only work in specific application scenarios, and thus the l3fwd NIC performance test should be run without these optimizations, or at least mention that the report only covers these specific applications. Which is why I prefer that such optimizations must be explicitly enabled through a command line parameter, and not used in testing for official NIC performance reports. Taking one step back, the real problem here is that an *example* application is used for NIC performance testing, and this is the main reason for my performance-related objections. I should probably object to using l3fwd for NIC performance testing instead.

I don't feel strongly about l3fwd, so I will not object to the l3fwd patch. Just providing some feedback. :-)



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 2/5] net/i40e: enable direct rearm mode
  2022-04-20  8:16 ` [PATCH v1 2/5] net/i40e: enable direct rearm mode Feifei Wang
@ 2022-05-11 22:28   ` Konstantin Ananyev
  0 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2022-05-11 22:28 UTC (permalink / raw)
  To: Feifei Wang, Beilei Xing, Bruce Richardson, Konstantin Ananyev,
	Ruifeng Wang
  Cc: dev, nd, Honnappa Nagarahalli


> For i40e driver, enable direct re-arm mode. This patch supports the case
> of mapping Rx/Tx queues from the same single lcore.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>   drivers/net/i40e/i40e_rxtx.h            |   4 +
>   drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
>   drivers/net/i40e/i40e_rxtx_vec_avx2.c   |  14 +-
>   drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
>   drivers/net/i40e/i40e_rxtx_vec_neon.c   | 141 ++++++++++++-
>   drivers/net/i40e/i40e_rxtx_vec_sse.c    | 170 ++++++++++++++-
>   6 files changed, 839 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
> index 5e6eecc501..1fdf4305f4 100644
> --- a/drivers/net/i40e/i40e_rxtx.h
> +++ b/drivers/net/i40e/i40e_rxtx.h
> @@ -102,6 +102,8 @@ struct i40e_rx_queue {
>   
>   	uint16_t rxrearm_nb;	/**< number of remaining to be re-armed */
>   	uint16_t rxrearm_start;	/**< the idx we start the re-arming from */
> +	uint16_t direct_rxrearm_port; /** device TX port ID for direct re-arm mode */
> +	uint16_t direct_rxrearm_queue; /** TX queue index for direct re-arm mode */
>   	uint64_t mbuf_initializer; /**< value to init mbufs */
>   
>   	uint16_t port_id; /**< device port ID */
> @@ -121,6 +123,8 @@ struct i40e_rx_queue {
>   	uint16_t rx_using_sse; /**<flag indicate the usage of vPMD for rx */
>   	uint8_t dcb_tc;         /**< Traffic class of rx queue */
>   	uint64_t offloads; /**< Rx offload flags of RTE_ETH_RX_OFFLOAD_* */
> +	/**<  0 if direct re-arm mode disabled, 1 when enabled */
> +	bool direct_rxrearm_enable;
>   	const struct rte_memzone *mz;
>   };
>   
> diff --git a/drivers/net/i40e/i40e_rxtx_common_avx.h b/drivers/net/i40e/i40e_rxtx_common_avx.h
> index cfc1e63173..a742723e07 100644
> --- a/drivers/net/i40e/i40e_rxtx_common_avx.h
> +++ b/drivers/net/i40e/i40e_rxtx_common_avx.h
> @@ -209,6 +209,275 @@ i40e_rxq_rearm_common(struct i40e_rx_queue *rxq, __rte_unused bool avx512)
>   	/* Update the tail pointer on the NIC */
>   	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
>   }
> +
> +static __rte_always_inline void
> +i40e_rxq_direct_rearm_common(struct i40e_rx_queue *rxq, __rte_unused bool avx512)
> +{
> +	struct rte_eth_dev *dev;
> +	struct i40e_tx_queue *txq;
> +	volatile union i40e_rx_desc *rxdp;
> +	struct i40e_tx_entry *txep;
> +	struct i40e_rx_entry *rxep;
> +	struct rte_mbuf *m[RTE_I40E_RXQ_REARM_THRESH];
> +	uint16_t tx_port_id, tx_queue_id;
> +	uint16_t rx_id;
> +	uint16_t i, n;
> +	uint16_t nb_rearm = 0;
> +
> +	rxdp = rxq->rx_ring + rxq->rxrearm_start;
> +	rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> +	tx_port_id = rxq->direct_rxrearm_port;
> +	tx_queue_id = rxq->direct_rxrearm_queue;
> +	dev = &rte_eth_devices[tx_port_id];
> +	txq = dev->data->tx_queues[tx_queue_id];
> +
> +	/* check Rx queue is able to take in the whole
> +	 * batch of free mbufs from Tx queue
> +	 */
> +	if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
> +		/* check DD bits on threshold descriptor */
> +		if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> +				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> +				rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
> +			goto mempool_bulk;
> +		}
> +
> +		if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
> +			goto mempool_bulk;

I think all these checks (whether this mode can be enabled) should be
done at the config phase, not in the data path.

> +
> +		n = txq->tx_rs_thresh;
> +
> +		/* first buffer to free from S/W ring is at index
> +		 * tx_next_dd - (tx_rs_thresh-1)
> +		 */
> +		txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];


It really looks bad that the RX function accesses and modifies TXQ data
directly. It would be much better to hide the TXD checking/manipulation in
a separate TXQ function (txq_mbuf() or so) that the RX path can invoke.
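
Something along these lines, perhaps - just a sketch, the helper name is
invented, and it assumes the fast-free case where the mbufs can be handed
over as-is:

/* sketch only: a TX-side helper owning all TXD/SW-ring manipulation;
 * the RX rearm path would call it instead of touching txq fields directly
 */
static __rte_always_inline uint16_t
i40e_tx_fill_sw_ring(struct i40e_tx_queue *txq, struct rte_mbuf **pkts)
{
	struct i40e_tx_entry *txep;
	uint16_t i, n;

	/* nothing to hand back unless the threshold descriptor is done */
	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
			rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
			rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
		return 0;

	n = txq->tx_rs_thresh;
	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
	for (i = 0; i < n; i++)
		pkts[i] = txep[i].mbuf;

	/* TX bookkeeping stays hidden on the TX side */
	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + n);
	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + n);
	if (txq->tx_next_dd >= txq->nb_tx_desc)
		txq->tx_next_dd = (uint16_t)(n - 1);

	return n;
}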

> +
> +		if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> +			/* directly put mbufs from Tx to Rx,
> +			 * and initialize the mbufs in vector
> +			 */
> +			for (i = 0; i < n; i++)
> +				rxep[i].mbuf = txep[i].mbuf;
> +		} else {
> +			for (i = 0; i < n; i++) {
> +				m[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> +				/* ensure each Tx freed buffer is valid */
> +				if (m[i] != NULL)
> +					nb_rearm++;
> +			}
> +
> +			if (nb_rearm != n) {
> +				txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> +				txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> +				if (txq->tx_next_dd >= txq->nb_tx_desc)
> +					txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);


So if nb_rearm != 0, what would happen to the mbufs collected in m[]?
Are you just dropping/forgetting them?
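
If so, one option - just a rough sketch of the shape, reusing the variables
from the patch - would be to hand the already pre-freed mbufs back to their
pool before falling back to the bulk allocation:

/* sketch: avoid leaking the pre-freed mbufs when the batch is incomplete */
if (nb_rearm != n) {
	for (i = 0; i < n; i++) {
		if (m[i] != NULL)
			rte_mempool_put(m[i]->pool, m[i]);
	}
	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
	if (txq->tx_next_dd >= txq->nb_tx_desc)
		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
	goto mempool_bulk;
}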


> +
> +				goto mempool_bulk;

> +			} else {
> +				for (i = 0; i < n; i++)
> +					rxep[i].mbuf = m[i];
> +			}
> +		}
> +
> +		/* update counters for Tx */
> +		txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> +		txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> +		if (txq->tx_next_dd >= txq->nb_tx_desc)
> +			txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> +	} else {


I suppose the chunk of code below is just a copy & paste of the
existing i40e_rxq_rearm_common()?
If so, there is no point in duplicating it; better to just invoke it here
(I presume a bit of refactoring would be needed for that).

Pretty much the same thoughts for the other rearm functions below.
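
E.g. something with this shape - a rough sketch, the TX-side helper name is
invented - instead of duplicating the whole mempool path:

/* sketch: keep only the TX-recycling logic in a helper and fall back to
 * the existing rearm helper for the mempool path
 */
static __rte_always_inline void
i40e_rxq_direct_rearm_common(struct i40e_rx_queue *rxq, __rte_unused bool avx512)
{
	/* try to refill straight from the mapped TX queue first;
	 * the helper returns non-zero if it rearmed the descriptors itself
	 */
	if (i40e_rxq_recycle_from_tx(rxq))
		return;

	/* otherwise reuse the existing mempool-based rearm path */
	i40e_rxq_rearm_common(rxq, avx512);
}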

> +mempool_bulk:
> +		/* if TX did not free bufs into Rx sw-ring,
> +		 * get new bufs from mempool
> +		 */
> +		n = RTE_I40E_RXQ_REARM_THRESH;
> +
> +		/* Pull 'n' more MBUFs into the software ring */
> +		if (rte_mempool_get_bulk(rxq->mp,
> +					(void *)rxep,
> +					RTE_I40E_RXQ_REARM_THRESH) < 0) {
> +			if (rxq->rxrearm_nb + RTE_I40E_RXQ_REARM_THRESH >=
> +				rxq->nb_rx_desc) {
> +				__m128i dma_addr0;
> +				dma_addr0 = _mm_setzero_si128();
> +				for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
> +					rxep[i].mbuf = &rxq->fake_mbuf;
> +					_mm_store_si128((__m128i *)&rxdp[i].read,
> +							dma_addr0);
> +				}
> +			}
> +			rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
> +				RTE_I40E_RXQ_REARM_THRESH;
> +			return;
> +		}
> +	}
> +
> +#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> +	struct rte_mbuf *mb0, *mb1;
> +	__m128i dma_addr0, dma_addr1;
> +	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
> +			RTE_PKTMBUF_HEADROOM);
> +	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
> +	for (i = 0; i < n; i += 2, rxep += 2) {
> +		__m128i vaddr0, vaddr1;
> +
> +		mb0 = rxep[0].mbuf;
> +		mb1 = rxep[1].mbuf;
> +
> +		/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> +		RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> +				offsetof(struct rte_mbuf, buf_addr) + 8);
> +		vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> +		vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> +
> +		/* convert pa to dma_addr hdr/data */
> +		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> +		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
> +
> +		/* add headroom to pa values */
> +		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> +		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
> +
> +		/* flush desc with pa dma_addr */
> +		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> +		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
> +	}
> +#else
> +#ifdef __AVX512VL__
> +	if (avx512) {
> +		struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
> +		struct rte_mbuf *mb4, *mb5, *mb6, *mb7;
> +		__m512i dma_addr0_3, dma_addr4_7;
> +		__m512i hdr_room = _mm512_set1_epi64(RTE_PKTMBUF_HEADROOM);
> +		/* Initialize the mbufs in vector, process 8 mbufs in one loop */
> +		for (i = 0; i < n; i += 8, rxep += 8, rxdp += 8) {
> +			__m128i vaddr0, vaddr1, vaddr2, vaddr3;
> +			__m128i vaddr4, vaddr5, vaddr6, vaddr7;
> +			__m256i vaddr0_1, vaddr2_3;
> +			__m256i vaddr4_5, vaddr6_7;
> +			__m512i vaddr0_3, vaddr4_7;
> +
> +			mb0 = rxep[0].mbuf;
> +			mb1 = rxep[1].mbuf;
> +			mb2 = rxep[2].mbuf;
> +			mb3 = rxep[3].mbuf;
> +			mb4 = rxep[4].mbuf;
> +			mb5 = rxep[5].mbuf;
> +			mb6 = rxep[6].mbuf;
> +			mb7 = rxep[7].mbuf;
> +
> +			/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> +			RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> +					offsetof(struct rte_mbuf, buf_addr) + 8);
> +			vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> +			vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> +			vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
> +			vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
> +			vaddr4 = _mm_loadu_si128((__m128i *)&mb4->buf_addr);
> +			vaddr5 = _mm_loadu_si128((__m128i *)&mb5->buf_addr);
> +			vaddr6 = _mm_loadu_si128((__m128i *)&mb6->buf_addr);
> +			vaddr7 = _mm_loadu_si128((__m128i *)&mb7->buf_addr);
> +
> +			/**
> +			 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
> +			 * into the high lanes. Similarly for 2 & 3, and so on.
> +			 */
> +			vaddr0_1 =
> +				_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr0),
> +							vaddr1, 1);
> +			vaddr2_3 =
> +				_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr2),
> +							vaddr3, 1);
> +			vaddr4_5 =
> +				_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr4),
> +							vaddr5, 1);
> +			vaddr6_7 =
> +				_mm256_inserti128_si256(_mm256_castsi128_si256(vaddr6),
> +							vaddr7, 1);
> +			vaddr0_3 =
> +				_mm512_inserti64x4(_mm512_castsi256_si512(vaddr0_1),
> +						   vaddr2_3, 1);
> +			vaddr4_7 =
> +				_mm512_inserti64x4(_mm512_castsi256_si512(vaddr4_5),
> +						   vaddr6_7, 1);
> +
> +			/* convert pa to dma_addr hdr/data */
> +			dma_addr0_3 = _mm512_unpackhi_epi64(vaddr0_3, vaddr0_3);
> +			dma_addr4_7 = _mm512_unpackhi_epi64(vaddr4_7, vaddr4_7);
> +
> +			/* add headroom to pa values */
> +			dma_addr0_3 = _mm512_add_epi64(dma_addr0_3, hdr_room);
> +			dma_addr4_7 = _mm512_add_epi64(dma_addr4_7, hdr_room);
> +
> +			/* flush desc with pa dma_addr */
> +			_mm512_store_si512((__m512i *)&rxdp->read, dma_addr0_3);
> +			_mm512_store_si512((__m512i *)&(rxdp + 4)->read, dma_addr4_7);
> +		}
> +	} else {
> +#endif /* __AVX512VL__*/
> +		struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
> +		__m256i dma_addr0_1, dma_addr2_3;
> +		__m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
> +		/* Initialize the mbufs in vector, process 4 mbufs in one loop */
> +		for (i = 0; i < n; i += 4, rxep += 4, rxdp += 4) {
> +			__m128i vaddr0, vaddr1, vaddr2, vaddr3;
> +			__m256i vaddr0_1, vaddr2_3;
> +
> +			mb0 = rxep[0].mbuf;
> +			mb1 = rxep[1].mbuf;
> +			mb2 = rxep[2].mbuf;
> +			mb3 = rxep[3].mbuf;
> +
> +			/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> +			RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> +					offsetof(struct rte_mbuf, buf_addr) + 8);
> +			vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> +			vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> +			vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
> +			vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
> +
> +			/**
> +			 * merge 0 & 1, by casting 0 to 256-bit and inserting 1
> +			 * into the high lanes. Similarly for 2 & 3
> +			 */
> +			vaddr0_1 = _mm256_inserti128_si256
> +				(_mm256_castsi128_si256(vaddr0), vaddr1, 1);
> +			vaddr2_3 = _mm256_inserti128_si256
> +				(_mm256_castsi128_si256(vaddr2), vaddr3, 1);
> +
> +			/* convert pa to dma_addr hdr/data */
> +			dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
> +			dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
> +
> +			/* add headroom to pa values */
> +			dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
> +			dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
> +
> +			/* flush desc with pa dma_addr */
> +			_mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
> +			_mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
> +		}
> +	}
> +
> +#endif
> +
> +	/* Update the descriptor initializer index */
> +	rxq->rxrearm_start += n;
> +	rx_id = rxq->rxrearm_start - 1;
> +
> +	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> +		rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> +		if (!rxq->rxrearm_start)
> +			rx_id = rxq->nb_rx_desc - 1;
> +		else
> +			rx_id = rxq->rxrearm_start - 1;
> +	}
> +
> +	rxq->rxrearm_nb -= n;
> +
> +	/* Update the tail pointer on the NIC */
> +	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
> +}
>   #endif /* __AVX2__*/
>   
>   #endif /*_I40E_RXTX_COMMON_AVX_H_*/
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx2.c b/drivers/net/i40e/i40e_rxtx_vec_avx2.c
> index c73b2a321b..fcb7ba0273 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_avx2.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_avx2.c
> @@ -25,6 +25,12 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
>   	return i40e_rxq_rearm_common(rxq, false);
>   }
>   
> +static __rte_always_inline void
> +i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
> +{
> +	return i40e_rxq_direct_rearm_common(rxq, false);
> +}
> +
>   #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
>   /* Handles 32B descriptor FDIR ID processing:
>    * rxdp: receive descriptor ring, required to load 2nd 16B half of each desc
> @@ -128,8 +134,12 @@ _recv_raw_pkts_vec_avx2(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>   	/* See if we need to rearm the RX queue - gives the prefetch a bit
>   	 * of time to act
>   	 */
> -	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
> -		i40e_rxq_rearm(rxq);
> +	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
> +		if (rxq->direct_rxrearm_enable)
> +			i40e_rxq_direct_rearm(rxq);
> +		else
> +			i40e_rxq_rearm(rxq);
> +	}
>   
>   	/* Before we start moving massive data around, check to see if
>   	 * there is actually a packet available
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx512.c b/drivers/net/i40e/i40e_rxtx_vec_avx512.c
> index 2e8a3f0df6..d967095edc 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_avx512.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_avx512.c
> @@ -21,6 +21,12 @@
>   
>   #define RTE_I40E_DESCS_PER_LOOP_AVX 8
>   
> +enum i40e_direct_rearm_type_value {
> +	I40E_DIRECT_REARM_TYPE_NORMAL		= 0x0,
> +	I40E_DIRECT_REARM_TYPE_FAST_FREE	= 0x1,
> +	I40E_DIRECT_REARM_TYPE_PRE_FREE		= 0x2,
> +};
> +
>   static __rte_always_inline void
>   i40e_rxq_rearm(struct i40e_rx_queue *rxq)
>   {
> @@ -150,6 +156,241 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
>   	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
>   }
>   
> +static __rte_always_inline void
> +i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
> +{
> +	struct rte_eth_dev *dev;
> +	struct i40e_tx_queue *txq;
> +	volatile union i40e_rx_desc *rxdp;
> +	struct i40e_vec_tx_entry *txep;
> +	struct i40e_rx_entry *rxep;
> +	struct rte_mbuf *m[RTE_I40E_RXQ_REARM_THRESH];
> +	uint16_t tx_port_id, tx_queue_id;
> +	uint16_t rx_id;
> +	uint16_t i, n;
> +	uint16_t j = 0;
> +	uint16_t nb_rearm = 0;
> +	enum i40e_direct_rearm_type_value type;
> +	struct rte_mempool_cache *cache = NULL;
> +
> +	rxdp = rxq->rx_ring + rxq->rxrearm_start;
> +	rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> +	tx_port_id = rxq->direct_rxrearm_port;
> +	tx_queue_id = rxq->direct_rxrearm_queue;
> +	dev = &rte_eth_devices[tx_port_id];
> +	txq = dev->data->tx_queues[tx_queue_id];
> +
> +	/* check Rx queue is able to take in the whole
> +	 * batch of free mbufs from Tx queue
> +	 */
> +	if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
> +		/* check DD bits on threshold descriptor */
> +		if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> +			rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> +			rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
> +			goto mempool_bulk;
> +		}
> +
> +		if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
> +			goto mempool_bulk;
> +
> +		n = txq->tx_rs_thresh;
> +
> +		/* first buffer to free from S/W ring is at index
> +		 * tx_next_dd - (tx_rs_thresh-1)
> +		 */
> +		txep = (void *)txq->sw_ring;
> +		txep += txq->tx_next_dd - (n - 1);
> +
> +		if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> +			/* directly put mbufs from Tx to Rx */
> +			uint32_t copied = 0;
> +			/* n is multiple of 32 */
> +			while (copied < n) {
> +				const __m512i a = _mm512_load_si512(&txep[copied]);
> +				const __m512i b = _mm512_load_si512(&txep[copied + 8]);
> +				const __m512i c = _mm512_load_si512(&txep[copied + 16]);
> +				const __m512i d = _mm512_load_si512(&txep[copied + 24]);
> +
> +				_mm512_storeu_si512(&rxep[copied], a);
> +				_mm512_storeu_si512(&rxep[copied + 8], b);
> +				_mm512_storeu_si512(&rxep[copied + 16], c);
> +				_mm512_storeu_si512(&rxep[copied + 24], d);
> +				copied += 32;
> +			}
> +			type = I40E_DIRECT_REARM_TYPE_FAST_FREE;
> +		} else {
> +			for (i = 0; i < n; i++) {
> +				m[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> +				/* ensure each Tx freed buffer is valid */
> +				if (m[i] != NULL)
> +					nb_rearm++;
> +			}
> +
> +			if (nb_rearm != n) {
> +				txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> +				txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> +				if (txq->tx_next_dd >= txq->nb_tx_desc)
> +					txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> +
> +				goto mempool_bulk;
> +			} else {
> +				type = I40E_DIRECT_REARM_TYPE_PRE_FREE;
> +			}
> +		}
> +
> +	/* update counters for Tx */
> +	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> +	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> +	if (txq->tx_next_dd >= txq->nb_tx_desc)
> +		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> +	} else {
> +mempool_bulk:
> +		cache = rte_mempool_default_cache(rxq->mp, rte_lcore_id());
> +
> +		if (unlikely(!cache))
> +			return i40e_rxq_rearm_common(rxq, true);
> +
> +		n = RTE_I40E_RXQ_REARM_THRESH;
> +
> +		/* We need to pull 'n' more MBUFs into the software ring from mempool
> +		 * We inline the mempool function here, so we can vectorize the copy
> +		 * from the cache into the shadow ring.
> +		 */
> +
> +		if (cache->len < RTE_I40E_RXQ_REARM_THRESH) {
> +			/* No. Backfill the cache first, and then fill from it */
> +			uint32_t req = RTE_I40E_RXQ_REARM_THRESH + (cache->size -
> +					cache->len);
> +
> +			/* How many do we require
> +			 * i.e. number to fill the cache + the request
> +			 */
> +			int ret = rte_mempool_ops_dequeue_bulk(rxq->mp,
> +					&cache->objs[cache->len], req);
> +			if (ret == 0) {
> +				cache->len += req;
> +			} else {
> +				if (rxq->rxrearm_nb + RTE_I40E_RXQ_REARM_THRESH >=
> +						rxq->nb_rx_desc) {
> +					__m128i dma_addr0;
> +
> +					dma_addr0 = _mm_setzero_si128();
> +					for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
> +						rxep[i].mbuf = &rxq->fake_mbuf;
> +						_mm_store_si128
> +							((__m128i *)&rxdp[i].read,
> +								dma_addr0);
> +					}
> +				}
> +				rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
> +						RTE_I40E_RXQ_REARM_THRESH;
> +				return;
> +			}
> +		}
> +
> +		type = I40E_DIRECT_REARM_TYPE_NORMAL;
> +	}
> +
> +	const __m512i iova_offsets =  _mm512_set1_epi64
> +		(offsetof(struct rte_mbuf, buf_iova));
> +	const __m512i headroom = _mm512_set1_epi64(RTE_PKTMBUF_HEADROOM);
> +
> +#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> +	/* to shuffle the addresses to correct slots. Values 4-7 will contain
> +	 * zeros, so use 7 for a zero-value.
> +	 */
> +	const __m512i permute_idx = _mm512_set_epi64(7, 7, 3, 1, 7, 7, 2, 0);
> +#else
> +	const __m512i permute_idx = _mm512_set_epi64(7, 3, 6, 2, 5, 1, 4, 0);
> +#endif
> +
> +	__m512i mbuf_ptrs;
> +
> +	/* Initialize the mbufs in vector, process 8 mbufs in one loop, taking
> +	 * from mempool cache and populating both shadow and HW rings
> +	 */
> +	for (i = 0; i < RTE_I40E_RXQ_REARM_THRESH / 8; i++) {
> +		switch (type) {
> +		case I40E_DIRECT_REARM_TYPE_FAST_FREE:
> +			mbuf_ptrs = _mm512_loadu_si512(rxep);
> +			break;
> +		case I40E_DIRECT_REARM_TYPE_PRE_FREE:
> +			mbuf_ptrs = _mm512_loadu_si512(&m[j]);
> +			_mm512_store_si512(rxep, mbuf_ptrs);
> +			j += 8;
> +			break;
> +		case I40E_DIRECT_REARM_TYPE_NORMAL:
> +			mbuf_ptrs = _mm512_loadu_si512
> +				(&cache->objs[cache->len - 8]);
> +			_mm512_store_si512(rxep, mbuf_ptrs);
> +			cache->len -= 8;
> +			break;
> +		}
> +
> +		/* gather iova of mbuf0-7 into one zmm reg */
> +		const __m512i iova_base_addrs = _mm512_i64gather_epi64
> +			(_mm512_add_epi64(mbuf_ptrs, iova_offsets),
> +				0, /* base */
> +				1 /* scale */);
> +		const __m512i iova_addrs = _mm512_add_epi64(iova_base_addrs,
> +				headroom);
> +#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> +		const __m512i iovas0 = _mm512_castsi256_si512
> +			(_mm512_extracti64x4_epi64(iova_addrs, 0));
> +		const __m512i iovas1 = _mm512_castsi256_si512
> +			(_mm512_extracti64x4_epi64(iova_addrs, 1));
> +
> +		/* permute leaves desc 2-3 addresses in header address slots 0-1
> +		 * but these are ignored by driver since header split not
> +		 * enabled. Similarly for desc 4 & 5.
> +		 */
> +		const __m512i desc_rd_0_1 = _mm512_permutexvar_epi64
> +			(permute_idx, iovas0);
> +		const __m512i desc_rd_2_3 = _mm512_bsrli_epi128(desc_rd_0_1, 8);
> +
> +		const __m512i desc_rd_4_5 = _mm512_permutexvar_epi64
> +			(permute_idx, iovas1);
> +		const __m512i desc_rd_6_7 = _mm512_bsrli_epi128(desc_rd_4_5, 8);
> +
> +		_mm512_store_si512((void *)rxdp, desc_rd_0_1);
> +		_mm512_store_si512((void *)(rxdp + 2), desc_rd_2_3);
> +		_mm512_store_si512((void *)(rxdp + 4), desc_rd_4_5);
> +		_mm512_store_si512((void *)(rxdp + 6), desc_rd_6_7);
> +#else
> +		/* permute leaves desc 4-7 addresses in header address slots 0-3
> +		 * but these are ignored by driver since header split not
> +		 * enabled.
> +		 */
> +		const __m512i desc_rd_0_3 = _mm512_permutexvar_epi64
> +			(permute_idx, iova_addrs);
> +		const __m512i desc_rd_4_7 = _mm512_bsrli_epi128(desc_rd_0_3, 8);
> +
> +		_mm512_store_si512((void *)rxdp, desc_rd_0_3);
> +		_mm512_store_si512((void *)(rxdp + 4), desc_rd_4_7);
> +#endif
> +		rxdp += 8, rxep += 8;
> +	}
> +
> +	/* Update the descriptor initializer index */
> +	rxq->rxrearm_start += n;
> +	rx_id = rxq->rxrearm_start - 1;
> +
> +	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> +		rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> +		if (!rxq->rxrearm_start)
> +			rx_id = rxq->nb_rx_desc - 1;
> +		else
> +			rx_id = rxq->rxrearm_start - 1;
> +	}
> +
> +	rxq->rxrearm_nb -= n;
> +
> +	/* Update the tail pointer on the NIC */
> +	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
> +}
> +
>   #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
>   /* Handles 32B descriptor FDIR ID processing:
>    * rxdp: receive descriptor ring, required to load 2nd 16B half of each desc
> @@ -252,8 +493,12 @@ _recv_raw_pkts_vec_avx512(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>   	/* See if we need to rearm the RX queue - gives the prefetch a bit
>   	 * of time to act
>   	 */
> -	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
> -		i40e_rxq_rearm(rxq);
> +	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
> +		if (rxq->direct_rxrearm_enable)
> +			i40e_rxq_direct_rearm(rxq);
> +		else
> +			i40e_rxq_rearm(rxq);
> +	}
>   
>   	/* Before we start moving massive data around, check to see if
>   	 * there is actually a packet available
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> index fa9e6582c5..dc78e3c90b 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> @@ -77,6 +77,139 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
>   	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
>   }
>   
> +static inline void
> +i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
> +{
> +	struct rte_eth_dev *dev;
> +	struct i40e_tx_queue *txq;
> +	volatile union i40e_rx_desc *rxdp;
> +	struct i40e_tx_entry *txep;
> +	struct i40e_rx_entry *rxep;
> +	uint16_t tx_port_id, tx_queue_id;
> +	uint16_t rx_id;
> +	struct rte_mbuf *mb0, *mb1, *m;
> +	uint64x2_t dma_addr0, dma_addr1;
> +	uint64x2_t zero = vdupq_n_u64(0);
> +	uint64_t paddr;
> +	uint16_t i, n;
> +	uint16_t nb_rearm = 0;
> +
> +	rxdp = rxq->rx_ring + rxq->rxrearm_start;
> +	rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> +	tx_port_id = rxq->direct_rxrearm_port;
> +	tx_queue_id = rxq->direct_rxrearm_queue;
> +	dev = &rte_eth_devices[tx_port_id];
> +	txq = dev->data->tx_queues[tx_queue_id];
> +
> +	/* check Rx queue is able to take in the whole
> +	 * batch of free mbufs from Tx queue
> +	 */
> +	if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
> +		/* check DD bits on threshold descriptor */
> +		if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> +				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> +				rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
> +			goto mempool_bulk;
> +		}
> +
> +		n = txq->tx_rs_thresh;
> +
> +		/* first buffer to free from S/W ring is at index
> +		 * tx_next_dd - (tx_rs_thresh-1)
> +		 */
> +		txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> +
> +		if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> +			/* directly put mbufs from Tx to Rx,
> +			 * and initialize the mbufs in vector
> +			 */
> +			for (i = 0; i < n; i++, rxep++, txep++) {
> +				rxep[0].mbuf = txep[0].mbuf;
> +
> +				/* Initialize rxdp descs */
> +				mb0 = txep[0].mbuf;
> +
> +				paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
> +				dma_addr0 = vdupq_n_u64(paddr);
> +				/* flush desc with pa dma_addr */
> +				vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
> +			}
> +		} else {
> +			for (i = 0; i < n; i++) {
> +				m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> +				if (m != NULL) {
> +					rxep[i].mbuf = m;
> +
> +					/* Initialize rxdp descs */
> +					paddr = m->buf_iova + RTE_PKTMBUF_HEADROOM;
> +					dma_addr0 = vdupq_n_u64(paddr);
> +					/* flush desc with pa dma_addr */
> +					vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
> +					nb_rearm++;
> +				}
> +			}
> +			n = nb_rearm;
> +		}
> +
> +		/* update counters for Tx */
> +		txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> +		txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> +		if (txq->tx_next_dd >= txq->nb_tx_desc)
> +			txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> +	} else {
> +mempool_bulk:
> +		/* if TX did not free bufs into Rx sw-ring,
> +		 * get new bufs from mempool
> +		 */
> +		n = RTE_I40E_RXQ_REARM_THRESH;
> +		if (unlikely(rte_mempool_get_bulk(rxq->mp, (void *)rxep, n) < 0)) {
> +			if (rxq->rxrearm_nb + n >= rxq->nb_rx_desc) {
> +				for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
> +					rxep[i].mbuf = &rxq->fake_mbuf;
> +					vst1q_u64((uint64_t *)&rxdp[i].read, zero);
> +				}
> +			}
> +			rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed += n;
> +			return;
> +		}
> +
> +		/* Initialize the mbufs in vector, process 2 mbufs in one loop */
> +		for (i = 0; i < n; i += 2, rxep += 2) {
> +			mb0 = rxep[0].mbuf;
> +			mb1 = rxep[1].mbuf;
> +
> +			paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
> +			dma_addr0 = vdupq_n_u64(paddr);
> +			/* flush desc with pa dma_addr */
> +			vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
> +
> +			paddr = mb1->buf_iova + RTE_PKTMBUF_HEADROOM;
> +			dma_addr1 = vdupq_n_u64(paddr);
> +			/* flush desc with pa dma_addr */
> +			vst1q_u64((uint64_t *)&rxdp++->read, dma_addr1);
> +		}
> +	}
> +
> +	/* Update the descriptor initializer index */
> +	rxq->rxrearm_start += n;
> +	rx_id = rxq->rxrearm_start - 1;
> +
> +	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> +		rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> +		if (!rxq->rxrearm_start)
> +			rx_id = rxq->nb_rx_desc - 1;
> +		else
> +			rx_id = rxq->rxrearm_start - 1;
> +	}
> +
> +	rxq->rxrearm_nb -= n;
> +
> +	rte_io_wmb();
> +	/* Update the tail pointer on the NIC */
> +	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
> +}
> +
>   #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
>   /* NEON version of FDIR mark extraction for 4 32B descriptors at a time */
>   static inline uint32x4_t
> @@ -381,8 +514,12 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
>   	/* See if we need to rearm the RX queue - gives the prefetch a bit
>   	 * of time to act
>   	 */
> -	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
> -		i40e_rxq_rearm(rxq);
> +	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
> +		if (rxq->direct_rxrearm_enable)
> +			i40e_rxq_direct_rearm(rxq);
> +		else
> +			i40e_rxq_rearm(rxq);
> +	}
>   
>   	/* Before we start moving massive data around, check to see if
>   	 * there is actually a packet available
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_sse.c b/drivers/net/i40e/i40e_rxtx_vec_sse.c
> index 3782e8052f..b2f1ab2c8d 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_sse.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_sse.c
> @@ -89,6 +89,168 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
>   	I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
>   }
>   
> +static inline void
> +i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
> +{
> +	struct rte_eth_dev *dev;
> +	struct i40e_tx_queue *txq;
> +	volatile union i40e_rx_desc *rxdp;
> +	struct i40e_tx_entry *txep;
> +	struct i40e_rx_entry *rxep;
> +	uint16_t tx_port_id, tx_queue_id;
> +	uint16_t rx_id;
> +	struct rte_mbuf *mb0, *mb1, *m;
> +	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
> +			RTE_PKTMBUF_HEADROOM);
> +	__m128i dma_addr0, dma_addr1;
> +	__m128i vaddr0, vaddr1;
> +	uint16_t i, n;
> +	uint16_t nb_rearm = 0;
> +
> +	rxdp = rxq->rx_ring + rxq->rxrearm_start;
> +	rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> +	tx_port_id = rxq->direct_rxrearm_port;
> +	tx_queue_id = rxq->direct_rxrearm_queue;
> +	dev = &rte_eth_devices[tx_port_id];
> +	txq = dev->data->tx_queues[tx_queue_id];
> +
> +	/* check Rx queue is able to take in the whole
> +	 * batch of free mbufs from Tx queue
> +	 */
> +	if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
> +		/* check DD bits on threshold descriptor */
> +		if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> +				rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> +				rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
> +			goto mempool_bulk;
> +		}
> +
> +		n = txq->tx_rs_thresh;
> +
> +		/* first buffer to free from S/W ring is at index
> +		 * tx_next_dd - (tx_rs_thresh-1)
> +		 */
> +		txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> +
> +		if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> +			/* directly put mbufs from Tx to Rx,
> +			 * and initialize the mbufs in vector
> +			 */
> +			for (i = 0; i < n; i++, rxep++, txep++) {
> +				rxep[0].mbuf = txep[0].mbuf;
> +
> +				/* Initialize rxdp descs */
> +				mb0 = txep[0].mbuf;
> +
> +				/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> +				RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> +						offsetof(struct rte_mbuf, buf_addr) + 8);
> +				vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> +
> +				/* convert pa to dma_addr hdr/data */
> +				dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> +
> +				/* add headroom to pa values */
> +				dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> +
> +				/* flush desc with pa dma_addr */
> +				_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> +			}
> +		} else {
> +			for (i = 0; i < n; i++) {
> +				m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> +				if (m != NULL) {
> +					rxep[i].mbuf = m;
> +
> +					/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> +					RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> +							offsetof(struct rte_mbuf, buf_addr) + 8);
> +					vaddr0 = _mm_loadu_si128((__m128i *)&m->buf_addr);
> +
> +					/* convert pa to dma_addr hdr/data */
> +					dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> +
> +					/* add headroom to pa values */
> +					dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> +
> +					/* flush desc with pa dma_addr */
> +					_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> +					nb_rearm++;
> +				}
> +			}
> +			n = nb_rearm;
> +		}
> +
> +		/* update counters for Tx */
> +		txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> +		txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> +		if (txq->tx_next_dd >= txq->nb_tx_desc)
> +			txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> +	} else {
> +mempool_bulk:
> +		/* if TX did not free bufs into Rx sw-ring,
> +		 * get new bufs from mempool
> +		 */
> +		n = RTE_I40E_RXQ_REARM_THRESH;
> +		/* Pull 'n' more MBUFs into the software ring */
> +		if (rte_mempool_get_bulk(rxq->mp, (void *)rxep, n) < 0) {
> +			if (rxq->rxrearm_nb + n >= rxq->nb_rx_desc) {
> +				dma_addr0 = _mm_setzero_si128();
> +				for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
> +					rxep[i].mbuf = &rxq->fake_mbuf;
> +					_mm_store_si128((__m128i *)&rxdp[i].read,
> +							dma_addr0);
> +				}
> +			}
> +			rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
> +				RTE_I40E_RXQ_REARM_THRESH;
> +			return;
> +		}
> +
> +		/* Initialize the mbufs in vector, process 2 mbufs in one loop */
> +		for (i = 0; i < RTE_I40E_RXQ_REARM_THRESH; i += 2, rxep += 2) {
> +			mb0 = rxep[0].mbuf;
> +			mb1 = rxep[1].mbuf;
> +
> +			/* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> +			RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> +					offsetof(struct rte_mbuf, buf_addr) + 8);
> +			vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> +			vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> +
> +			/* convert pa to dma_addr hdr/data */
> +			dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> +			dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
> +
> +			/* add headroom to pa values */
> +			dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> +			dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
> +
> +			/* flush desc with pa dma_addr */
> +			_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> +			_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
> +		}
> +	}
> +
> +	/* Update the descriptor initializer index */
> +	rxq->rxrearm_start += n;
> +	rx_id = rxq->rxrearm_start - 1;
> +
> +	if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> +		rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> +		if (!rxq->rxrearm_start)
> +			rx_id = rxq->nb_rx_desc - 1;
> +		else
> +			rx_id = rxq->rxrearm_start - 1;
> +	}
> +
> +	rxq->rxrearm_nb -= n;
> +
> +	/* Update the tail pointer on the NIC */
> +	I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
> +}
> +
>   #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
>   /* SSE version of FDIR mark extraction for 4 32B descriptors at a time */
>   static inline __m128i
> @@ -394,8 +556,12 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
>   	/* See if we need to rearm the RX queue - gives the prefetch a bit
>   	 * of time to act
>   	 */
> -	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
> -		i40e_rxq_rearm(rxq);
> +	if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
> +		if (rxq->direct_rxrearm_enable)
> +			i40e_rxq_direct_rearm(rxq);
> +		else
> +			i40e_rxq_rearm(rxq);
> +	}
>   
>   	/* Before we start moving massive data around, check to see if
>   	 * there is actually a packet available


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 4/5] net/i40e: add direct rearm mode internal API
  2022-04-20  8:16 ` [PATCH v1 4/5] net/i40e: add direct rearm mode internal API Feifei Wang
@ 2022-05-11 22:31   ` Konstantin Ananyev
  0 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2022-05-11 22:31 UTC (permalink / raw)
  To: Feifei Wang, Beilei Xing; +Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang

20/04/2022 09:16, Feifei Wang wrote:
> For direct rearm mode, add two internal functions.
> 
> One is to enable direct rearm mode in Rx queue.
> 
> The other is to map Tx queue with Rx queue to make Rx queue take
> buffers from the specific Tx queue.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>   drivers/net/i40e/i40e_ethdev.c | 34 ++++++++++++++++++++++++++++++++++
>   1 file changed, 34 insertions(+)
> 
> diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
> index 755786dc10..9e1a523bcc 100644
> --- a/drivers/net/i40e/i40e_ethdev.c
> +++ b/drivers/net/i40e/i40e_ethdev.c
> @@ -369,6 +369,13 @@ static int i40e_dev_rx_queue_intr_enable(struct rte_eth_dev *dev,
>   static int i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev,
>   					  uint16_t queue_id);
>   
> +static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
> +						uint16_t queue_id);
> +static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
> +						uint16_t rx_queue_id,
> +						uint16_t tx_port_id,
> +						uint16_t tx_queue_id);
> +
>   static int i40e_get_regs(struct rte_eth_dev *dev,
>   			 struct rte_dev_reg_info *regs);
>   
> @@ -477,6 +484,8 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
>   	.rx_queue_setup               = i40e_dev_rx_queue_setup,
>   	.rx_queue_intr_enable         = i40e_dev_rx_queue_intr_enable,
>   	.rx_queue_intr_disable        = i40e_dev_rx_queue_intr_disable,
> +	.rx_queue_direct_rearm_enable = i40e_dev_rx_queue_direct_rearm_enable,
> +	.rx_queue_direct_rearm_map    = i40e_dev_rx_queue_direct_rearm_map,
>   	.rx_queue_release             = i40e_dev_rx_queue_release,
>   	.tx_queue_setup               = i40e_dev_tx_queue_setup,
>   	.tx_queue_release             = i40e_dev_tx_queue_release,
> @@ -11108,6 +11117,31 @@ i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id)
>   	return 0;
>   }
>   
> +static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
> +			uint16_t queue_id)
> +{
> +	struct i40e_rx_queue *rxq;
> +
> +	rxq = dev->data->rx_queues[queue_id];
> +	rxq->direct_rxrearm_enable = 1;
> +
> +	return 0;
> +}
> +
> +static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
> +				uint16_t rx_queue_id, uint16_t tx_port_id,
> +				uint16_t tx_queue_id)
> +{
> +	struct i40e_rx_queue *rxq;
> +
> +	rxq = dev->data->rx_queues[rx_queue_id];
> +
> +	rxq->direct_rxrearm_port = tx_port_id;
> +	rxq->direct_rxrearm_queue = tx_queue_id;

I don't think this function should enable that mode blindly.
Instead, it needs to check first that all preconditions are met
(tx/rx threshold values are equal, etc.).
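
A minimal sketch of what such validation could look like - field names are
the ones from this patch series, the -EINVAL error code is my assumption,
and note that the "same mempool" constraint cannot really be verified here
because it depends on what the application transmits:

static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
				uint16_t rx_queue_id, uint16_t tx_port_id,
				uint16_t tx_queue_id)
{
	struct i40e_rx_queue *rxq = dev->data->rx_queues[rx_queue_id];
	struct rte_eth_dev *tx_dev = &rte_eth_devices[tx_port_id];
	struct i40e_tx_queue *txq = tx_dev->data->tx_queues[tx_queue_id];

	if (rxq == NULL || txq == NULL)
		return -EINVAL;

	/* the TX free batch must match the RX rearm batch */
	if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
		return -EINVAL;

	rxq->direct_rxrearm_port = tx_port_id;
	rxq->direct_rxrearm_queue = tx_queue_id;

	return 0;
}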


> +
> +	return 0;
> +}
> +
>   /**
>    * This function is used to check if the register is valid.
>    * Below is the valid registers list for X722 only:


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
  2022-04-20  8:16 ` [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode Feifei Wang
  2022-04-20 10:10   ` Morten Brørup
@ 2022-05-11 22:33   ` Konstantin Ananyev
  1 sibling, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2022-05-11 22:33 UTC (permalink / raw)
  To: dev

20/04/2022 09:16, Feifei Wang wrote:
> Enable direct rearm mode. The mapping is decided in the data plane based
> on the first packet received.
> 
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>   examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> index bec22c44cd..38ffdf4636 100644
> --- a/examples/l3fwd/l3fwd_lpm.c
> +++ b/examples/l3fwd/l3fwd_lpm.c
> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
>   	unsigned lcore_id;
>   	uint64_t prev_tsc, diff_tsc, cur_tsc;
>   	int i, nb_rx;
> -	uint16_t portid;
> +	uint16_t portid, tx_portid;
>   	uint8_t queueid;
>   	struct lcore_conf *qconf;
>   	const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
>   
>   	const uint16_t n_rx_q = qconf->n_rx_queue;
>   	const uint16_t n_tx_p = qconf->n_tx_port;
> +	int direct_rearm_map[n_rx_q];
> +
>   	if (n_rx_q == 0) {
>   		RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
>   		return 0;
> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
>   
>   		portid = qconf->rx_queue_list[i].port_id;
>   		queueid = qconf->rx_queue_list[i].queue_id;
> +		direct_rearm_map[i] = 0;
>   		RTE_LOG(INFO, L3FWD,
>   			" -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
>   			lcore_id, portid, queueid);
> @@ -209,6 +212,17 @@ lpm_main_loop(__rte_unused void *dummy)
>   			if (nb_rx == 0)
>   				continue;
>   
> +			/* Determine the direct rearm mapping based on the first
> +			 * packet received on the rx queue
> +			 */
> +			if (direct_rearm_map[i] == 0) {
> +				tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
> +							portid);
> +				rte_eth_direct_rxrearm_map(portid, queueid,
> +								tx_portid, queueid);
> +				direct_rearm_map[i] = 1;
> +			}
> +

That just doesn't look right to me: why make the decision based on the
first packet?
What would happen if the second and all subsequent packets have to be
routed to different ports?
In fact, this direct-rearm mode seems suitable only for hard-coded
one-to-one mapped forwarding (examples/l2fwd, testpmd).
For l3fwd it can be used safely only when we have one port in use.
Also, I think it should be selected at init time, and it shouldn't be on
by default.
To summarize, my opinion: a special cmd-line parameter to enable it,
allowable only when we run l3fwd over one port.
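
Roughly like this - just a sketch, the option name and the plumbing into
parse_args() are mine, only enabled_port_mask and rte_exit() are existing
l3fwd/DPDK pieces:

/* sketch: a "--direct-rearm" long option, off by default */
static int direct_rearm_enabled; /* new global, default 0 */

/* in parse_args(), next to the other long options */
if (strcmp(lgopts[option_index].name, "direct-rearm") == 0)
	direct_rearm_enabled = 1;

/* after argument parsing, before launching the lcores */
if (direct_rearm_enabled &&
		__builtin_popcount(enabled_port_mask) != 1)
	rte_exit(EXIT_FAILURE,
		"--direct-rearm is only supported with a single port\n");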


>   #if defined RTE_ARCH_X86 || defined __ARM_NEON \
>   			 || defined RTE_ARCH_PPC_64
>   			l3fwd_lpm_send_packets(nb_rx, pkts_burst,


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
  2022-04-20  8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
                   ` (4 preceding siblings ...)
  2022-04-20  8:16 ` [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode Feifei Wang
@ 2022-05-11 23:00 ` Konstantin Ananyev
       [not found] ` <20220516061012.618787-1-feifei.wang2@arm.com>
  6 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2022-05-11 23:00 UTC (permalink / raw)
  To: Feifei Wang; +Cc: dev, nd


> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into it's software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache.
> 
> However, this solution poses several constraints:
> 
> 1)The receive queue needs to know which transmit queue it should take
> the buffers from. The application logic decides which transmit port to
> use to send out the packets. In many use cases the NIC might have a
> single port ([1], [2], [3]), in which case a given transmit queue is
> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> is easy to configure.
> 
> If the NIC has 2 ports (there are several references), then we will have
> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> However, if this is generalized to 'N' ports, the configuration can be
> long. More over the PMD would have to scan a list of transmit queues to
> pull the buffers from.

Just to re-iterate some generic concerns about this proposal:
  - We effectively link RX and TX queues - when this feature is enabled,
    the user can't stop a TX queue without stopping the linked RX queue
    first. Right now the user is free to start/stop any queues at will.
    If that feature allows linking queues from different ports, then even
    the ports will become dependent and the user will have to take extra
    care when managing such ports.
  - Very limited usage scenario - it will have a positive effect only
    when we have a fixed forwarding mapping: all (or nearly all) packets
    from the RX queue are forwarded into the same TX queue.

I wonder, did you have a chance to consider a mempool-cache ZC API,
similar to the one we have for the ring?
It would allow us, on the TX free path, to avoid copying mbufs to a
temporary array on the stack.
Instead we can put them straight from the TX SW ring to the mempool cache.
That should save an extra store/load per mbuf and might help to achieve
some performance gain without bypassing the mempool.
It probably wouldn't be as fast as what you are proposing,
but might be fast enough to consider as an alternative.
Again, it would be a generic one, so we can avoid all
these implications and limitations.


> 2)The other factor that needs to be considered is 'run-to-completion' vs
> 'pipeline' models. In the run-to-completion model, the receive side and
> the transmit side are running on the same lcore serially. In the pipeline
> model. The receive side and transmit side might be running on different
> lcores in parallel. This requires locking. This is not supported at this
> point.
> 
> 3)Tx and Rx buffers must be from the same mempool. And we also must
> ensure Tx buffer free number is equal to Rx buffer free number:
> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
> Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> is due to tx_next_dd is a variable to compute tx sw-ring free location.
> Its value will be one more round than the position where next time free
> starts.
> 
> Current status in this RFC:
> 1)An API is added to allow for mapping a TX queue to a RX queue.
>    Currently it supports 1:1 mapping.
> 2)The i40e driver is changed to do the direct re-arm of the receive
>    side.
> 3)L3fwd application is modified to do the direct rearm mapping
> automatically without user config. This follows the rules that the
> thread can map TX queue to a RX queue based on the first received
> package destination port.
> 
> Testing status:
> 1.The testing results for L3fwd are as follows:
> -------------------------------------------------------------------
> enabled direct rearm
> -------------------------------------------------------------------
> Arm:
> N1SDP(neon path):
> without fast-free mode		with fast-free mode
> 	+14.1%				+7.0%
> 
> Ampere Altra(neon path):
> without fast-free mode		with fast-free mode
> 	+17.1				+14.0%
> 
> X86:
> Dell-8268(limit frequency):
> sse path:
> without fast-free mode		with fast-free mode
> 	+6.96%				+2.02%
> avx2 path:
> without fast-free mode		with fast-free mode
> 	+9.04%				+7.75%
> avx512 path:
> without fast-free mode		with fast-free mode
> 	+5.43%				+1.57%
> -------------------------------------------------------------------
> This patch can not affect base performance of normal mode.
> Furthermore, the reason for that limiting the CPU frequency is
> that dell-8268 can encounter i40e NIC bottleneck with maximum
> frequency.
> 
> 2.The testing results for VPP-L3fwd are as follows:
> -------------------------------------------------------------------
> Arm:
> N1SDP(neon path):
> with direct re-arm mode enabled
> 	+7.0%
> -------------------------------------------------------------------
> For Ampere Altra and X86,VPP-L3fwd test has not been done.
> 
> Reference:
> [1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
> [3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g
> 
> Feifei Wang (5):
>    net/i40e: remove redundant Dtype initialization
>    net/i40e: enable direct rearm mode
>    ethdev: add API for direct rearm mode
>    net/i40e: add direct rearm mode internal API
>    examples/l3fwd: enable direct rearm mode
> 
>   drivers/net/i40e/i40e_ethdev.c          |  34 +++
>   drivers/net/i40e/i40e_rxtx.c            |   4 -
>   drivers/net/i40e/i40e_rxtx.h            |   4 +
>   drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
>   drivers/net/i40e/i40e_rxtx_vec_avx2.c   |  14 +-
>   drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
>   drivers/net/i40e/i40e_rxtx_vec_neon.c   | 141 ++++++++++++-
>   drivers/net/i40e/i40e_rxtx_vec_sse.c    | 170 ++++++++++++++-
>   examples/l3fwd/l3fwd_lpm.c              |  16 +-
>   lib/ethdev/ethdev_driver.h              |  15 ++
>   lib/ethdev/rte_ethdev.c                 |  14 ++
>   lib/ethdev/rte_ethdev.h                 |  31 +++
>   lib/ethdev/version.map                  |   1 +
>   13 files changed, 949 insertions(+), 13 deletions(-)
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
       [not found] ` <20220516061012.618787-1-feifei.wang2@arm.com>
@ 2022-05-24  1:25   ` Konstantin Ananyev
  2022-05-24 12:40     ` Morten Brørup
  2022-05-24 20:14     ` Honnappa Nagarahalli
  0 siblings, 2 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2022-05-24  1:25 UTC (permalink / raw)
  To: Feifei Wang; +Cc: nd, dev, ruifeng.wang, honnappa.nagarahalli

16/05/2022 07:10, Feifei Wang wrote:
> 
>>> Currently, the transmit side frees the buffers into the lcore cache and
>>> the receive side allocates buffers from the lcore cache. The transmit
>>> side typically frees 32 buffers resulting in 32*8=256B of stores to
>>> lcore cache. The receive side allocates 32 buffers and stores them in
>>> the receive side software ring, resulting in 32*8=256B of stores and
>>> 256B of load from the lcore cache.
>>>
>>> This patch proposes a mechanism to avoid freeing to/allocating from
>>> the lcore cache. i.e. the receive side will free the buffers from
>>> transmit side directly into it's software ring. This will avoid the 256B
>>> of loads and stores introduced by the lcore cache. It also frees up the
>>> cache lines used by the lcore cache.
>>>
>>> However, this solution poses several constraints:
>>>
>>> 1)The receive queue needs to know which transmit queue it should take
>>> the buffers from. The application logic decides which transmit port to
>>> use to send out the packets. In many use cases the NIC might have a
>>> single port ([1], [2], [3]), in which case a given transmit queue is
>>> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
>>> is easy to configure.
>>>
>>> If the NIC has 2 ports (there are several references), then we will have
>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
>>> However, if this is generalized to 'N' ports, the configuration can be
>>> long. More over the PMD would have to scan a list of transmit queues to
>>> pull the buffers from.
> 
>> Just to re-iterate some generic concerns about this proposal:
>>   - We effectively link RX and TX queues - when this feature is enabled,
>>     user can't stop TX queue without stopping linked RX queue first.
>>     Right now user is free to start/stop any queues at his will.
>>     If that feature will allow to link queues from different ports,
>>     then even ports will become dependent and user will have to pay extra
>>     care when managing such ports.
> 
> [Feifei] When direct rearm is enabled, there are two paths for the thread
> to choose from. If there are enough freed Tx buffers, Rx can take buffers
> from Tx. Otherwise, Rx will take buffers from the mempool as usual. Thus,
> users do not need to pay much attention to managing ports.

What I am talking about: right now different ports, or different queues of
the same port, can be treated as independent entities:
in general the user is free to start/stop (and even reconfigure in some
cases) one entity without needing to stop another entity.
I.e. the user can stop and re-configure a TX queue while continuing to
receive packets from an RX queue.
With direct re-arm enabled, I think that wouldn't be possible any more:
before stopping/reconfiguring a TX queue the user would have to make sure
that the corresponding RX queue wouldn't be used by the datapath.

> 
>> - very limited usage scenario - it will have a positive effect only
>>    when we have a fixed forwarding mapping: all (or nearly all) packets
>>    from the RX queue are forwarded into the same TX queue.
> 
> [Feifei] Although the usage scenario is limited, this usage scenario has
> a wide range of applications, such as NICs with one port.

Yes, there are NICs with one port, but there is no guarantee there won't be
several such NICs within the system.

> Furthermore, I think this is a tradeoff between performance and
> flexibility. Our goal is to achieve the best performance; this means we
> need to give up some flexibility decisively. For example, 'FAST_FREE mode'
> deletes most of the buffer checks (refcnt > 1, external buffer, chain
> buffer), chooses the shortest path, and thereby achieves a significant
> performance improvement.
>> Wonder did you had a chance to consider mempool-cache ZC API,
>> similar to one we have for the ring?
>> It would allow us on TX free path to avoid copying mbufs to
>> temporary array on the stack.
>> Instead we can put them straight from TX SW ring to the mempool cache.
>> That should save extra store/load for mbuf and might help to achieve 
>> some performance gain without by-passing mempool.
>> It probably wouldn't be as fast as what you proposing,
>> but might be fast enough to consider as alternative.
>> Again, it would be a generic one, so we can avoid all
>> these implications and limitations.
> 
> [Feifei] I think this is a good try. However, the most important thing
> is whether we can bypass the mempool decisively to pursue the
> significant performance gains.

I understand the intention, but I personally think this is a wrong
and dangerous attitude.
We have the mempool abstraction in place for a very good reason.
So we need to try to improve mempool performance (and the API, if necessary)
in the first place, not avoid it and break our own rules and recommendations.


> For ZC, there may be a problem with it in i40e. The reason for putting Tx
> buffers into a temporary array is that i40e_tx_entry includes both a buffer
> pointer and an index. Thus we cannot put a Tx SW-ring entry into the
> mempool directly; we need to first extract the mbuf pointer. Finally, even
> though we use ZC, we still can't avoid using a temporary array on the stack
> to extract the Tx buffer pointers.

When talking about a ZC API for the mempool cache, I meant something like:

void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
	uint32_t *nb_elem, uint32_t flags);
void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
	uint32_t nb_elem);

i.e. _start_ will return to the user a pointer inside the mp-cache where
the free elems can be put, plus the max number of slots that can be safely
filled.
_finish_ will update mc->len.
As an example:

/* expect to free N mbufs */
uint32_t n = N;
void **p = mempool_cache_put_zc_start(mc, &n, ...);

/* free up to n elems */
for (i = 0; i != n; i++) {

   /* get next free mbuf from somewhere */
   mb = extract_and_prefree_mbuf(...);

   /* no more free mbufs for now */
   if (mb == NULL)
      break;

   p[i] = mb;
}

/* finalize ZC put, with _i_ freed elems */
mempool_cache_put_zc_finish(mc, i);

That way, I think we can overcome the issue with i40e_tx_entry
you mentioned above. Plus it might be useful in other similar places.

Another alternative is obviously to split i40e_tx_entry into two structs
(one for the mbuf pointer, a second for its metadata) and have a separate
array for each of them.
Though with that approach we need to make sure no perf drops are
introduced, plus probably more code changes will be required.
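
(A rough sketch of that split, for illustration only. The i40e_tx_entry
fields are paraphrased, and the *_meta / *_sw_ring_split names are
invented.)

/* Today i40e keeps the mbuf pointer and its bookkeeping in one entry,
 * roughly: struct i40e_tx_entry { struct rte_mbuf *mbuf;
 *                                 uint16_t next_id; uint16_t last_id; };
 * Splitting it keeps the mbuf pointers contiguous, so the whole array can
 * be handed to a bulk/ZC mempool put without an extraction loop. */
#include <stdint.h>
#include <rte_mbuf.h>

struct i40e_tx_entry_meta {             /* invented name */
	uint16_t next_id;
	uint16_t last_id;
};

struct i40e_tx_sw_ring_split {          /* invented name */
	struct rte_mbuf **mbufs;            /* contiguous mbuf pointers */
	struct i40e_tx_entry_meta *meta;    /* per-descriptor metadata */
};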






^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
  2022-05-24  1:25   ` Konstantin Ananyev
@ 2022-05-24 12:40     ` Morten Brørup
  2022-05-24 20:14     ` Honnappa Nagarahalli
  1 sibling, 0 replies; 27+ messages in thread
From: Morten Brørup @ 2022-05-24 12:40 UTC (permalink / raw)
  To: Konstantin Ananyev, Feifei Wang
  Cc: nd, dev, ruifeng.wang, honnappa.nagarahalli

> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> Sent: Tuesday, 24 May 2022 03.26


> > Furthermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve the best performance, which means we need to
> > give up some flexibility decisively. Take 'FAST_FREE mode' as an
> > example: it removes most of the buffer checks (refcnt > 1, external
> > buffer, chained buffer), chooses a shorter path, and thereby achieves
> > a significant performance improvement.
> >> I wonder whether you had a chance to consider a mempool-cache ZC API,
> >> similar to the one we have for the ring?
> >> It would allow us, on the TX free path, to avoid copying mbufs to a
> >> temporary array on the stack.
> >> Instead we can put them straight from the TX SW ring into the mempool
> >> cache.
> >> That should save an extra store/load per mbuf and might help to achieve
> >> some performance gain without by-passing the mempool.
> >> It probably wouldn't be as fast as what you are proposing,
> >> but might be fast enough to consider as an alternative.
> >> Again, it would be a generic solution, so we can avoid all
> >> these implications and limitations.
> >
> > [Feifei] I think this is a good thing to try. However, the most important
> > question is whether we can bypass the mempool decisively to pursue
> > significant performance gains.
> 
> I understand the intention, and I personally think this is a wrong
> and dangerous attitude.
> We have the mempool abstraction in place for a very good reason.

Yes, but the abstraction is being violated grossly elsewhere, and mempool code is copy-pasted elsewhere too.

A good example of the current situation is [1]. The cache multiplier (a definition private to the mempool library) is required for some copy-pasted code, and the solution is to expose the private definition and make it part of the public API.

[1] http://inbox.dpdk.org/dev/DM4PR12MB53893BF4C7861068FE8A943BDFFE9@DM4PR12MB5389.namprd12.prod.outlook.com/

The game of abstraction has already been lost. Performance won. :-(

Since we allow bypassing the mbuf/mempool library for other features, it should be allowed for this feature too.

I would even say: Why are the drivers using the mempool library, and not the mbuf library, when freeing and allocating mbufs? This behavior bypasses all the debug assertions in the mbuf library.
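
(To illustrate the distinction -- a sketch only, with made-up wrapper names;
only rte_mempool_put_bulk() and rte_mbuf_raw_free() are real API calls.)

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Free 'n' already-detached, direct mbufs, two ways. */
static void
free_via_mempool(struct rte_mempool *mp, struct rte_mbuf **mbufs, unsigned int n)
{
	/* Straight to the pool: no mbuf-level debug assertions are run. */
	rte_mempool_put_bulk(mp, (void **)mbufs, n);
}

static void
free_via_mbuf_lib(struct rte_mbuf **mbufs, unsigned int n)
{
	unsigned int i;

	/* The mbuf-library path runs the mbuf sanity checks
	 * (refcnt == 1, not indirect, not chained) when debug is enabled. */
	for (i = 0; i < n; i++)
		rte_mbuf_raw_free(mbufs[i]);
}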

As you can probably see, I'm certainly not happy about the abstraction violations in DPDK. But they have been allowed for similar features, so they should be allowed here too.

> So we need to try to improve mempool performance (and the API, if
> necessary) in the first place, not to avoid it and break our own rules
> and recommendations.
> 
> 
> > For ZC, there may be a problem with it in i40e. The reason Tx buffers
> > are put into a temporary array is that i40e_tx_entry includes both the
> > buffer pointer and an index.
> > Thus we cannot put a Tx SW-ring entry into the mempool directly; we
> > first need to extract the mbuf pointer. So even though we use ZC, we
> > still can't avoid using a temporary array on the stack to extract the
> > Tx buffer pointers.
> 
> When talking about a ZC API for the mempool cache, I meant something like:
> 
> void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> 	uint32_t *nb_elem, uint32_t flags);
> void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
> 	uint32_t nb_elem);
> 
> i.e. _start_ will return to the user a pointer inside the mp-cache where
> the free elems can be put, plus the max number of slots that can be
> safely filled.
> _finish_ will update mc->len.
> As an example:
> 
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> 
> /* free up to n elems */
> for (i = 0; i != n; i++) {
> 
>    /* get next free mbuf from somewhere */
>    mb = extract_and_prefree_mbuf(...);
> 
>    /* no more free mbufs for now */
>    if (mb == NULL)
>       break;
> 
>    p[i] = mb;
> }
> 
> /* finalize ZC put, with _i_ freed elems */
> mempool_cache_put_zc_finish(mc, i);
> 
> That way, I think we can overcome the issue with i40e_tx_entry
> you mentioned above. Plus it might be useful in other similar places.

Great example. This would fit perfectly into the i40e driver, if it didn't already implement the exact same thing by accessing the mempool cache structure directly. :-(
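
(For context, the pattern referred to here looks roughly like the following
-- a simplified paraphrase of the i40e AVX512 Tx free path, not a verbatim
copy, and it omits the NULL-cache and cache-overflow handling the real code
has.)

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static void
tx_free_into_lcore_cache(struct rte_mempool *mp, struct rte_mbuf **mbufs,
		unsigned int n)
{
	unsigned int i;
	struct rte_mempool_cache *cache =
		rte_mempool_default_cache(mp, rte_lcore_id());
	void **cache_objs = &cache->objs[cache->len];

	/* Copy the freed mbuf pointers straight into the per-lcore cache. */
	for (i = 0; i < n; i++)
		cache_objs[i] = mbufs[i];
	cache->len += n;

	/* Flush the excess back to the backing pool when the cache overflows. */
	if (cache->len >= cache->flushthresh) {
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
				cache->len - cache->size);
		cache->len = cache->size;
	}
}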

BTW: I tried patching the mempool library to fix some asymmetry bugs [2], but couldn't get any ACKs for it. It seems to me that the community is too risk averse to dare to modify such a core library.

[2] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86FBB@smartserver.smartshare.dk/

However, adding your mempool ZC feature is not a modification, but an addition, so it should be able to gather support.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
  2022-05-24  1:25   ` Konstantin Ananyev
  2022-05-24 12:40     ` Morten Brørup
@ 2022-05-24 20:14     ` Honnappa Nagarahalli
  1 sibling, 0 replies; 27+ messages in thread
From: Honnappa Nagarahalli @ 2022-05-24 20:14 UTC (permalink / raw)
  To: Konstantin Ananyev, Feifei Wang
  Cc: nd, dev, Ruifeng Wang, honnappanagarahalli, nd

<snip>

> 
> On 16/05/2022 07:10, Feifei Wang wrote:
> >
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers resulting in 32*8=256B of
> >>> stores to lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of load from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache. i.e. the receive side will free the buffers from
> >>> transmit side directly into it's software ring. This will avoid the
> >>> 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
> >>>
> >>> However, this solution poses several constraints:
> >>>
> >>> 1)The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have
> >>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>> However, if this is generalized to 'N' ports, the configuration can
> >>> be long. More over the PMD would have to scan a list of transmit
> >>> queues to pull the buffers from.
> >
> >> Just to re-iterate some generic concerns about this proposal:
> >>   - We effectively link RX and TX queues - when this feature is enabled,
> >>     user can't stop TX queue without stopping linked RX queue first.
> >>     Right now user is free to start/stop any queues at his will.
> >>     If that feature will allow to link queues from different ports,
> >>     then even ports will become dependent and user will have to pay extra
> >>     care when managing such ports.
> >
> > [Feifei] When direct rearm is enabled, there are two paths for the
> > thread to choose from. If there are enough freed Tx buffers, Rx can
> > take buffers from Tx.
> > Otherwise, Rx will take buffers from the mempool as usual. Thus, users
> > do not need to pay much attention to managing ports.
> 
> What I am talking about: right now different ports, or different queues of
> the same port, can be treated as independent entities:
> in general the user is free to start/stop (and even reconfigure in some
> cases) one entity without the need to stop another entity.
> I.e. the user can stop and re-configure a TX queue while continuing to
> receive packets from an RX queue.
> With direct re-arm enabled, I think that wouldn't be possible any more:
> before stopping/reconfiguring a TX queue the user would have to make sure
> that the corresponding RX queue isn't being used by the datapath.
I am trying to understand the problem better. For the TX queue to be stopped, the user must have blocked the data plane from accessing the TX queue. Like Feifei says, the RX side still has the normal packet allocation path available.
Also, this sounds like a corner case to me; we can handle it through checks in the queue_stop API.
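
(A minimal sketch of such a check -- hypothetical: the rearm_paired_rxq[]
bookkeeping and the wrapper are invented for illustration; only the ethdev
queue-state fields and i40e_dev_tx_queue_stop() are existing symbols.)

#include <errno.h>
#include <ethdev_driver.h>   /* PMD-internal ethdev definitions */
#include "i40e_rxtx.h"       /* driver-internal: i40e_dev_tx_queue_stop() */

/* Hypothetical Tx->Rx pairing recorded when the direct-rearm mapping is set. */
static uint16_t rearm_paired_rxq[RTE_MAX_QUEUES_PER_PORT];

static int
i40e_tx_queue_stop_guarded(struct rte_eth_dev *dev, uint16_t tx_queue_id)
{
	uint16_t rxq_id = rearm_paired_rxq[tx_queue_id];

	/* Refuse to stop a Tx queue that still feeds a started Rx queue. */
	if (rxq_id < dev->data->nb_rx_queues &&
	    dev->data->rx_queue_state[rxq_id] == RTE_ETH_QUEUE_STATE_STARTED)
		return -EBUSY;

	return i40e_dev_tx_queue_stop(dev, tx_queue_id);
}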

> 
> >
> >> - very limited usage scenario - it will have a positive effect only
> >>    when we have a fixed forwarding mapping: all (or nearly all) packets
> >>    from the RX queue are forwarded into the same TX queue.
> >
> > [Feifei] Although the usage scenario is limited, this usage scenario
> > has a wide range of applications, such as a NIC with one port.
> 
> yes, there are NICs with one port, but there is no guarantee there won't be
> several such NICs within the system.
What I see in my interactions is that a single NIC/DPU is under-utilized in a 2-socket system. Some are adding more sockets to the system to better utilize the DPU. NIC bandwidth continues to grow significantly. I do not think there will be a multi-DPU-per-server scenario.

> 
> > Furthermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve the best performance, which means we need to give up
> > some flexibility decisively. Take 'FAST_FREE mode' as an example: it
> > removes most of the buffer checks (refcnt > 1, external buffer, chained
> > buffer), chooses a shorter path, and thereby achieves a significant
> > performance improvement.
> >> I wonder whether you had a chance to consider a mempool-cache ZC API,
> >> similar to the one we have for the ring?
> >> It would allow us, on the TX free path, to avoid copying mbufs to a
> >> temporary array on the stack.
> >> Instead we can put them straight from the TX SW ring into the mempool cache.
> >> That should save an extra store/load per mbuf and might help to achieve
> >> some performance gain without by-passing the mempool.
> >> It probably wouldn't be as fast as what you are proposing, but might be
> >> fast enough to consider as an alternative.
> >> Again, it would be a generic solution, so we can avoid all these
> >> implications and limitations.
> >
> > [Feifei] I think this is a good thing to try. However, the most important
> > question is whether we can bypass the mempool decisively to pursue
> > significant performance gains.
> 
> I understand the intention, and I personally think this is a wrong and
> dangerous attitude.
> We have the mempool abstraction in place for a very good reason.
> So we need to try to improve mempool performance (and the API, if necessary)
> in the first place, not to avoid it and break our own rules and
> recommendations.
The abstraction can be thought of at a higher level, i.e. the driver manages the buffer allocation/free, and this is hidden from the application. The application does not need to be aware of how these changes are implemented.

> 
> 
> > For ZC, there may be a problem with it in i40e. The reason Tx buffers
> > are put into a temporary array is that i40e_tx_entry includes both the
> > buffer pointer and an index.
> > Thus we cannot put a Tx SW-ring entry into the mempool directly; we
> > first need to extract the mbuf pointer. So even though we use ZC, we
> > still can't avoid using a temporary array on the stack to extract the
> > Tx buffer pointers.
> 
> When talking about a ZC API for the mempool cache, I meant something like:
> 
> void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> 	uint32_t *nb_elem, uint32_t flags);
> void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
> 	uint32_t nb_elem);
> 
> i.e. _start_ will return to the user a pointer inside the mp-cache where
> the free elems can be put, plus the max number of slots that can be
> safely filled.
> _finish_ will update mc->len.
> As an example:
> 
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> 
> /* free up to n elems */
> for (i = 0; i != n; i++) {
> 
>    /* get next free mbuf from somewhere */
>    mb = extract_and_prefree_mbuf(...);
> 
>    /* no more free mbufs for now */
>    if (mb == NULL)
>       break;
> 
>    p[i] = mb;
> }
> 
> /* finalize ZC put, with _i_ freed elems */
> mempool_cache_put_zc_finish(mc, i);
> 
> That way, I think we can overcome the issue with i40e_tx_entry you
> mentioned above. Plus it might be useful in other similar places.
> 
> Another alternative is obviously to split i40e_tx_entry into two structs (one
> for the mbuf pointer, a second for its metadata) and have a separate array
> for each of them.
> Though with that approach we need to make sure no perf drops are
> introduced, plus probably more code changes will be required.
Commit 5171b4ee6b6 already does this (in a different way), but just for AVX512. Unfortunately, it does not record any performance improvements. We could port this to Arm NEON and look at the performance.

> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-05-24 20:14 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-20  8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
2022-04-20  8:16 ` [PATCH v1 1/5] net/i40e: remove redundant Dtype initialization Feifei Wang
2022-04-20  8:16 ` [PATCH v1 2/5] net/i40e: enable direct rearm mode Feifei Wang
2022-05-11 22:28   ` Konstantin Ananyev
2022-04-20  8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
2022-04-20  9:59   ` Morten Brørup
2022-04-29  2:42     ` 回复: " Feifei Wang
2022-04-20 10:41   ` Andrew Rybchenko
2022-04-29  6:28     ` 回复: " Feifei Wang
2022-05-10 22:49       ` Honnappa Nagarahalli
2022-04-20 10:50   ` Jerin Jacob
2022-05-02  3:09     ` 回复: " Feifei Wang
2022-04-21 14:57   ` Stephen Hemminger
2022-04-29  6:35     ` 回复: " Feifei Wang
2022-04-20  8:16 ` [PATCH v1 4/5] net/i40e: add direct rearm mode internal API Feifei Wang
2022-05-11 22:31   ` Konstantin Ananyev
2022-04-20  8:16 ` [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode Feifei Wang
2022-04-20 10:10   ` Morten Brørup
2022-04-21  2:35     ` Honnappa Nagarahalli
2022-04-21  6:40       ` Morten Brørup
2022-05-10 22:01         ` Honnappa Nagarahalli
2022-05-11  7:17           ` Morten Brørup
2022-05-11 22:33   ` Konstantin Ananyev
2022-05-11 23:00 ` [PATCH v1 0/5] Direct re-arming of buffers on receive side Konstantin Ananyev
     [not found] ` <20220516061012.618787-1-feifei.wang2@arm.com>
2022-05-24  1:25   ` Konstantin Ananyev
2022-05-24 12:40     ` Morten Brørup
2022-05-24 20:14     ` Honnappa Nagarahalli
