* [PATCH v1 0/5] Direct re-arming of buffers on receive side
@ 2022-04-20 8:16 Feifei Wang
2022-04-20 8:16 ` [PATCH v1 1/5] net/i40e: remove redundant Dtype initialization Feifei Wang
` (13 more replies)
0 siblings, 14 replies; 145+ messages in thread
From: Feifei Wang @ 2022-04-20 8:16 UTC (permalink / raw)
Cc: dev, nd, Feifei Wang
Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.
This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache: the receive side frees the buffers from the transmit
side directly into its own software ring. This avoids the 256B of loads
and stores introduced by the lcore cache, and it also frees up the cache
lines used by the lcore cache.
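As a rough sketch of the idea (simplified scalar form only; the i40e
patches below implement vectorized variants, and the names follow the
i40e sw-ring structures used later in this series), the receive side
simply reuses the Tx software-ring entries as its own:

	/* n equals txq->tx_rs_thresh, the batch just freed by the Tx side */
	for (i = 0; i < n; i++)
		rxep[i].mbuf = txep[i].mbuf; /* Tx sw-ring entry becomes Rx buffer */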
However, this solution poses several constraints:
1)The receive queue needs to know which transmit queue it should take
the buffers from. The application logic decides which transmit port to
use to send out the packets. In many use cases the NIC might have a
single port ([1], [2], [3]), in which case a given transmit queue is
always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
is easy to configure.
If the NIC has 2 ports (there are several references), then we will have
a 1:2 (Rx queue : Tx queue) mapping, which is still easy to configure.
However, if this is generalized to 'N' ports, the configuration can
become long. Moreover, the PMD would have to scan a list of transmit
queues to pull the buffers from.
2)The other factor that needs to be considered is the 'run-to-completion'
vs 'pipeline' model. In the run-to-completion model, the receive side and
the transmit side run serially on the same lcore. In the pipeline model,
the receive side and the transmit side might run on different lcores in
parallel. That would require locking, which is not supported at this
point.
3)Tx and Rx buffers must come from the same mempool, and the number of
buffers freed on the Tx side must equal the number re-armed on the Rx
side:
(txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH); see the setup sketch
after this list.
This ensures 'tx_next_dd' is updated correctly in direct-rearm mode,
since tx_next_dd is the variable used to compute the Tx sw-ring free
location; its value stays one full threshold ahead of the position where
the next free starts.
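For example, the application can enforce this constraint when setting up
the Tx queue (a minimal sketch, assuming the i40e rearm threshold of 32
and that the default Tx configuration is otherwise suitable; port_id,
queue_id, nb_txd and socket_id are placeholders):

	struct rte_eth_dev_info dev_info;
	struct rte_eth_txconf txconf;

	rte_eth_dev_info_get(port_id, &dev_info);
	txconf = dev_info.default_txconf;
	txconf.tx_rs_thresh = 32; /* assumed: must equal RTE_I40E_RXQ_REARM_THRESH */
	rte_eth_tx_queue_setup(port_id, queue_id, nb_txd, socket_id, &txconf);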
Current status in this RFC:
1)An API is added to allow mapping a Tx queue to an Rx queue.
Currently it supports 1:1 mapping.
2)The i40e driver is changed to do the direct re-arm of the receive
side.
3)The l3fwd application is modified to set up the direct-rearm mapping
automatically, without user configuration. The rule is that a thread
maps a Tx queue to an Rx queue based on the destination port of the
first received packet (see the usage sketch after this list).
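A minimal usage sketch of the mapping API from patch 3/5, as applied by
the modified l3fwd in patch 5/5 (rx_port, rx_queue and the surrounding
l3fwd variables are placeholders taken from that patch):

	/* after the first burst on (rx_port, rx_queue), derive the Tx port
	 * from the first packet and bind that Tx queue as the buffer source
	 */
	tx_port = lpm_get_dst_port(qconf, pkts_burst[0], rx_port);
	rte_eth_direct_rxrearm_map(rx_port, rx_queue, tx_port, rx_queue);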
Testing status:
1. The testing results for L3fwd, with direct rearm enabled, are as
follows:
-------------------------------------------------------------------
                                 without fast-free mode   with fast-free mode
Arm:
  N1SDP (neon path):                    +14.1%                  +7.0%
  Ampere Altra (neon path):             +17.1%                 +14.0%
X86:
  Dell-8268 (limited frequency):
    sse path:                            +6.96%                 +2.02%
    avx2 path:                           +9.04%                 +7.75%
    avx512 path:                         +5.43%                 +1.57%
-------------------------------------------------------------------
This patch does not affect the base performance of the normal mode.
Furthermore, the reason for limiting the CPU frequency is that the
Dell-8268 otherwise hits an i40e NIC bottleneck at maximum frequency.
2. The testing results for VPP-L3fwd are as follows:
-------------------------------------------------------------------
Arm:
  N1SDP (neon path):
    with direct re-arm mode enabled:    +7.0%
-------------------------------------------------------------------
For Ampere Altra and X86, the VPP-L3fwd test has not been done.
Reference:
[1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
[2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
[3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g
Feifei Wang (5):
net/i40e: remove redundant Dtype initialization
net/i40e: enable direct rearm mode
ethdev: add API for direct rearm mode
net/i40e: add direct rearm mode internal API
examples/l3fwd: enable direct rearm mode
drivers/net/i40e/i40e_ethdev.c | 34 +++
drivers/net/i40e/i40e_rxtx.c | 4 -
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
drivers/net/i40e/i40e_rxtx_vec_avx2.c | 14 +-
drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
drivers/net/i40e/i40e_rxtx_vec_neon.c | 141 ++++++++++++-
drivers/net/i40e/i40e_rxtx_vec_sse.c | 170 ++++++++++++++-
examples/l3fwd/l3fwd_lpm.c | 16 +-
lib/ethdev/ethdev_driver.h | 15 ++
lib/ethdev/rte_ethdev.c | 14 ++
lib/ethdev/rte_ethdev.h | 31 +++
lib/ethdev/version.map | 1 +
13 files changed, 949 insertions(+), 13 deletions(-)
--
2.25.1
* [PATCH v1 1/5] net/i40e: remove redundant Dtype initialization
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
@ 2022-04-20 8:16 ` Feifei Wang
2022-04-20 8:16 ` [PATCH v1 2/5] net/i40e: enable direct rearm mode Feifei Wang
` (12 subsequent siblings)
13 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2022-04-20 8:16 UTC (permalink / raw)
To: Beilei Xing; +Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
The Dtype field is set to 0xf by the NIC to indicate DMA completion, only
after the CPU requests to be informed by setting the RS bit. Hence, it is
not required to set Dtype to 0xf during initialization.
Not setting the Dtype field to 0xf helps to know that a given descriptor
has not yet been sent to the NIC after initialization.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/i40e/i40e_rxtx.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 25a28ecea2..745734d5e4 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -2767,10 +2767,6 @@ i40e_reset_tx_queue(struct i40e_tx_queue *txq)
prev = (uint16_t)(txq->nb_tx_desc - 1);
for (i = 0; i < txq->nb_tx_desc; i++) {
- volatile struct i40e_tx_desc *txd = &txq->tx_ring[i];
-
- txd->cmd_type_offset_bsz =
- rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE);
txe[i].mbuf = NULL;
txe[i].last_id = i;
txe[prev].next_id = i;
--
2.25.1
* [PATCH v1 2/5] net/i40e: enable direct rearm mode
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
2022-04-20 8:16 ` [PATCH v1 1/5] net/i40e: remove redundant Dtype initialization Feifei Wang
@ 2022-04-20 8:16 ` Feifei Wang
2022-05-11 22:28 ` Konstantin Ananyev
2022-04-20 8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
` (11 subsequent siblings)
13 siblings, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2022-04-20 8:16 UTC (permalink / raw)
To: Beilei Xing, Bruce Richardson, Konstantin Ananyev, Ruifeng Wang
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli
For the i40e driver, enable direct re-arm mode. This patch supports the
case where the mapped Rx and Tx queues are handled by the same single
lcore.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
drivers/net/i40e/i40e_rxtx_vec_avx2.c | 14 +-
drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
drivers/net/i40e/i40e_rxtx_vec_neon.c | 141 ++++++++++++-
drivers/net/i40e/i40e_rxtx_vec_sse.c | 170 ++++++++++++++-
6 files changed, 839 insertions(+), 8 deletions(-)
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 5e6eecc501..1fdf4305f4 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -102,6 +102,8 @@ struct i40e_rx_queue {
uint16_t rxrearm_nb; /**< number of remaining to be re-armed */
uint16_t rxrearm_start; /**< the idx we start the re-arming from */
+ uint16_t direct_rxrearm_port; /** device TX port ID for direct re-arm mode */
+ uint16_t direct_rxrearm_queue; /** TX queue index for direct re-arm mode */
uint64_t mbuf_initializer; /**< value to init mbufs */
uint16_t port_id; /**< device port ID */
@@ -121,6 +123,8 @@ struct i40e_rx_queue {
uint16_t rx_using_sse; /**<flag indicate the usage of vPMD for rx */
uint8_t dcb_tc; /**< Traffic class of rx queue */
uint64_t offloads; /**< Rx offload flags of RTE_ETH_RX_OFFLOAD_* */
+ /**< 0 if direct re-arm mode disabled, 1 when enabled */
+ bool direct_rxrearm_enable;
const struct rte_memzone *mz;
};
diff --git a/drivers/net/i40e/i40e_rxtx_common_avx.h b/drivers/net/i40e/i40e_rxtx_common_avx.h
index cfc1e63173..a742723e07 100644
--- a/drivers/net/i40e/i40e_rxtx_common_avx.h
+++ b/drivers/net/i40e/i40e_rxtx_common_avx.h
@@ -209,6 +209,275 @@ i40e_rxq_rearm_common(struct i40e_rx_queue *rxq, __rte_unused bool avx512)
/* Update the tail pointer on the NIC */
I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
}
+
+static __rte_always_inline void
+i40e_rxq_direct_rearm_common(struct i40e_rx_queue *rxq, __rte_unused bool avx512)
+{
+ struct rte_eth_dev *dev;
+ struct i40e_tx_queue *txq;
+ volatile union i40e_rx_desc *rxdp;
+ struct i40e_tx_entry *txep;
+ struct i40e_rx_entry *rxep;
+ struct rte_mbuf *m[RTE_I40E_RXQ_REARM_THRESH];
+ uint16_t tx_port_id, tx_queue_id;
+ uint16_t rx_id;
+ uint16_t i, n;
+ uint16_t nb_rearm = 0;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ tx_port_id = rxq->direct_rxrearm_port;
+ tx_queue_id = rxq->direct_rxrearm_queue;
+ dev = &rte_eth_devices[tx_port_id];
+ txq = dev->data->tx_queues[tx_queue_id];
+
+ /* check Rx queue is able to take in the whole
+ * batch of free mbufs from Tx queue
+ */
+ if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
+ goto mempool_bulk;
+ }
+
+ if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
+ goto mempool_bulk;
+
+ n = txq->tx_rs_thresh;
+
+ /* first buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1)
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* directly put mbufs from Tx to Rx,
+ * and initialize the mbufs in vector
+ */
+ for (i = 0; i < n; i++)
+ rxep[i].mbuf = txep[i].mbuf;
+ } else {
+ for (i = 0; i < n; i++) {
+ m[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+ /* ensure each Tx freed buffer is valid */
+ if (m[i] != NULL)
+ nb_rearm++;
+ }
+
+ if (nb_rearm != n) {
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ goto mempool_bulk;
+ } else {
+ for (i = 0; i < n; i++)
+ rxep[i].mbuf = m[i];
+ }
+ }
+
+ /* update counters for Tx */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+ } else {
+mempool_bulk:
+ /* if TX did not free bufs into Rx sw-ring,
+ * get new bufs from mempool
+ */
+ n = RTE_I40E_RXQ_REARM_THRESH;
+
+ /* Pull 'n' more MBUFs into the software ring */
+ if (rte_mempool_get_bulk(rxq->mp,
+ (void *)rxep,
+ RTE_I40E_RXQ_REARM_THRESH) < 0) {
+ if (rxq->rxrearm_nb + RTE_I40E_RXQ_REARM_THRESH >=
+ rxq->nb_rx_desc) {
+ __m128i dma_addr0;
+ dma_addr0 = _mm_setzero_si128();
+ for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
+ rxep[i].mbuf = &rxq->fake_mbuf;
+ _mm_store_si128((__m128i *)&rxdp[i].read,
+ dma_addr0);
+ }
+ }
+ rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_I40E_RXQ_REARM_THRESH;
+ return;
+ }
+ }
+
+#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
+ struct rte_mbuf *mb0, *mb1;
+ __m128i dma_addr0, dma_addr1;
+ __m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+ RTE_PKTMBUF_HEADROOM);
+ /* Initialize the mbufs in vector, process 2 mbufs in one loop */
+ for (i = 0; i < n; i += 2, rxep += 2) {
+ __m128i vaddr0, vaddr1;
+
+ mb0 = rxep[0].mbuf;
+ mb1 = rxep[1].mbuf;
+
+ /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+ RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+ offsetof(struct rte_mbuf, buf_addr) + 8);
+ vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+ vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+ /* convert pa to dma_addr hdr/data */
+ dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+ dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+ /* add headroom to pa values */
+ dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+ dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+ /* flush desc with pa dma_addr */
+ _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+ _mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+ }
+#else
+#ifdef __AVX512VL__
+ if (avx512) {
+ struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+ struct rte_mbuf *mb4, *mb5, *mb6, *mb7;
+ __m512i dma_addr0_3, dma_addr4_7;
+ __m512i hdr_room = _mm512_set1_epi64(RTE_PKTMBUF_HEADROOM);
+ /* Initialize the mbufs in vector, process 8 mbufs in one loop */
+ for (i = 0; i < n; i += 8, rxep += 8, rxdp += 8) {
+ __m128i vaddr0, vaddr1, vaddr2, vaddr3;
+ __m128i vaddr4, vaddr5, vaddr6, vaddr7;
+ __m256i vaddr0_1, vaddr2_3;
+ __m256i vaddr4_5, vaddr6_7;
+ __m512i vaddr0_3, vaddr4_7;
+
+ mb0 = rxep[0].mbuf;
+ mb1 = rxep[1].mbuf;
+ mb2 = rxep[2].mbuf;
+ mb3 = rxep[3].mbuf;
+ mb4 = rxep[4].mbuf;
+ mb5 = rxep[5].mbuf;
+ mb6 = rxep[6].mbuf;
+ mb7 = rxep[7].mbuf;
+
+ /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+ RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+ offsetof(struct rte_mbuf, buf_addr) + 8);
+ vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+ vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+ vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+ vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+ vaddr4 = _mm_loadu_si128((__m128i *)&mb4->buf_addr);
+ vaddr5 = _mm_loadu_si128((__m128i *)&mb5->buf_addr);
+ vaddr6 = _mm_loadu_si128((__m128i *)&mb6->buf_addr);
+ vaddr7 = _mm_loadu_si128((__m128i *)&mb7->buf_addr);
+
+ /**
+ * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+ * into the high lanes. Similarly for 2 & 3, and so on.
+ */
+ vaddr0_1 =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(vaddr0),
+ vaddr1, 1);
+ vaddr2_3 =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(vaddr2),
+ vaddr3, 1);
+ vaddr4_5 =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(vaddr4),
+ vaddr5, 1);
+ vaddr6_7 =
+ _mm256_inserti128_si256(_mm256_castsi128_si256(vaddr6),
+ vaddr7, 1);
+ vaddr0_3 =
+ _mm512_inserti64x4(_mm512_castsi256_si512(vaddr0_1),
+ vaddr2_3, 1);
+ vaddr4_7 =
+ _mm512_inserti64x4(_mm512_castsi256_si512(vaddr4_5),
+ vaddr6_7, 1);
+
+ /* convert pa to dma_addr hdr/data */
+ dma_addr0_3 = _mm512_unpackhi_epi64(vaddr0_3, vaddr0_3);
+ dma_addr4_7 = _mm512_unpackhi_epi64(vaddr4_7, vaddr4_7);
+
+ /* add headroom to pa values */
+ dma_addr0_3 = _mm512_add_epi64(dma_addr0_3, hdr_room);
+ dma_addr4_7 = _mm512_add_epi64(dma_addr4_7, hdr_room);
+
+ /* flush desc with pa dma_addr */
+ _mm512_store_si512((__m512i *)&rxdp->read, dma_addr0_3);
+ _mm512_store_si512((__m512i *)&(rxdp + 4)->read, dma_addr4_7);
+ }
+ } else {
+#endif /* __AVX512VL__*/
+ struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
+ __m256i dma_addr0_1, dma_addr2_3;
+ __m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
+ /* Initialize the mbufs in vector, process 4 mbufs in one loop */
+ for (i = 0; i < n; i += 4, rxep += 4, rxdp += 4) {
+ __m128i vaddr0, vaddr1, vaddr2, vaddr3;
+ __m256i vaddr0_1, vaddr2_3;
+
+ mb0 = rxep[0].mbuf;
+ mb1 = rxep[1].mbuf;
+ mb2 = rxep[2].mbuf;
+ mb3 = rxep[3].mbuf;
+
+ /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+ RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+ offsetof(struct rte_mbuf, buf_addr) + 8);
+ vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+ vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+ vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
+ vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
+
+ /**
+ * merge 0 & 1, by casting 0 to 256-bit and inserting 1
+ * into the high lanes. Similarly for 2 & 3
+ */
+ vaddr0_1 = _mm256_inserti128_si256
+ (_mm256_castsi128_si256(vaddr0), vaddr1, 1);
+ vaddr2_3 = _mm256_inserti128_si256
+ (_mm256_castsi128_si256(vaddr2), vaddr3, 1);
+
+ /* convert pa to dma_addr hdr/data */
+ dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
+ dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
+
+ /* add headroom to pa values */
+ dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
+ dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
+
+ /* flush desc with pa dma_addr */
+ _mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
+ _mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
+ }
+ }
+
+#endif
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += n;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+ if (!rxq->rxrearm_start)
+ rx_id = rxq->nb_rx_desc - 1;
+ else
+ rx_id = rxq->rxrearm_start - 1;
+ }
+
+ rxq->rxrearm_nb -= n;
+
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
+}
#endif /* __AVX2__*/
#endif /*_I40E_RXTX_COMMON_AVX_H_*/
diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx2.c b/drivers/net/i40e/i40e_rxtx_vec_avx2.c
index c73b2a321b..fcb7ba0273 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_avx2.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_avx2.c
@@ -25,6 +25,12 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
return i40e_rxq_rearm_common(rxq, false);
}
+static __rte_always_inline void
+i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
+{
+ return i40e_rxq_direct_rearm_common(rxq, false);
+}
+
#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
/* Handles 32B descriptor FDIR ID processing:
* rxdp: receive descriptor ring, required to load 2nd 16B half of each desc
@@ -128,8 +134,12 @@ _recv_raw_pkts_vec_avx2(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
/* See if we need to rearm the RX queue - gives the prefetch a bit
* of time to act
*/
- if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
- i40e_rxq_rearm(rxq);
+ if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
+ if (rxq->direct_rxrearm_enable)
+ i40e_rxq_direct_rearm(rxq);
+ else
+ i40e_rxq_rearm(rxq);
+ }
/* Before we start moving massive data around, check to see if
* there is actually a packet available
diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx512.c b/drivers/net/i40e/i40e_rxtx_vec_avx512.c
index 2e8a3f0df6..d967095edc 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_avx512.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_avx512.c
@@ -21,6 +21,12 @@
#define RTE_I40E_DESCS_PER_LOOP_AVX 8
+enum i40e_direct_rearm_type_value {
+ I40E_DIRECT_REARM_TYPE_NORMAL = 0x0,
+ I40E_DIRECT_REARM_TYPE_FAST_FREE = 0x1,
+ I40E_DIRECT_REARM_TYPE_PRE_FREE = 0x2,
+};
+
static __rte_always_inline void
i40e_rxq_rearm(struct i40e_rx_queue *rxq)
{
@@ -150,6 +156,241 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
}
+static __rte_always_inline void
+i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
+{
+ struct rte_eth_dev *dev;
+ struct i40e_tx_queue *txq;
+ volatile union i40e_rx_desc *rxdp;
+ struct i40e_vec_tx_entry *txep;
+ struct i40e_rx_entry *rxep;
+ struct rte_mbuf *m[RTE_I40E_RXQ_REARM_THRESH];
+ uint16_t tx_port_id, tx_queue_id;
+ uint16_t rx_id;
+ uint16_t i, n;
+ uint16_t j = 0;
+ uint16_t nb_rearm = 0;
+ enum i40e_direct_rearm_type_value type;
+ struct rte_mempool_cache *cache = NULL;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ tx_port_id = rxq->direct_rxrearm_port;
+ tx_queue_id = rxq->direct_rxrearm_queue;
+ dev = &rte_eth_devices[tx_port_id];
+ txq = dev->data->tx_queues[tx_queue_id];
+
+ /* check Rx queue is able to take in the whole
+ * batch of free mbufs from Tx queue
+ */
+ if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
+ goto mempool_bulk;
+ }
+
+ if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
+ goto mempool_bulk;
+
+ n = txq->tx_rs_thresh;
+
+ /* first buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1)
+ */
+ txep = (void *)txq->sw_ring;
+ txep += txq->tx_next_dd - (n - 1);
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* directly put mbufs from Tx to Rx */
+ uint32_t copied = 0;
+ /* n is multiple of 32 */
+ while (copied < n) {
+ const __m512i a = _mm512_load_si512(&txep[copied]);
+ const __m512i b = _mm512_load_si512(&txep[copied + 8]);
+ const __m512i c = _mm512_load_si512(&txep[copied + 16]);
+ const __m512i d = _mm512_load_si512(&txep[copied + 24]);
+
+ _mm512_storeu_si512(&rxep[copied], a);
+ _mm512_storeu_si512(&rxep[copied + 8], b);
+ _mm512_storeu_si512(&rxep[copied + 16], c);
+ _mm512_storeu_si512(&rxep[copied + 24], d);
+ copied += 32;
+ }
+ type = I40E_DIRECT_REARM_TYPE_FAST_FREE;
+ } else {
+ for (i = 0; i < n; i++) {
+ m[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+ /* ensure each Tx freed buffer is valid */
+ if (m[i] != NULL)
+ nb_rearm++;
+ }
+
+ if (nb_rearm != n) {
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ goto mempool_bulk;
+ } else {
+ type = I40E_DIRECT_REARM_TYPE_PRE_FREE;
+ }
+ }
+
+ /* update counters for Tx */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+ } else {
+mempool_bulk:
+ cache = rte_mempool_default_cache(rxq->mp, rte_lcore_id());
+
+ if (unlikely(!cache))
+ return i40e_rxq_rearm_common(rxq, true);
+
+ n = RTE_I40E_RXQ_REARM_THRESH;
+
+ /* We need to pull 'n' more MBUFs into the software ring from mempool
+ * We inline the mempool function here, so we can vectorize the copy
+ * from the cache into the shadow ring.
+ */
+
+ if (cache->len < RTE_I40E_RXQ_REARM_THRESH) {
+ /* No. Backfill the cache first, and then fill from it */
+ uint32_t req = RTE_I40E_RXQ_REARM_THRESH + (cache->size -
+ cache->len);
+
+ /* How many do we require
+ * i.e. number to fill the cache + the request
+ */
+ int ret = rte_mempool_ops_dequeue_bulk(rxq->mp,
+ &cache->objs[cache->len], req);
+ if (ret == 0) {
+ cache->len += req;
+ } else {
+ if (rxq->rxrearm_nb + RTE_I40E_RXQ_REARM_THRESH >=
+ rxq->nb_rx_desc) {
+ __m128i dma_addr0;
+
+ dma_addr0 = _mm_setzero_si128();
+ for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
+ rxep[i].mbuf = &rxq->fake_mbuf;
+ _mm_store_si128
+ ((__m128i *)&rxdp[i].read,
+ dma_addr0);
+ }
+ }
+ rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_I40E_RXQ_REARM_THRESH;
+ return;
+ }
+ }
+
+ type = I40E_DIRECT_REARM_TYPE_NORMAL;
+ }
+
+ const __m512i iova_offsets = _mm512_set1_epi64
+ (offsetof(struct rte_mbuf, buf_iova));
+ const __m512i headroom = _mm512_set1_epi64(RTE_PKTMBUF_HEADROOM);
+
+#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
+ /* to shuffle the addresses to correct slots. Values 4-7 will contain
+ * zeros, so use 7 for a zero-value.
+ */
+ const __m512i permute_idx = _mm512_set_epi64(7, 7, 3, 1, 7, 7, 2, 0);
+#else
+ const __m512i permute_idx = _mm512_set_epi64(7, 3, 6, 2, 5, 1, 4, 0);
+#endif
+
+ __m512i mbuf_ptrs;
+
+ /* Initialize the mbufs in vector, process 8 mbufs in one loop, taking
+ * from mempool cache and populating both shadow and HW rings
+ */
+ for (i = 0; i < RTE_I40E_RXQ_REARM_THRESH / 8; i++) {
+ switch (type) {
+ case I40E_DIRECT_REARM_TYPE_FAST_FREE:
+ mbuf_ptrs = _mm512_loadu_si512(rxep);
+ break;
+ case I40E_DIRECT_REARM_TYPE_PRE_FREE:
+ mbuf_ptrs = _mm512_loadu_si512(&m[j]);
+ _mm512_store_si512(rxep, mbuf_ptrs);
+ j += 8;
+ break;
+ case I40E_DIRECT_REARM_TYPE_NORMAL:
+ mbuf_ptrs = _mm512_loadu_si512
+ (&cache->objs[cache->len - 8]);
+ _mm512_store_si512(rxep, mbuf_ptrs);
+ cache->len -= 8;
+ break;
+ }
+
+ /* gather iova of mbuf0-7 into one zmm reg */
+ const __m512i iova_base_addrs = _mm512_i64gather_epi64
+ (_mm512_add_epi64(mbuf_ptrs, iova_offsets),
+ 0, /* base */
+ 1 /* scale */);
+ const __m512i iova_addrs = _mm512_add_epi64(iova_base_addrs,
+ headroom);
+#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
+ const __m512i iovas0 = _mm512_castsi256_si512
+ (_mm512_extracti64x4_epi64(iova_addrs, 0));
+ const __m512i iovas1 = _mm512_castsi256_si512
+ (_mm512_extracti64x4_epi64(iova_addrs, 1));
+
+ /* permute leaves desc 2-3 addresses in header address slots 0-1
+ * but these are ignored by driver since header split not
+ * enabled. Similarly for desc 4 & 5.
+ */
+ const __m512i desc_rd_0_1 = _mm512_permutexvar_epi64
+ (permute_idx, iovas0);
+ const __m512i desc_rd_2_3 = _mm512_bsrli_epi128(desc_rd_0_1, 8);
+
+ const __m512i desc_rd_4_5 = _mm512_permutexvar_epi64
+ (permute_idx, iovas1);
+ const __m512i desc_rd_6_7 = _mm512_bsrli_epi128(desc_rd_4_5, 8);
+
+ _mm512_store_si512((void *)rxdp, desc_rd_0_1);
+ _mm512_store_si512((void *)(rxdp + 2), desc_rd_2_3);
+ _mm512_store_si512((void *)(rxdp + 4), desc_rd_4_5);
+ _mm512_store_si512((void *)(rxdp + 6), desc_rd_6_7);
+#else
+ /* permute leaves desc 4-7 addresses in header address slots 0-3
+ * but these are ignored by driver since header split not
+ * enabled.
+ */
+ const __m512i desc_rd_0_3 = _mm512_permutexvar_epi64
+ (permute_idx, iova_addrs);
+ const __m512i desc_rd_4_7 = _mm512_bsrli_epi128(desc_rd_0_3, 8);
+
+ _mm512_store_si512((void *)rxdp, desc_rd_0_3);
+ _mm512_store_si512((void *)(rxdp + 4), desc_rd_4_7);
+#endif
+ rxdp += 8, rxep += 8;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += n;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+ if (!rxq->rxrearm_start)
+ rx_id = rxq->nb_rx_desc - 1;
+ else
+ rx_id = rxq->rxrearm_start - 1;
+ }
+
+ rxq->rxrearm_nb -= n;
+
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
+}
+
#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
/* Handles 32B descriptor FDIR ID processing:
* rxdp: receive descriptor ring, required to load 2nd 16B half of each desc
@@ -252,8 +493,12 @@ _recv_raw_pkts_vec_avx512(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
/* See if we need to rearm the RX queue - gives the prefetch a bit
* of time to act
*/
- if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
- i40e_rxq_rearm(rxq);
+ if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
+ if (rxq->direct_rxrearm_enable)
+ i40e_rxq_direct_rearm(rxq);
+ else
+ i40e_rxq_rearm(rxq);
+ }
/* Before we start moving massive data around, check to see if
* there is actually a packet available
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index fa9e6582c5..dc78e3c90b 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -77,6 +77,139 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
}
+static inline void
+i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
+{
+ struct rte_eth_dev *dev;
+ struct i40e_tx_queue *txq;
+ volatile union i40e_rx_desc *rxdp;
+ struct i40e_tx_entry *txep;
+ struct i40e_rx_entry *rxep;
+ uint16_t tx_port_id, tx_queue_id;
+ uint16_t rx_id;
+ struct rte_mbuf *mb0, *mb1, *m;
+ uint64x2_t dma_addr0, dma_addr1;
+ uint64x2_t zero = vdupq_n_u64(0);
+ uint64_t paddr;
+ uint16_t i, n;
+ uint16_t nb_rearm = 0;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ tx_port_id = rxq->direct_rxrearm_port;
+ tx_queue_id = rxq->direct_rxrearm_queue;
+ dev = &rte_eth_devices[tx_port_id];
+ txq = dev->data->tx_queues[tx_queue_id];
+
+ /* check Rx queue is able to take in the whole
+ * batch of free mbufs from Tx queue
+ */
+ if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
+ goto mempool_bulk;
+ }
+
+ n = txq->tx_rs_thresh;
+
+ /* first buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1)
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* directly put mbufs from Tx to Rx,
+ * and initialize the mbufs in vector
+ */
+ for (i = 0; i < n; i++, rxep++, txep++) {
+ rxep[0].mbuf = txep[0].mbuf;
+
+ /* Initialize rxdp descs */
+ mb0 = txep[0].mbuf;
+
+ paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr0 = vdupq_n_u64(paddr);
+ /* flush desc with pa dma_addr */
+ vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
+ }
+ } else {
+ for (i = 0; i < n; i++) {
+ m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+ if (m != NULL) {
+ rxep[i].mbuf = m;
+
+ /* Initialize rxdp descs */
+ paddr = m->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr0 = vdupq_n_u64(paddr);
+ /* flush desc with pa dma_addr */
+ vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
+ nb_rearm++;
+ }
+ }
+ n = nb_rearm;
+ }
+
+ /* update counters for Tx */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+ } else {
+mempool_bulk:
+ /* if TX did not free bufs into Rx sw-ring,
+ * get new bufs from mempool
+ */
+ n = RTE_I40E_RXQ_REARM_THRESH;
+ if (unlikely(rte_mempool_get_bulk(rxq->mp, (void *)rxep, n) < 0)) {
+ if (rxq->rxrearm_nb + n >= rxq->nb_rx_desc) {
+ for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
+ rxep[i].mbuf = &rxq->fake_mbuf;
+ vst1q_u64((uint64_t *)&rxdp[i].read, zero);
+ }
+ }
+ rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed += n;
+ return;
+ }
+
+ /* Initialize the mbufs in vector, process 2 mbufs in one loop */
+ for (i = 0; i < n; i += 2, rxep += 2) {
+ mb0 = rxep[0].mbuf;
+ mb1 = rxep[1].mbuf;
+
+ paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr0 = vdupq_n_u64(paddr);
+ /* flush desc with pa dma_addr */
+ vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
+
+ paddr = mb1->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr1 = vdupq_n_u64(paddr);
+ /* flush desc with pa dma_addr */
+ vst1q_u64((uint64_t *)&rxdp++->read, dma_addr1);
+ }
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += n;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+ if (!rxq->rxrearm_start)
+ rx_id = rxq->nb_rx_desc - 1;
+ else
+ rx_id = rxq->rxrearm_start - 1;
+ }
+
+ rxq->rxrearm_nb -= n;
+
+ rte_io_wmb();
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
/* NEON version of FDIR mark extraction for 4 32B descriptors at a time */
static inline uint32x4_t
@@ -381,8 +514,12 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
/* See if we need to rearm the RX queue - gives the prefetch a bit
* of time to act
*/
- if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
- i40e_rxq_rearm(rxq);
+ if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
+ if (rxq->direct_rxrearm_enable)
+ i40e_rxq_direct_rearm(rxq);
+ else
+ i40e_rxq_rearm(rxq);
+ }
/* Before we start moving massive data around, check to see if
* there is actually a packet available
diff --git a/drivers/net/i40e/i40e_rxtx_vec_sse.c b/drivers/net/i40e/i40e_rxtx_vec_sse.c
index 3782e8052f..b2f1ab2c8d 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_sse.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_sse.c
@@ -89,6 +89,168 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
}
+static inline void
+i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
+{
+ struct rte_eth_dev *dev;
+ struct i40e_tx_queue *txq;
+ volatile union i40e_rx_desc *rxdp;
+ struct i40e_tx_entry *txep;
+ struct i40e_rx_entry *rxep;
+ uint16_t tx_port_id, tx_queue_id;
+ uint16_t rx_id;
+ struct rte_mbuf *mb0, *mb1, *m;
+ __m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
+ RTE_PKTMBUF_HEADROOM);
+ __m128i dma_addr0, dma_addr1;
+ __m128i vaddr0, vaddr1;
+ uint16_t i, n;
+ uint16_t nb_rearm = 0;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ tx_port_id = rxq->direct_rxrearm_port;
+ tx_queue_id = rxq->direct_rxrearm_queue;
+ dev = &rte_eth_devices[tx_port_id];
+ txq = dev->data->tx_queues[tx_queue_id];
+
+ /* check Rx queue is able to take in the whole
+ * batch of free mbufs from Tx queue
+ */
+ if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
+ goto mempool_bulk;
+ }
+
+ n = txq->tx_rs_thresh;
+
+ /* first buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1)
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* directly put mbufs from Tx to Rx,
+ * and initialize the mbufs in vector
+ */
+ for (i = 0; i < n; i++, rxep++, txep++) {
+ rxep[0].mbuf = txep[0].mbuf;
+
+ /* Initialize rxdp descs */
+ mb0 = txep[0].mbuf;
+
+ /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+ RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+ offsetof(struct rte_mbuf, buf_addr) + 8);
+ vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+
+ /* convert pa to dma_addr hdr/data */
+ dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+
+ /* add headroom to pa values */
+ dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+
+ /* flush desc with pa dma_addr */
+ _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+ }
+ } else {
+ for (i = 0; i < n; i++) {
+ m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+ if (m != NULL) {
+ rxep[i].mbuf = m;
+
+ /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+ RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+ offsetof(struct rte_mbuf, buf_addr) + 8);
+ vaddr0 = _mm_loadu_si128((__m128i *)&m->buf_addr);
+
+ /* convert pa to dma_addr hdr/data */
+ dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+
+ /* add headroom to pa values */
+ dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+
+ /* flush desc with pa dma_addr */
+ _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+ nb_rearm++;
+ }
+ }
+ n = nb_rearm;
+ }
+
+ /* update counters for Tx */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+ } else {
+mempool_bulk:
+ /* if TX did not free bufs into Rx sw-ring,
+ * get new bufs from mempool
+ */
+ n = RTE_I40E_RXQ_REARM_THRESH;
+ /* Pull 'n' more MBUFs into the software ring */
+ if (rte_mempool_get_bulk(rxq->mp, (void *)rxep, n) < 0) {
+ if (rxq->rxrearm_nb + n >= rxq->nb_rx_desc) {
+ dma_addr0 = _mm_setzero_si128();
+ for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
+ rxep[i].mbuf = &rxq->fake_mbuf;
+ _mm_store_si128((__m128i *)&rxdp[i].read,
+ dma_addr0);
+ }
+ }
+ rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
+ RTE_I40E_RXQ_REARM_THRESH;
+ return;
+ }
+
+ /* Initialize the mbufs in vector, process 2 mbufs in one loop */
+ for (i = 0; i < RTE_I40E_RXQ_REARM_THRESH; i += 2, rxep += 2) {
+ mb0 = rxep[0].mbuf;
+ mb1 = rxep[1].mbuf;
+
+ /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
+ RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
+ offsetof(struct rte_mbuf, buf_addr) + 8);
+ vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
+ vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
+
+ /* convert pa to dma_addr hdr/data */
+ dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
+ dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
+
+ /* add headroom to pa values */
+ dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+ dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
+
+ /* flush desc with pa dma_addr */
+ _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
+ _mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
+ }
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += n;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+ if (!rxq->rxrearm_start)
+ rx_id = rxq->nb_rx_desc - 1;
+ else
+ rx_id = rxq->rxrearm_start - 1;
+ }
+
+ rxq->rxrearm_nb -= n;
+
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
/* SSE version of FDIR mark extraction for 4 32B descriptors at a time */
static inline __m128i
@@ -394,8 +556,12 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
/* See if we need to rearm the RX queue - gives the prefetch a bit
* of time to act
*/
- if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
- i40e_rxq_rearm(rxq);
+ if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
+ if (rxq->direct_rxrearm_enable)
+ i40e_rxq_direct_rearm(rxq);
+ else
+ i40e_rxq_rearm(rxq);
+ }
/* Before we start moving massive data around, check to see if
* there is actually a packet available
--
2.25.1
* [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
2022-04-20 8:16 ` [PATCH v1 1/5] net/i40e: remove redundant Dtype initialization Feifei Wang
2022-04-20 8:16 ` [PATCH v1 2/5] net/i40e: enable direct rearm mode Feifei Wang
@ 2022-04-20 8:16 ` Feifei Wang
2022-04-20 9:59 ` Morten Brørup
` (3 more replies)
2022-04-20 8:16 ` [PATCH v1 4/5] net/i40e: add direct rearm mode internal API Feifei Wang
` (10 subsequent siblings)
13 siblings, 4 replies; 145+ messages in thread
From: Feifei Wang @ 2022-04-20 8:16 UTC (permalink / raw)
To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Add API for enabling direct rearm mode and for mapping RX and TX
queues. Currently, the API supports 1:1(txq : rxq) mapping.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
lib/ethdev/rte_ethdev.c | 14 ++++++++++++++
lib/ethdev/rte_ethdev.h | 31 +++++++++++++++++++++++++++++++
lib/ethdev/version.map | 1 +
4 files changed, 61 insertions(+)
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc21d8..22022f6da9 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct rte_eth_dev *dev,
typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
uint16_t rx_queue_id);
+/** @internal Enable direct rearm of a receive queue of an Ethernet device. */
+typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
+ uint16_t queue_id);
+
+/**< @internal map Rx/Tx queue of direct rearm mode */
+typedef int (*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ uint16_t tx_port_id,
+ uint16_t tx_queue_id);
+
/** @internal Release memory resources allocated by given Rx/Tx queue. */
typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
uint16_t queue_id);
@@ -1152,6 +1162,11 @@ struct eth_dev_ops {
/** Disable Rx queue interrupt */
eth_rx_disable_intr_t rx_queue_intr_disable;
+ /** Enable Rx queue direct rearm mode */
+ eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
+ /** Map Rx/Tx queue for direct rearm mode */
+ eth_rx_direct_rearm_map_t rx_queue_direct_rearm_map;
+
eth_tx_queue_setup_t tx_queue_setup;/**< Set up device Tx queue */
eth_queue_release_t tx_queue_release; /**< Release Tx queue */
eth_tx_done_cleanup_t tx_done_cleanup;/**< Free Tx ring mbufs */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 29a3d80466..8e6f0284f4 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t port_id, uint16_t tx_queue_id,
return eth_err(port_id, ret);
}
+int
+rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
+ uint16_t tx_port_id, uint16_t tx_queue_id)
+{
+ struct rte_eth_dev *dev;
+
+ dev = &rte_eth_devices[rx_port_id];
+ (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
+ (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
+ tx_port_id, tx_queue_id);
+
+ return 0;
+}
+
int
rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
{
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04cff8ee10..4a431fcbed 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5190,6 +5190,37 @@ __rte_experimental
int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
struct rte_eth_hairpin_cap *cap);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Enable direct re-arm mode. In this mode the RX queue will be re-armed using
+ * buffers that have completed transmission on the transmit side.
+ *
+ * @note
+ * It is assumed that the buffers have completed transmission belong to the
+ * mempool used at the receive side, and have refcnt = 1.
+ *
+ * @param rx_port_id
+ * Port identifying the receive side.
+ * @param rx_queue_id
+ * The index of the receive queue identifying the receive side.
+ * The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param tx_port_id
+ * Port identifying the transmit side.
+ * @param tx_queue_id
+ * The index of the transmit queue identifying the transmit side.
+ * The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ *
+ * @return
+ * - (0) if successful.
+ */
+__rte_experimental
+int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
+ uint16_t tx_port_id, uint16_t tx_queue_id);
+
/**
* @warning
* @b EXPERIMENTAL: this structure may change without prior notice.
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 20391ab29e..68d664498c 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -279,6 +279,7 @@ EXPERIMENTAL {
rte_flow_async_action_handle_create;
rte_flow_async_action_handle_destroy;
rte_flow_async_action_handle_update;
+ rte_eth_direct_rxrearm_map;
};
INTERNAL {
--
2.25.1
* [PATCH v1 4/5] net/i40e: add direct rearm mode internal API
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
` (2 preceding siblings ...)
2022-04-20 8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
@ 2022-04-20 8:16 ` Feifei Wang
2022-05-11 22:31 ` Konstantin Ananyev
2022-04-20 8:16 ` [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode Feifei Wang
` (9 subsequent siblings)
13 siblings, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2022-04-20 8:16 UTC (permalink / raw)
To: Beilei Xing; +Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
For direct rearm mode, add two internal functions.
One enables direct rearm mode on an Rx queue.
The other maps a Tx queue to an Rx queue so that the Rx queue takes
buffers from that specific Tx queue.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/i40e/i40e_ethdev.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 755786dc10..9e1a523bcc 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -369,6 +369,13 @@ static int i40e_dev_rx_queue_intr_enable(struct rte_eth_dev *dev,
static int i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev,
uint16_t queue_id);
+static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
+ uint16_t queue_id);
+static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ uint16_t tx_port_id,
+ uint16_t tx_queue_id);
+
static int i40e_get_regs(struct rte_eth_dev *dev,
struct rte_dev_reg_info *regs);
@@ -477,6 +484,8 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
.rx_queue_setup = i40e_dev_rx_queue_setup,
.rx_queue_intr_enable = i40e_dev_rx_queue_intr_enable,
.rx_queue_intr_disable = i40e_dev_rx_queue_intr_disable,
+ .rx_queue_direct_rearm_enable = i40e_dev_rx_queue_direct_rearm_enable,
+ .rx_queue_direct_rearm_map = i40e_dev_rx_queue_direct_rearm_map,
.rx_queue_release = i40e_dev_rx_queue_release,
.tx_queue_setup = i40e_dev_tx_queue_setup,
.tx_queue_release = i40e_dev_tx_queue_release,
@@ -11108,6 +11117,31 @@ i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id)
return 0;
}
+static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
+ uint16_t queue_id)
+{
+ struct i40e_rx_queue *rxq;
+
+ rxq = dev->data->rx_queues[queue_id];
+ rxq->direct_rxrearm_enable = 1;
+
+ return 0;
+}
+
+static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id, uint16_t tx_port_id,
+ uint16_t tx_queue_id)
+{
+ struct i40e_rx_queue *rxq;
+
+ rxq = dev->data->rx_queues[rx_queue_id];
+
+ rxq->direct_rxrearm_port = tx_port_id;
+ rxq->direct_rxrearm_queue = tx_queue_id;
+
+ return 0;
+}
+
/**
* This function is used to check if the register is valid.
* Below is the valid registers list for X722 only:
--
2.25.1
* [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
` (3 preceding siblings ...)
2022-04-20 8:16 ` [PATCH v1 4/5] net/i40e: add direct rearm mode internal API Feifei Wang
@ 2022-04-20 8:16 ` Feifei Wang
2022-04-20 10:10 ` Morten Brørup
2022-05-11 22:33 ` Konstantin Ananyev
2022-05-11 23:00 ` [PATCH v1 0/5] Direct re-arming of buffers on receive side Konstantin Ananyev
` (8 subsequent siblings)
13 siblings, 2 replies; 145+ messages in thread
From: Feifei Wang @ 2022-04-20 8:16 UTC (permalink / raw)
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Enable direct rearm mode. The mapping is decided in the data plane based
on the first packet received.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
index bec22c44cd..38ffdf4636 100644
--- a/examples/l3fwd/l3fwd_lpm.c
+++ b/examples/l3fwd/l3fwd_lpm.c
@@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
unsigned lcore_id;
uint64_t prev_tsc, diff_tsc, cur_tsc;
int i, nb_rx;
- uint16_t portid;
+ uint16_t portid, tx_portid;
uint8_t queueid;
struct lcore_conf *qconf;
const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
@@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
const uint16_t n_rx_q = qconf->n_rx_queue;
const uint16_t n_tx_p = qconf->n_tx_port;
+ int direct_rearm_map[n_rx_q];
+
if (n_rx_q == 0) {
RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
return 0;
@@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
portid = qconf->rx_queue_list[i].port_id;
queueid = qconf->rx_queue_list[i].queue_id;
+ direct_rearm_map[i] = 0;
RTE_LOG(INFO, L3FWD,
" -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
lcore_id, portid, queueid);
@@ -209,6 +212,17 @@ lpm_main_loop(__rte_unused void *dummy)
if (nb_rx == 0)
continue;
+ /* Determine the direct rearm mapping based on the first
+ * packet received on the rx queue
+ */
+ if (direct_rearm_map[i] == 0) {
+ tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
+ portid);
+ rte_eth_direct_rxrearm_map(portid, queueid,
+ tx_portid, queueid);
+ direct_rearm_map[i] = 1;
+ }
+
#if defined RTE_ARCH_X86 || defined __ARM_NEON \
|| defined RTE_ARCH_PPC_64
l3fwd_lpm_send_packets(nb_rx, pkts_burst,
--
2.25.1
* RE: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-20 8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
@ 2022-04-20 9:59 ` Morten Brørup
2022-04-29 2:42 ` 回复: " Feifei Wang
2022-04-20 10:41 ` Andrew Rybchenko
` (2 subsequent siblings)
3 siblings, 1 reply; 145+ messages in thread
From: Morten Brørup @ 2022-04-20 9:59 UTC (permalink / raw)
To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
Ray Kinsella
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang
> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Wednesday, 20 April 2022 10.17
>
> Add API for enabling direct rearm mode and for mapping RX and TX
> queues. Currently, the API supports 1:1(txq : rxq) mapping.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> lib/ethdev/rte_ethdev.c | 14 ++++++++++++++
> lib/ethdev/rte_ethdev.h | 31 +++++++++++++++++++++++++++++++
> lib/ethdev/version.map | 1 +
> 4 files changed, 61 insertions(+)
>
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 69d9dc21d8..22022f6da9 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> rte_eth_dev *dev,
> typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
> uint16_t rx_queue_id);
>
> +/** @internal Enable direct rearm of a receive queue of an Ethernet
> device. */
> +typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
> + uint16_t queue_id);
> +
> +/**< @internal map Rx/Tx queue of direct rearm mode */
> +typedef int (*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> + uint16_t rx_queue_id,
> + uint16_t tx_port_id,
> + uint16_t tx_queue_id);
> +
> /** @internal Release memory resources allocated by given Rx/Tx queue.
> */
> typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> uint16_t queue_id);
> @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> /** Disable Rx queue interrupt */
> eth_rx_disable_intr_t rx_queue_intr_disable;
>
> + /** Enable Rx queue direct rearm mode */
> + eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
A disable function seems to be missing.
> + /** Map Rx/Tx queue for direct rearm mode */
> + eth_rx_direct_rearm_map_t rx_queue_direct_rearm_map;
> +
> eth_tx_queue_setup_t tx_queue_setup;/**< Set up device Tx
> queue */
> eth_queue_release_t tx_queue_release; /**< Release Tx
> queue */
> eth_tx_done_cleanup_t tx_done_cleanup;/**< Free Tx ring
> mbufs */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 29a3d80466..8e6f0284f4 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t port_id,
> uint16_t tx_queue_id,
> return eth_err(port_id, ret);
> }
>
> +int
> +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_queue_id)
> +{
> + struct rte_eth_dev *dev;
> +
> + dev = &rte_eth_devices[rx_port_id];
> + (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
> + (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> + tx_port_id, tx_queue_id);
Here you enable the mapping before you configure it. It could cause the driver to use an uninitialized map, if it processes packets between these two function calls.
Error handling is missing. Not all drivers support this feature, and the parameters should be validated.
Regarding driver support, the driver should also expose a capability flag to the application, similar to the RTE_ETH_DEV_CAPA_RXQ_SHARE or RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE flags. The documentation for this flag could include the description of all the restrictions to using it.
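(A sketch of what such an application-side check could look like, assuming
a hypothetical RTE_ETH_DEV_CAPA_RXQ_DIRECT_REARM capability bit were added;
that flag does not exist in the patch as posted:)

	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(rx_port_id, &dev_info);
	if (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_DIRECT_REARM) /* hypothetical */
		rte_eth_direct_rxrearm_map(rx_port_id, rx_queue_id,
				tx_port_id, tx_queue_id);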
> +
> + return 0;
> +}
> +
> int
> rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
> {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8ee10..4a431fcbed 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5190,6 +5190,37 @@ __rte_experimental
> int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
> struct rte_eth_hairpin_cap *cap);
>
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> notice
> + *
> + * Enable direct re-arm mode. In this mode the RX queue will be re-
> armed using
> + * buffers that have completed transmission on the transmit side.
> + *
> + * @note
> + * It is assumed that the buffers have completed transmission belong
> to the
> + * mempool used at the receive side, and have refcnt = 1.
> + *
> + * @param rx_port_id
> + * Port identifying the receive side.
> + * @param rx_queue_id
> + * The index of the receive queue identifying the receive side.
> + * The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> + * to rte_eth_dev_configure().
> + * @param tx_port_id
> + * Port identifying the transmit side.
> + * @param tx_queue_id
> + * The index of the transmit queue identifying the transmit side.
> + * The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> + * to rte_eth_dev_configure().
> + *
> + * @return
> + * - (0) if successful.
> + */
> +__rte_experimental
> +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_queue_id);
> +
I agree with the parameters to your proposed API here. Since the relevant use case only needs 1:1 mapping, exposing an API function to take some sort of array with N:M mappings would be premature, and probably not ever come into play anyway.
How do you remove, disable and/or change a mapping?
> /**
> * @warning
> * @b EXPERIMENTAL: this structure may change without prior notice.
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 20391ab29e..68d664498c 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -279,6 +279,7 @@ EXPERIMENTAL {
> rte_flow_async_action_handle_create;
> rte_flow_async_action_handle_destroy;
> rte_flow_async_action_handle_update;
> + rte_eth_direct_rxrearm_map;
> };
>
> INTERNAL {
> --
> 2.25.1
>
* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-04-20 8:16 ` [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode Feifei Wang
@ 2022-04-20 10:10 ` Morten Brørup
2022-04-21 2:35 ` Honnappa Nagarahalli
2022-05-11 22:33 ` Konstantin Ananyev
1 sibling, 1 reply; 145+ messages in thread
From: Morten Brørup @ 2022-04-20 10:10 UTC (permalink / raw)
To: Feifei Wang; +Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang
> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Wednesday, 20 April 2022 10.17
>
> Enable direct rearm mode. The mapping is decided in the data plane
> based
> on the first packet received.
I usually don't care much about l3fwd, but putting configuration changes in the fast path is just wrong!
Also, l3fwd is often used for benchmarking, and this small piece of code in the fast path will affect benchmark results (although only very little).
Please move it out of the fast path.
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-20 8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
2022-04-20 9:59 ` Morten Brørup
@ 2022-04-20 10:41 ` Andrew Rybchenko
2022-04-29 6:28 ` 回复: " Feifei Wang
2022-04-20 10:50 ` Jerin Jacob
2022-04-21 14:57 ` Stephen Hemminger
3 siblings, 1 reply; 145+ messages in thread
From: Andrew Rybchenko @ 2022-04-20 10:41 UTC (permalink / raw)
To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Ray Kinsella
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang
On 4/20/22 11:16, Feifei Wang wrote:
> Add API for enabling direct rearm mode and for mapping RX and TX
> queues. Currently, the API supports 1:1(txq : rxq) mapping.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> lib/ethdev/rte_ethdev.c | 14 ++++++++++++++
> lib/ethdev/rte_ethdev.h | 31 +++++++++++++++++++++++++++++++
> lib/ethdev/version.map | 1 +
> 4 files changed, 61 insertions(+)
>
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 69d9dc21d8..22022f6da9 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct rte_eth_dev *dev,
> typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
> uint16_t rx_queue_id);
>
> +/** @internal Enable direct rearm of a receive queue of an Ethernet device. */
> +typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
> + uint16_t queue_id);
> +
> +/**< @internal map Rx/Tx queue of direct rearm mode */
> +typedef int (*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> + uint16_t rx_queue_id,
> + uint16_t tx_port_id,
> + uint16_t tx_queue_id);
> +
> /** @internal Release memory resources allocated by given Rx/Tx queue. */
> typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> uint16_t queue_id);
> @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> /** Disable Rx queue interrupt */
> eth_rx_disable_intr_t rx_queue_intr_disable;
>
> + /** Enable Rx queue direct rearm mode */
> + eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> + /** Map Rx/Tx queue for direct rearm mode */
> + eth_rx_direct_rearm_map_t rx_queue_direct_rearm_map;
> +
> eth_tx_queue_setup_t tx_queue_setup;/**< Set up device Tx queue */
> eth_queue_release_t tx_queue_release; /**< Release Tx queue */
> eth_tx_done_cleanup_t tx_done_cleanup;/**< Free Tx ring mbufs */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 29a3d80466..8e6f0284f4 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t port_id, uint16_t tx_queue_id,
> return eth_err(port_id, ret);
> }
>
> +int
> +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_queue_id)
> +{
> + struct rte_eth_dev *dev;
> +
> + dev = &rte_eth_devices[rx_port_id];
I think it is rather a control path operation. So:
We need standard checks that rx_port_id is valid.
tx_port_id must be checked as well.
rx_queue_id and tx_queue_id must be checked to be in the valid range.
> + (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
> + (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> + tx_port_id, tx_queue_id);
We must check that function pointers are not NULL as usual.
Return values must be checked.
Isn't it safer to set up the map first and then enable?
Otherwise we definitely need a disable operation.
Also, what should happen on Tx port unplug? How to continue if
we still have Rx port up and running?
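To make the above concrete, a minimal sketch (not the actual implementation; it assumes the dev_ops names introduced by this patch and would live in rte_ethdev.c) of the wrapper with those checks, mapping before enabling, could be:
int
rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
		uint16_t tx_port_id, uint16_t tx_queue_id)
{
	struct rte_eth_dev *rx_dev, *tx_dev;
	int ret;
	/* Standard port validation, as done for other ethdev APIs. */
	RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, -ENODEV);
	RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, -ENODEV);
	rx_dev = &rte_eth_devices[rx_port_id];
	tx_dev = &rte_eth_devices[tx_port_id];
	/* Queue indexes must be within the configured ranges. */
	if (rx_queue_id >= rx_dev->data->nb_rx_queues ||
	    tx_queue_id >= tx_dev->data->nb_tx_queues)
		return -EINVAL;
	/* Not all drivers implement these callbacks. */
	if (rx_dev->dev_ops->rx_queue_direct_rearm_map == NULL ||
	    rx_dev->dev_ops->rx_queue_direct_rearm_enable == NULL)
		return -ENOTSUP;
	/* Configure the mapping first; enable only if that succeeded. */
	ret = (*rx_dev->dev_ops->rx_queue_direct_rearm_map)(rx_dev,
			rx_queue_id, tx_port_id, tx_queue_id);
	if (ret != 0)
		return ret;
	return (*rx_dev->dev_ops->rx_queue_direct_rearm_enable)(rx_dev,
			rx_queue_id);
}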
> +
> + return 0;
> +}
> +
> int
> rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
> {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8ee10..4a431fcbed 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5190,6 +5190,37 @@ __rte_experimental
> int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
> struct rte_eth_hairpin_cap *cap);
>
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable direct re-arm mode. In this mode the RX queue will be re-armed using
> + * buffers that have completed transmission on the transmit side.
> + *
> + * @note
> + * It is assumed that the buffers have completed transmission belong to the
> + * mempool used at the receive side, and have refcnt = 1.
I think it is possible to avoid such limitations, but the
implementation will be less optimized - more checks.
> + *
> + * @param rx_port_id
> + * Port identifying the receive side.
> + * @param rx_queue_id
> + * The index of the receive queue identifying the receive side.
> + * The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + * @param tx_port_id
> + * Port identifying the transmit side.
I guess there is an assumption that the Rx and Tx ports are
serviced by the same driver. If so, and if it is an API
limitation, the ethdev layer must check it.
> + * @param tx_queue_id
> + * The index of the transmit queue identifying the transmit side.
> + * The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + *
> + * @return
> + * - (0) if successful.
> + */
> +__rte_experimental
> +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_queue_id);
> +
> /**
> * @warning
> * @b EXPERIMENTAL: this structure may change without prior notice.
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 20391ab29e..68d664498c 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -279,6 +279,7 @@ EXPERIMENTAL {
> rte_flow_async_action_handle_create;
> rte_flow_async_action_handle_destroy;
> rte_flow_async_action_handle_update;
> + rte_eth_direct_rxrearm_map;
> };
>
> INTERNAL {
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-20 8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
2022-04-20 9:59 ` Morten Brørup
2022-04-20 10:41 ` Andrew Rybchenko
@ 2022-04-20 10:50 ` Jerin Jacob
2022-05-02 3:09 ` 回复: " Feifei Wang
2022-04-21 14:57 ` Stephen Hemminger
3 siblings, 1 reply; 145+ messages in thread
From: Jerin Jacob @ 2022-04-20 10:50 UTC (permalink / raw)
To: Feifei Wang
Cc: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella,
dpdk-dev, nd, Honnappa Nagarahalli, Ruifeng Wang
On Wed, Apr 20, 2022 at 1:47 PM Feifei Wang <feifei.wang2@arm.com> wrote:
>
> Add API for enabling direct rearm mode and for mapping RX and TX
> queues. Currently, the API supports 1:1(txq : rxq) mapping.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> + *
> + * @return
> + * - (0) if successful.
> + */
> +__rte_experimental
> +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_queue_id);
Won't existing rte_eth_hairpin_* APIs work to achieve the same?
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-04-20 10:10 ` Morten Brørup
@ 2022-04-21 2:35 ` Honnappa Nagarahalli
2022-04-21 6:40 ` Morten Brørup
0 siblings, 1 reply; 145+ messages in thread
From: Honnappa Nagarahalli @ 2022-04-21 2:35 UTC (permalink / raw)
To: Morten Brørup, Feifei Wang; +Cc: dev, nd, Ruifeng Wang, nd
<snip>
>
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Wednesday, 20 April 2022 10.17
> >
> > Enable direct rearm mode. The mapping is decided in the data plane
> > based on the first packet received.
>
> I usually don't care much about l3fwd, but putting configuration changes in the
> fast path is just wrong!
I would say it depends. In this case the cycles consumed by the API are very few, and the configuration data is very small and already in the cache, since the PMD has accessed the same data structure.
If the configuration needs more cycles than typical (application-dependent) data plane packet processing, or brings an enormous amount of data into the cache, it should not be done on the data plane.
>
> Also, l3fwd is often used for benchmarking, and this small piece of code in the
> fast path will affect benchmark results (although only very little).
We do not see any impact on the performance numbers. The reason for putting it in the data plane was that it covers a wider use case in this L3fwd application. If the app were simpler, the configuration could be done from the control plane. Unfortunately, the performance of the L3fwd application matters.
>
> Please move it out of the fast path.
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-04-21 2:35 ` Honnappa Nagarahalli
@ 2022-04-21 6:40 ` Morten Brørup
2022-05-10 22:01 ` Honnappa Nagarahalli
0 siblings, 1 reply; 145+ messages in thread
From: Morten Brørup @ 2022-04-21 6:40 UTC (permalink / raw)
To: Honnappa Nagarahalli, Feifei Wang; +Cc: dev, nd, Ruifeng Wang, nd
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, 21 April 2022 04.35
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Wednesday, 20 April 2022 10.17
> > >
> > > Enable direct rearm mode. The mapping is decided in the data plane
> > > based on the first packet received.
> >
> > I usually don't care much about l3fwd, but putting configuration
> changes in the
> > fast path is just wrong!
> I would say it depends. In this case the cycles consumed by the API are
> very less and configuration data is very small and is already in the
> cache as PMD has accessed the same data structure.
>
> If the configuration needs more cycles than a typical (depending on the
> application) data plane packet processing needs or brings in enormous
> amount of data in to the cache, it should not be done on the data
> plane.
>
As a matter of principle, configuration changes should be done outside the fast path.
If we allow an exception for this feature, it will set a bad precedent about where to put configuration code.
> >
> > Also, l3fwd is often used for benchmarking, and this small piece of
> code in the
> > fast path will affect benchmark results (although only very little).
> We do not see any impact on the performance numbers. The reason for
> putting in the data plane was it covers wider use case in this L3fwd
> application. If the app were to be simple, the configuration could be
> done from the control plane. Unfortunately, the performance of L3fwd
> application matters.
>
Let's proceed down that path for the sake of discussion... Then the fast path is missing runtime verification that all preconditions for using remapping are present at any time.
> >
> > Please move it out of the fast path.
BTW, this patch does not call the rte_eth_direct_rxrearm_enable() to enable the feature.
And finally, this feature should be disabled by default, and only enabled by a command line parameter or similar. Otherwise, future l3fwd NIC performance reports will provide misleading performance results, if the feature is utilized. Application developers, when comparing NIC performance results, don't care about the performance for this unique use case; they care about the performance for the generic use case.
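For reference, gating the feature in l3fwd could be as simple as a dedicated long option; the option name and flag variable below are purely illustrative and are not part of the existing l3fwd code.
#include <getopt.h>
#include <stdbool.h>
static bool enable_direct_rearm;	/* disabled by default */
static const struct option lgopts[] = {
	{"direct-rearm", no_argument, NULL, 'D'},	/* hypothetical option */
	{NULL, 0, NULL, 0}
};
static void
parse_direct_rearm_arg(int argc, char **argv)
{
	int opt;
	/* Only the hypothetical option is parsed here for brevity. */
	while ((opt = getopt_long(argc, argv, "", lgopts, NULL)) != -1) {
		if (opt == 'D')
			enable_direct_rearm = true;
	}
}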
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-20 8:16 ` [PATCH v1 3/5] ethdev: add API for " Feifei Wang
` (2 preceding siblings ...)
2022-04-20 10:50 ` Jerin Jacob
@ 2022-04-21 14:57 ` Stephen Hemminger
2022-04-29 6:35 ` 回复: " Feifei Wang
3 siblings, 1 reply; 145+ messages in thread
From: Stephen Hemminger @ 2022-04-21 14:57 UTC (permalink / raw)
To: Feifei Wang
Cc: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella,
dev, nd, Honnappa Nagarahalli, Ruifeng Wang
On Wed, 20 Apr 2022 16:16:48 +0800
Feifei Wang <feifei.wang2@arm.com> wrote:
> Add API for enabling direct rearm mode and for mapping RX and TX
> queues. Currently, the API supports 1:1(txq : rxq) mapping.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> lib/ethdev/rte_ethdev.c | 14 ++++++++++++++
> lib/ethdev/rte_ethdev.h | 31 +++++++++++++++++++++++++++++++
> lib/ethdev/version.map | 1 +
> 4 files changed, 61 insertions(+)
>
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 69d9dc21d8..22022f6da9 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct rte_eth_dev *dev,
> typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
> uint16_t rx_queue_id);
>
> +/** @internal Enable direct rearm of a receive queue of an Ethernet device. */
> +typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
> + uint16_t queue_id);
> +
> +/**< @internal map Rx/Tx queue of direct rearm mode */
> +typedef int (*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> + uint16_t rx_queue_id,
> + uint16_t tx_port_id,
> + uint16_t tx_queue_id);
> +
> /** @internal Release memory resources allocated by given Rx/Tx queue. */
> typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> uint16_t queue_id);
> @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> /** Disable Rx queue interrupt */
> eth_rx_disable_intr_t rx_queue_intr_disable;
>
> + /** Enable Rx queue direct rearm mode */
> + eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> + /** Map Rx/Tx queue for direct rearm mode */
> + eth_rx_direct_rearm_map_t rx_queue_direct_rearm_map;
> +
> eth_tx_queue_setup_t tx_queue_setup;/**< Set up device Tx queue */
> eth_queue_release_t tx_queue_release; /**< Release Tx queue */
> eth_tx_done_cleanup_t tx_done_cleanup;/**< Free Tx ring mbufs */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 29a3d80466..8e6f0284f4 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t port_id, uint16_t tx_queue_id,
> return eth_err(port_id, ret);
> }
>
> +int
> +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_queue_id)
> +{
> + struct rte_eth_dev *dev;
> +
> + dev = &rte_eth_devices[rx_port_id];
> + (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev, rx_queue_id);
> + (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> + tx_port_id, tx_queue_id);
> +
> + return 0;
> +}
> +
> int
> rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
> {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8ee10..4a431fcbed 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5190,6 +5190,37 @@ __rte_experimental
> int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
> struct rte_eth_hairpin_cap *cap);
>
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Enable direct re-arm mode. In this mode the RX queue will be re-armed using
> + * buffers that have completed transmission on the transmit side.
> + *
> + * @note
> + * It is assumed that the buffers have completed transmission belong to the
> + * mempool used at the receive side, and have refcnt = 1.
> + *
> + * @param rx_port_id
> + * Port identifying the receive side.
> + * @param rx_queue_id
> + * The index of the receive queue identifying the receive side.
> + * The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + * @param tx_port_id
> + * Port identifying the transmit side.
> + * @param tx_queue_id
> + * The index of the transmit queue identifying the transmit side.
> + * The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + *
> + * @return
> + * - (0) if successful.
> + */
> +__rte_experimental
> +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_queue_id);
Just looking at this.
Why is this done via API call and not a flag as part of the receive config?
All the other offload and configuration happens via dev config.
Doing it this way doesn't follow the existing ethdev model.
^ permalink raw reply [flat|nested] 145+ messages in thread
* 回复: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-20 9:59 ` Morten Brørup
@ 2022-04-29 2:42 ` Feifei Wang
0 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2022-04-29 2:42 UTC (permalink / raw)
To: Morten Brørup, thomas, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd
> -----邮件原件-----
> 发件人: Morten Brørup <mb@smartsharesystems.com>
> 发送时间: Wednesday, April 20, 2022 5:59 PM
> 收件人: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@intel.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella <mdr@ashroe.eu>
> 抄送: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> 主题: RE: [PATCH v1 3/5] ethdev: add API for direct rearm mode
>
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Wednesday, 20 April 2022 10.17
> >
> > Add API for enabling direct rearm mode and for mapping RX and TX
> > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> > lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> > lib/ethdev/rte_ethdev.c | 14 ++++++++++++++
> > lib/ethdev/rte_ethdev.h | 31 +++++++++++++++++++++++++++++++
> > lib/ethdev/version.map | 1 +
> > 4 files changed, 61 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 69d9dc21d8..22022f6da9 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> > rte_eth_dev *dev, typedef int (*eth_rx_disable_intr_t)(struct
> > rte_eth_dev *dev,
> > uint16_t rx_queue_id);
> >
> > +/** @internal Enable direct rearm of a receive queue of an Ethernet
> > device. */
> > +typedef int (*eth_rx_direct_rearm_enable_t)(struct rte_eth_dev *dev,
> > + uint16_t queue_id);
> > +
> > +/**< @internal map Rx/Tx queue of direct rearm mode */ typedef int
> > +(*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> > + uint16_t rx_queue_id,
> > + uint16_t tx_port_id,
> > + uint16_t tx_queue_id);
> > +
> > /** @internal Release memory resources allocated by given Rx/Tx queue.
> > */
> > typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> > uint16_t queue_id);
> > @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> > /** Disable Rx queue interrupt */
> > eth_rx_disable_intr_t rx_queue_intr_disable;
> >
> > + /** Enable Rx queue direct rearm mode */
> > + eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
>
> A disable function seems to be missing.
[Feifei] I will try to use offload bits to enable direct-rearm mode; thus, this enable function will be
removed and a disable function will be unnecessary.
>
> > + /** Map Rx/Tx queue for direct rearm mode */
> > + eth_rx_direct_rearm_map_t rx_queue_direct_rearm_map;
> > +
> > eth_tx_queue_setup_t tx_queue_setup;/**< Set up device Tx
> > queue */
> > eth_queue_release_t tx_queue_release; /**< Release Tx
> > queue */
> > eth_tx_done_cleanup_t tx_done_cleanup;/**< Free Tx ring
> > mbufs */
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 29a3d80466..8e6f0284f4 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t
> > port_id, uint16_t tx_queue_id,
> > return eth_err(port_id, ret);
> > }
> >
> > +int
> > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> > + uint16_t tx_port_id, uint16_t tx_queue_id) {
> > + struct rte_eth_dev *dev;
> > +
> > + dev = &rte_eth_devices[rx_port_id];
> > + (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> rx_queue_id);
> > + (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > + tx_port_id, tx_queue_id);
>
> Here you enable the mapping before you configure it. It could cause the
> driver to use an uninitialized map, if it processes packets between these two
> function calls.
[Feifei] I agree with this and will change the code.
>
> Error handling is missing. Not all drivers support this feature, and the
> parameters should be validated.
[Feifei] You are right. I think after we use the 'rxq->offload' bits, we can use the offload bits API
to check whether the driver supports this.
For the parameters, I will add some checks.
>
> Regarding driver support, the driver should also expose a capability flag to the
> application, similar to the RTE_ETH_DEV_CAPA_RXQ_SHARE or
> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE flags. The documentation for this
> flag could include the description of all the restrictions to using it.
[Feifei] I will do this with the 'rxq->offload' bits, and add a description to the documentation.
>
> > +
> > + return 0;
> > +}
> > +
> > int
> > rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port) { diff
> > --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..4a431fcbed 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -5190,6 +5190,37 @@ __rte_experimental int
> > rte_eth_dev_hairpin_capability_get(uint16_t port_id,
> > struct rte_eth_hairpin_cap *cap);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > notice
> > + *
> > + * Enable direct re-arm mode. In this mode the RX queue will be re-
> > armed using
> > + * buffers that have completed transmission on the transmit side.
> > + *
> > + * @note
> > + * It is assumed that the buffers have completed transmission belong
> > to the
> > + * mempool used at the receive side, and have refcnt = 1.
> > + *
> > + * @param rx_port_id
> > + * Port identifying the receive side.
> > + * @param rx_queue_id
> > + * The index of the receive queue identifying the receive side.
> > + * The value must be in the range [0, nb_rx_queue - 1] previously
> > supplied
> > + * to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + * Port identifying the transmit side.
> > + * @param tx_queue_id
> > + * The index of the transmit queue identifying the transmit side.
> > + * The value must be in the range [0, nb_tx_queue - 1] previously
> > supplied
> > + * to rte_eth_dev_configure().
> > + *
> > + * @return
> > + * - (0) if successful.
> > + */
> > +__rte_experimental
> > +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> > rx_queue_id,
> > + uint16_t tx_port_id, uint16_t tx_queue_id);
> > +
>
> I agree with the parameters to your proposed API here. Since the relevant
> use case only needs 1:1 mapping, exposing an API function to take some sort
> of array with N:M mappings would be premature, and probably not ever come
> into play anyway.
>
> How do you remove, disable and/or change a mapping?
[Feifei] It is not recommended that users change the map while packets are being sent and received,
as this may introduce error risks. If the user wants to change the mapping, they need to stop the device and call
the 'rte_eth_direct_rxrearm_map' API to rewrite the mapping.
Furthermore, the 'rxq->offload' bits need to be set before the device starts. If the user wants to change them, the device needs to be restarted.
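In other words, a hypothetical application-side sequence for changing a mapping under this restriction would be roughly:
#include <rte_ethdev.h>
/* Sketch only: re-mapping is assumed to require the port to be stopped. */
static int
remap_direct_rearm(uint16_t rx_port, uint16_t rx_queue,
		   uint16_t new_tx_port, uint16_t new_tx_queue)
{
	int ret;
	ret = rte_eth_dev_stop(rx_port);
	if (ret != 0)
		return ret;
	ret = rte_eth_direct_rxrearm_map(rx_port, rx_queue,
					 new_tx_port, new_tx_queue);
	if (ret != 0)
		return ret;
	return rte_eth_dev_start(rx_port);
}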
>
> > /**
> > * @warning
> > * @b EXPERIMENTAL: this structure may change without prior notice.
> > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > 20391ab29e..68d664498c 100644
> > --- a/lib/ethdev/version.map
> > +++ b/lib/ethdev/version.map
> > @@ -279,6 +279,7 @@ EXPERIMENTAL {
> > rte_flow_async_action_handle_create;
> > rte_flow_async_action_handle_destroy;
> > rte_flow_async_action_handle_update;
> > + rte_eth_direct_rxrearm_map;
> > };
> >
> > INTERNAL {
> > --
> > 2.25.1
> >
^ permalink raw reply [flat|nested] 145+ messages in thread
* 回复: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-20 10:41 ` Andrew Rybchenko
@ 2022-04-29 6:28 ` Feifei Wang
2022-05-10 22:49 ` Honnappa Nagarahalli
0 siblings, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2022-04-29 6:28 UTC (permalink / raw)
To: Andrew Rybchenko, thomas, Ferruh Yigit, Ray Kinsella
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd
> -----邮件原件-----
> 发件人: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> 发送时间: Wednesday, April 20, 2022 6:41 PM
> 收件人: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@intel.com>; Ray Kinsella <mdr@ashroe.eu>
> 抄送: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> 主题: Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
>
> On 4/20/22 11:16, Feifei Wang wrote:
> > Add API for enabling direct rearm mode and for mapping RX and TX
> > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> > lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> > lib/ethdev/rte_ethdev.c | 14 ++++++++++++++
> > lib/ethdev/rte_ethdev.h | 31 +++++++++++++++++++++++++++++++
> > lib/ethdev/version.map | 1 +
> > 4 files changed, 61 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 69d9dc21d8..22022f6da9 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> rte_eth_dev *dev,
> > typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
> > uint16_t rx_queue_id);
> >
> > +/** @internal Enable direct rearm of a receive queue of an Ethernet
> > +device. */ typedef int (*eth_rx_direct_rearm_enable_t)(struct
> rte_eth_dev *dev,
> > + uint16_t queue_id);
> > +
> > +/**< @internal map Rx/Tx queue of direct rearm mode */ typedef int
> > +(*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> > + uint16_t rx_queue_id,
> > + uint16_t tx_port_id,
> > + uint16_t tx_queue_id);
> > +
> > /** @internal Release memory resources allocated by given Rx/Tx queue.
> */
> > typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> > uint16_t queue_id);
> > @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> > /** Disable Rx queue interrupt */
> > eth_rx_disable_intr_t rx_queue_intr_disable;
> >
> > + /** Enable Rx queue direct rearm mode */
> > + eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> > + /** Map Rx/Tx queue for direct rearm mode */
> > + eth_rx_direct_rearm_map_t rx_queue_direct_rearm_map;
> > +
> > eth_tx_queue_setup_t tx_queue_setup;/**< Set up device Tx
> queue */
> > eth_queue_release_t tx_queue_release; /**< Release Tx queue
> */
> > eth_tx_done_cleanup_t tx_done_cleanup;/**< Free Tx ring mbufs
> */
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 29a3d80466..8e6f0284f4 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t
> port_id, uint16_t tx_queue_id,
> > return eth_err(port_id, ret);
> > }
> >
> > +int
> > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> > + uint16_t tx_port_id, uint16_t tx_queue_id) {
> > + struct rte_eth_dev *dev;
> > +
> > + dev = &rte_eth_devices[rx_port_id];
>
> I think it is rather control path. So:
> We need standard checks that rx_port_id is valid.
> tx_port_id must be checked as well.
> rx_queue_id and tx_queue_id must be checked to be in the rate.
[Feifei] You are right, I will add checks for these.
>
> > + (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> rx_queue_id);
> > + (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > + tx_port_id, tx_queue_id);
>
> We must check that function pointers are not NULL as usual.
> Return values must be checked.
[Feifei] I agree with this. The checks for the function pointers and return values will be added.
> Isn't is safe to setup map and than enable.
> Otherwise we definitely need disable.
[Feifei] I will change the code to set up the map first and then set 'rxq->offload' to enable direct-rearm mode.
> Also, what should happen on Tx port unplug? How to continue if we still have
> Rx port up and running?
[Feifei] For direct rearm mode, if the Tx port is unplugged, it means there are no buffers from Tx.
In that case, Rx will take buffers from the mempool as usual for rearming.
>
> > +
> > + return 0;
> > +}
> > +
> > int
> > rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port)
> > {
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..4a431fcbed 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -5190,6 +5190,37 @@ __rte_experimental
> > int rte_eth_dev_hairpin_capability_get(uint16_t port_id,
> > struct rte_eth_hairpin_cap *cap);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Enable direct re-arm mode. In this mode the RX queue will be
> > +re-armed using
> > + * buffers that have completed transmission on the transmit side.
> > + *
> > + * @note
> > + * It is assumed that the buffers have completed transmission belong to
> the
> > + * mempool used at the receive side, and have refcnt = 1.
>
> I think it is possible to avoid such limitations, but implementation will be less
> optimized - more checks.
[Feifei] For the first limitation: Rx and Tx buffers should be from the same mempool.
If we wanted to check this, we would have to add a check for each packet in the data plane, which would
significantly reduce performance. So, it is better to document this for users rather than adding
check code.
For the second limitation: refcnt = 1. We have now added code to support 'refcnt = 1' in direct-rearm
mode, so this note can be removed in the next version.
>
> > + *
> > + * @param rx_port_id
> > + * Port identifying the receive side.
> > + * @param rx_queue_id
> > + * The index of the receive queue identifying the receive side.
> > + * The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + * to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + * Port identifying the transmit side.
>
> I guess there is an assumption that Rx and Tx ports are serviced by the same
> driver. If so and if it is an API limitation, ethdev layer must check it.
[Feifei] I agree with this. For the check that the Rx and Tx ports should be serviced by the same driver,
I will add a check for this.
>
> > + * @param tx_queue_id
> > + * The index of the transmit queue identifying the transmit side.
> > + * The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + * to rte_eth_dev_configure().
> > + *
> > + * @return
> > + * - (0) if successful.
> > + */
> > +__rte_experimental
> > +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> > + uint16_t tx_port_id, uint16_t tx_queue_id);
> > +
> > /**
> > * @warning
> > * @b EXPERIMENTAL: this structure may change without prior notice.
> > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > 20391ab29e..68d664498c 100644
> > --- a/lib/ethdev/version.map
> > +++ b/lib/ethdev/version.map
> > @@ -279,6 +279,7 @@ EXPERIMENTAL {
> > rte_flow_async_action_handle_create;
> > rte_flow_async_action_handle_destroy;
> > rte_flow_async_action_handle_update;
> > + rte_eth_direct_rxrearm_map;
> > };
> >
> > INTERNAL {
^ permalink raw reply [flat|nested] 145+ messages in thread
* 回复: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-21 14:57 ` Stephen Hemminger
@ 2022-04-29 6:35 ` Feifei Wang
0 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2022-04-29 6:35 UTC (permalink / raw)
To: Stephen Hemminger
Cc: thomas, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, dev, nd,
Honnappa Nagarahalli, Ruifeng Wang, nd
> -----邮件原件-----
> 发件人: Stephen Hemminger <stephen@networkplumber.org>
> 发送时间: Thursday, April 21, 2022 10:58 PM
> 收件人: Feifei Wang <Feifei.Wang2@arm.com>
> 抄送: thomas@monjalon.net; Ferruh Yigit <ferruh.yigit@intel.com>; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> <mdr@ashroe.eu>; dev@dpdk.org; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> 主题: Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
>
> On Wed, 20 Apr 2022 16:16:48 +0800
> Feifei Wang <feifei.wang2@arm.com> wrote:
>
> > Add API for enabling direct rearm mode and for mapping RX and TX
> > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> > lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> > lib/ethdev/rte_ethdev.c | 14 ++++++++++++++
> > lib/ethdev/rte_ethdev.h | 31 +++++++++++++++++++++++++++++++
> > lib/ethdev/version.map | 1 +
> > 4 files changed, 61 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 69d9dc21d8..22022f6da9 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> > rte_eth_dev *dev, typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev
> *dev,
> > uint16_t rx_queue_id);
> >
> > +/** @internal Enable direct rearm of a receive queue of an Ethernet
> > +device. */ typedef int (*eth_rx_direct_rearm_enable_t)(struct
> rte_eth_dev *dev,
> > + uint16_t queue_id);
> > +
> > +/**< @internal map Rx/Tx queue of direct rearm mode */ typedef int
> > +(*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> > + uint16_t rx_queue_id,
> > + uint16_t tx_port_id,
> > + uint16_t tx_queue_id);
> > +
> > /** @internal Release memory resources allocated by given Rx/Tx
> > queue. */ typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> > uint16_t queue_id);
> > @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> > /** Disable Rx queue interrupt */
> > eth_rx_disable_intr_t rx_queue_intr_disable;
> >
> > + /** Enable Rx queue direct rearm mode */
> > + eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> > + /** Map Rx/Tx queue for direct rearm mode */
> > + eth_rx_direct_rearm_map_t rx_queue_direct_rearm_map;
> > +
> > eth_tx_queue_setup_t tx_queue_setup;/**< Set up device Tx
> queue */
> > eth_queue_release_t tx_queue_release; /**< Release Tx queue
> */
> > eth_tx_done_cleanup_t tx_done_cleanup;/**< Free Tx ring mbufs
> */
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 29a3d80466..8e6f0284f4 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t
> port_id, uint16_t tx_queue_id,
> > return eth_err(port_id, ret);
> > }
> >
> > +int
> > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> > + uint16_t tx_port_id, uint16_t tx_queue_id) {
> > + struct rte_eth_dev *dev;
> > +
> > + dev = &rte_eth_devices[rx_port_id];
> > + (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> rx_queue_id);
> > + (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > + tx_port_id, tx_queue_id);
> > +
> > + return 0;
> > +}
> > +
> > int
> > rte_eth_hairpin_bind(uint16_t tx_port, uint16_t rx_port) { diff
> > --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..4a431fcbed 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -5190,6 +5190,37 @@ __rte_experimental int
> > rte_eth_dev_hairpin_capability_get(uint16_t port_id,
> > struct rte_eth_hairpin_cap *cap);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Enable direct re-arm mode. In this mode the RX queue will be
> > +re-armed using
> > + * buffers that have completed transmission on the transmit side.
> > + *
> > + * @note
> > + * It is assumed that the buffers have completed transmission belong to
> the
> > + * mempool used at the receive side, and have refcnt = 1.
> > + *
> > + * @param rx_port_id
> > + * Port identifying the receive side.
> > + * @param rx_queue_id
> > + * The index of the receive queue identifying the receive side.
> > + * The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + * to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + * Port identifying the transmit side.
> > + * @param tx_queue_id
> > + * The index of the transmit queue identifying the transmit side.
> > + * The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + * to rte_eth_dev_configure().
> > + *
> > + * @return
> > + * - (0) if successful.
> > + */
> > +__rte_experimental
> > +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> > + uint16_t tx_port_id, uint16_t tx_queue_id);
>
> Just looking at this.
>
> Why is this done via API call and not a flag as part of the receive config?
> All the other offload and configuration happens via dev config.
> Doing it this way doesn't follow the existing ethdev model.
[Feifei] Agree with this. I will remove the direct-rearm enable function and
use a "rxq->offload" bit to enable it.
For rte_eth_direct_rxrearm_map, I think it is still necessary for users to call it to map the Rx/Tx queues.
^ permalink raw reply [flat|nested] 145+ messages in thread
* 回复: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-20 10:50 ` Jerin Jacob
@ 2022-05-02 3:09 ` Feifei Wang
0 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2022-05-02 3:09 UTC (permalink / raw)
To: Jerin Jacob
Cc: thomas, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella, dpdk-dev,
nd, Honnappa Nagarahalli, Ruifeng Wang, nd
> -----邮件原件-----
> 发件人: Jerin Jacob <jerinjacobk@gmail.com>
> 发送时间: Wednesday, April 20, 2022 6:50 PM
> 收件人: Feifei Wang <Feifei.Wang2@arm.com>
> 抄送: thomas@monjalon.net; Ferruh Yigit <ferruh.yigit@intel.com>; Andrew
> Rybchenko <andrew.rybchenko@oktetlabs.ru>; Ray Kinsella
> <mdr@ashroe.eu>; dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> 主题: Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
>
> On Wed, Apr 20, 2022 at 1:47 PM Feifei Wang <feifei.wang2@arm.com>
> wrote:
> >
> > Add API for enabling direct rearm mode and for mapping RX and TX
> > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
>
> > + *
> > + * @return
> > + * - (0) if successful.
> > + */
> > +__rte_experimental
> > +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> > + uint16_t tx_port_id, uint16_t
> > +tx_queue_id);
>
> Won't existing rte_eth_hairpin_* APIs work to achieve the same?
[Feifei] Thanks for the comment. Looking at the hairpin feature, which is enabled in the MLX5 driver:
I think the most important difference is that hairpin just redirects packets from an Rx queue
to a Tx queue on the same port, and the Rx/Tx queues can only record the peer queue id.
Direct rearm, on the other hand, can map an Rx queue to a Tx queue on a different port. This requires
the Rx queue to record the paired port id and queue id.
Furthermore, hairpin needs to set up new hairpin queues before it can bind an Rx queue to a Tx queue,
while direct rearm can map normal queues. This is because direct rearm only needs the used buffers and
does not care about the packets themselves.
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-04-21 6:40 ` Morten Brørup
@ 2022-05-10 22:01 ` Honnappa Nagarahalli
2022-05-11 7:17 ` Morten Brørup
0 siblings, 1 reply; 145+ messages in thread
From: Honnappa Nagarahalli @ 2022-05-10 22:01 UTC (permalink / raw)
To: Morten Brørup, Feifei Wang; +Cc: dev, nd, Ruifeng Wang, nd
(apologies for the late response, this one slipped my mind)
I would appreciate it if others could weigh in with their opinions.
<snip>
>
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Thursday, 21 April 2022 04.35
> > >
> > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > Sent: Wednesday, 20 April 2022 10.17
> > > >
> > > > Enable direct rearm mode. The mapping is decided in the data plane
> > > > based on the first packet received.
> > >
> > > I usually don't care much about l3fwd, but putting configuration
> > changes in the
> > > fast path is just wrong!
> > I would say it depends. In this case the cycles consumed by the API
> > are very less and configuration data is very small and is already in
> > the cache as PMD has accessed the same data structure.
> >
> > If the configuration needs more cycles than a typical (depending on
> > the
> > application) data plane packet processing needs or brings in enormous
> > amount of data in to the cache, it should not be done on the data
> > plane.
> >
>
> As a matter of principle, configuration changes should be done outside the fast
> path.
>
> If we allow an exception for this feature, it will set a bad precedent about
> where to put configuration code.
I think there are other examples, though not exactly the same. For example, with the seqlock, we cannot have a scheduled-out writer while holding the lock. But it was mentioned that this can be overcome easily by running the writer on an isolated core (which to me breaks some principles).
>
> > >
> > > Also, l3fwd is often used for benchmarking, and this small piece of
> > code in the
> > > fast path will affect benchmark results (although only very little).
> > We do not see any impact on the performance numbers. The reason for
> > putting in the data plane was it covers wider use case in this L3fwd
> > application. If the app were to be simple, the configuration could be
> > done from the control plane. Unfortunately, the performance of L3fwd
> > application matters.
> >
>
> Let's proceed down that path for the sake of discussion... Then the fast path is
> missing runtime verification that all preconditions for using remapping are
> present at any time.
Agreed, a few checks (ensuring that the TX and RX buffers are from the same pool, and ensuring that tx_rs_thresh is the same as the RX rearm threshold) are missing.
We will add these; it is possible to add these checks outside the packet processing loop.
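For illustration, the same-mempool check done once at setup time could look roughly like the sketch below; the per-port bookkeeping arrays are the application's own (not DPDK APIs), and the tx_rs_thresh/rearm-threshold equality check would belong in the driver's map callback rather than in the application.
#include <stdbool.h>
#include <rte_ethdev.h>
#include <rte_mempool.h>
/* Illustrative application-level bookkeeping: the mempool passed to the Rx
 * queue setup of each port, and the mempool from which the mbufs sent on
 * each Tx port originate (known to the application). */
static struct rte_mempool *rx_pool_of_port[RTE_MAX_ETHPORTS];
static struct rte_mempool *tx_pool_of_port[RTE_MAX_ETHPORTS];
/* Verify, before entering the packet processing loop, that a direct-rearm
 * mapping between rx_port and tx_port satisfies the same-mempool rule. */
static bool
direct_rearm_precondition_ok(uint16_t rx_port, uint16_t tx_port)
{
	return rx_pool_of_port[rx_port] != NULL &&
	       rx_pool_of_port[rx_port] == tx_pool_of_port[tx_port];
}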
>
> > >
> > > Please move it out of the fast path.
>
> BTW, this patch does not call the rte_eth_direct_rxrearm_enable() to enable
> the feature.
>
> And finally, this feature should be disabled by default, and only enabled by a
> command line parameter or similar. Otherwise, future l3fwd NIC performance
> reports will provide misleading performance results, if the feature is utilized.
> Application developers, when comparing NIC performance results, don't care
> about the performance for this unique use case; they care about the
> performance for the generic use case.
>
I think this feature is similar to the fast free feature (RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE), as you mentioned in the other thread. It should be handled similarly to how the fast free feature is handled.
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-04-29 6:28 ` 回复: " Feifei Wang
@ 2022-05-10 22:49 ` Honnappa Nagarahalli
2022-06-03 10:19 ` Andrew Rybchenko
0 siblings, 1 reply; 145+ messages in thread
From: Honnappa Nagarahalli @ 2022-05-10 22:49 UTC (permalink / raw)
To: Feifei Wang, Andrew Rybchenko, thomas, Ferruh Yigit, Ray Kinsella
Cc: dev, nd, Ruifeng Wang, nd
<snip>
> >
> > On 4/20/22 11:16, Feifei Wang wrote:
> > > Add API for enabling direct rearm mode and for mapping RX and TX
> > > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> > >
> > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > ---
> > > lib/ethdev/ethdev_driver.h | 15 +++++++++++++++
> > > lib/ethdev/rte_ethdev.c | 14 ++++++++++++++
> > > lib/ethdev/rte_ethdev.h | 31 +++++++++++++++++++++++++++++++
> > > lib/ethdev/version.map | 1 +
> > > 4 files changed, 61 insertions(+)
> > >
> > > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > > index 69d9dc21d8..22022f6da9 100644
> > > --- a/lib/ethdev/ethdev_driver.h
> > > +++ b/lib/ethdev/ethdev_driver.h
> > > @@ -485,6 +485,16 @@ typedef int (*eth_rx_enable_intr_t)(struct
> > rte_eth_dev *dev,
> > > typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
> > > uint16_t rx_queue_id);
> > >
> > > +/** @internal Enable direct rearm of a receive queue of an Ethernet
> > > +device. */ typedef int (*eth_rx_direct_rearm_enable_t)(struct
> > rte_eth_dev *dev,
> > > + uint16_t queue_id);
> > > +
> > > +/**< @internal map Rx/Tx queue of direct rearm mode */ typedef int
> > > +(*eth_rx_direct_rearm_map_t)(struct rte_eth_dev *dev,
> > > + uint16_t rx_queue_id,
> > > + uint16_t tx_port_id,
> > > + uint16_t tx_queue_id);
> > > +
> > > /** @internal Release memory resources allocated by given Rx/Tx queue.
> > */
> > > typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> > > uint16_t queue_id);
> > > @@ -1152,6 +1162,11 @@ struct eth_dev_ops {
> > > /** Disable Rx queue interrupt */
> > > eth_rx_disable_intr_t rx_queue_intr_disable;
> > >
> > > + /** Enable Rx queue direct rearm mode */
> > > + eth_rx_direct_rearm_enable_t rx_queue_direct_rearm_enable;
> > > + /** Map Rx/Tx queue for direct rearm mode */
> > > + eth_rx_direct_rearm_map_t rx_queue_direct_rearm_map;
> > > +
> > > eth_tx_queue_setup_t tx_queue_setup;/**< Set up device Tx
> > queue */
> > > eth_queue_release_t tx_queue_release; /**< Release Tx queue
> > */
> > > eth_tx_done_cleanup_t tx_done_cleanup;/**< Free Tx ring mbufs
> > */
> > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > > 29a3d80466..8e6f0284f4 100644
> > > --- a/lib/ethdev/rte_ethdev.c
> > > +++ b/lib/ethdev/rte_ethdev.c
> > > @@ -2139,6 +2139,20 @@ rte_eth_tx_hairpin_queue_setup(uint16_t
> > port_id, uint16_t tx_queue_id,
> > > return eth_err(port_id, ret);
> > > }
> > >
> > > +int
> > > +rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t rx_queue_id,
> > > + uint16_t tx_port_id, uint16_t tx_queue_id) {
> > > + struct rte_eth_dev *dev;
> > > +
> > > + dev = &rte_eth_devices[rx_port_id];
> >
> > I think it is rather control path. So:
> > We need standard checks that rx_port_id is valid.
> > tx_port_id must be checked as well.
> > rx_queue_id and tx_queue_id must be checked to be in the rate.
> [Feifei] You are right, I will add check for these.
>
> >
> > > + (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
> > rx_queue_id);
> > > + (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
> > > + tx_port_id, tx_queue_id);
> >
> > We must check that function pointers are not NULL as usual.
> > Return values must be checked.
> [Feifei] I agree with this, The check for pointer and return value will be added
>
> > Isn't is safe to setup map and than enable.
> > Otherwise we definitely need disable.
> [Feifei] I will change code that map first and then set 'rxq->offload' to enable
> direct-rearm mode.
>
> > Also, what should happen on Tx port unplug? How to continue if we
> > still have Rx port up and running?
> [Feifei] For direct rearm mode, if Tx port unplug, it means there is no buffer
> from Tx.
> And then, Rx will put buffer from mempool as usual for rearm.
Andrew, when you say 'TX port unplug', do you mean that 'rte_eth_dev_tx_queue_stop' is called? Is calling 'rte_eth_dev_tx_queue_stop' allowed while the device is running?
>
> >
> > > +
> > > + return 0;
> > > +}
> > > +
<snip>
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-05-10 22:01 ` Honnappa Nagarahalli
@ 2022-05-11 7:17 ` Morten Brørup
0 siblings, 0 replies; 145+ messages in thread
From: Morten Brørup @ 2022-05-11 7:17 UTC (permalink / raw)
To: Honnappa Nagarahalli, Feifei Wang; +Cc: dev, nd, Ruifeng Wang, nd
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 11 May 2022 00.02
>
> (apologies for the late response, this one slipped my mind)
>
> Appreciate if others could weigh their opinions.
>
> <snip>
> >
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Thursday, 21 April 2022 04.35
> > > >
> > > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > > Sent: Wednesday, 20 April 2022 10.17
> > > > >
> > > > > Enable direct rearm mode. The mapping is decided in the data
> plane
> > > > > based on the first packet received.
> > > >
> > > > I usually don't care much about l3fwd, but putting configuration
> > > changes in the
> > > > fast path is just wrong!
> > > I would say it depends. In this case the cycles consumed by the API
> > > are very less and configuration data is very small and is already
> in
> > > the cache as PMD has accessed the same data structure.
> > >
> > > If the configuration needs more cycles than a typical (depending on
> > > the
> > > application) data plane packet processing needs or brings in
> enormous
> > > amount of data in to the cache, it should not be done on the data
> > > plane.
> > >
> >
> > As a matter of principle, configuration changes should be done
> outside the fast
> > path.
> >
> > If we allow an exception for this feature, it will set a bad
> precedent about
> > where to put configuration code.
> I think there are other examples though not exactly the same. For ex:
> the seqlock, we cannot have a scheduled out writer while holding the
> lock. But, it was mentioned that this can be over come easily by
> running the writer on an isolated core (which to me breaks some
> principles).
Referring to a bad example (which breaks some principles) does not change my opinion. ;-)
>
> >
> > > >
> > > > Also, l3fwd is often used for benchmarking, and this small piece
> of
> > > code in the
> > > > fast path will affect benchmark results (although only very
> little).
> > > We do not see any impact on the performance numbers. The reason for
> > > putting in the data plane was it covers wider use case in this
> L3fwd
> > > application. If the app were to be simple, the configuration could
> be
> > > done from the control plane. Unfortunately, the performance of
> L3fwd
> > > application matters.
> > >
> >
> > Let's proceed down that path for the sake of discussion... Then the
> fast path is
> > missing runtime verification that all preconditions for using
> remapping are
> > present at any time.
> Agree, few checks (ensuring that TX and RX buffers are from the same
> pool, ensuring tx_rs_thresh is same as RX rearm threshold) are missing.
> We will add these, it is possible to add these checks outside the
> packet processing loop.
>
> >
> > > >
> > > > Please move it out of the fast path.
> >
> > BTW, this patch does not call the rte_eth_direct_rxrearm_enable() to
> enable
> > the feature.
> >
> > And finally, this feature should be disabled by default, and only
> enabled by a
> > command line parameter or similar. Otherwise, future l3fwd NIC
> performance
> > reports will provide misleading performance results, if the feature
> is utilized.
> > Application developers, when comparing NIC performance results, don't
> care
> > about the performance for this unique use case; they care about the
> > performance for the generic use case.
> >
> I think this feature is similar to fast free feature
> (RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) as you have mentioned in the other
> thread. It should be handled similar to how fast free feature is
> handled.
I agree with this comparison.
Quickly skimming l3fwd/main.c reveals that RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is used without checking preconditions, and thus might be buggy. E.g. what happens when the NICs support RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE, and l3fwd is run with the "per-port-pool" command line option? Obviously, the "direct rearm" patch should not be punished because of bugs in similar features ("fast free"). But it is not a valid reason to allow similar bugs. You mentioned above that precondition checking will be added, so pardon me for ranting a bit here. ;-)
Furthermore, if l3fwd is used for NIC performance reports, I find the results misleading if application-specific optimizations are used without mentioning them in the report. This applies to both the "fast free" and "direct rearm" optimizations - they only work in specific application scenarios, and thus the l3fwd NIC performance test should be run without these optimizations, or at least mention that the report only covers these specific applications. Which is why I prefer that such optimizations must be explicitly enabled through a command line parameter, and not used in testing for official NIC performance reports.
Taking one step back, the real problem here is that an *example* application is used for NIC performance testing, and this is the main reason for my performance-related objections. I should probably object to using l3fwd for NIC performance testing instead.
I don't feel strongly about l3fwd, so I will not object to the l3fwd patch. Just providing some feedback. :-)
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 2/5] net/i40e: enable direct rearm mode
2022-04-20 8:16 ` [PATCH v1 2/5] net/i40e: enable direct rearm mode Feifei Wang
@ 2022-05-11 22:28 ` Konstantin Ananyev
0 siblings, 0 replies; 145+ messages in thread
From: Konstantin Ananyev @ 2022-05-11 22:28 UTC (permalink / raw)
To: Feifei Wang, Beilei Xing, Bruce Richardson, Konstantin Ananyev,
Ruifeng Wang
Cc: dev, nd, Honnappa Nagarahalli
> For i40e driver, enable direct re-arm mode. This patch supports the case
> of mapping Rx/Tx queues from the same single lcore.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> drivers/net/i40e/i40e_rxtx.h | 4 +
> drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
> drivers/net/i40e/i40e_rxtx_vec_avx2.c | 14 +-
> drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
> drivers/net/i40e/i40e_rxtx_vec_neon.c | 141 ++++++++++++-
> drivers/net/i40e/i40e_rxtx_vec_sse.c | 170 ++++++++++++++-
> 6 files changed, 839 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
> index 5e6eecc501..1fdf4305f4 100644
> --- a/drivers/net/i40e/i40e_rxtx.h
> +++ b/drivers/net/i40e/i40e_rxtx.h
> @@ -102,6 +102,8 @@ struct i40e_rx_queue {
>
> uint16_t rxrearm_nb; /**< number of remaining to be re-armed */
> uint16_t rxrearm_start; /**< the idx we start the re-arming from */
> + uint16_t direct_rxrearm_port; /** device TX port ID for direct re-arm mode */
> + uint16_t direct_rxrearm_queue; /** TX queue index for direct re-arm mode */
> uint64_t mbuf_initializer; /**< value to init mbufs */
>
> uint16_t port_id; /**< device port ID */
> @@ -121,6 +123,8 @@ struct i40e_rx_queue {
> uint16_t rx_using_sse; /**<flag indicate the usage of vPMD for rx */
> uint8_t dcb_tc; /**< Traffic class of rx queue */
> uint64_t offloads; /**< Rx offload flags of RTE_ETH_RX_OFFLOAD_* */
> + /**< 0 if direct re-arm mode disabled, 1 when enabled */
> + bool direct_rxrearm_enable;
> const struct rte_memzone *mz;
> };
>
> diff --git a/drivers/net/i40e/i40e_rxtx_common_avx.h b/drivers/net/i40e/i40e_rxtx_common_avx.h
> index cfc1e63173..a742723e07 100644
> --- a/drivers/net/i40e/i40e_rxtx_common_avx.h
> +++ b/drivers/net/i40e/i40e_rxtx_common_avx.h
> @@ -209,6 +209,275 @@ i40e_rxq_rearm_common(struct i40e_rx_queue *rxq, __rte_unused bool avx512)
> /* Update the tail pointer on the NIC */
> I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
> }
> +
> +static __rte_always_inline void
> +i40e_rxq_direct_rearm_common(struct i40e_rx_queue *rxq, __rte_unused bool avx512)
> +{
> + struct rte_eth_dev *dev;
> + struct i40e_tx_queue *txq;
> + volatile union i40e_rx_desc *rxdp;
> + struct i40e_tx_entry *txep;
> + struct i40e_rx_entry *rxep;
> + struct rte_mbuf *m[RTE_I40E_RXQ_REARM_THRESH];
> + uint16_t tx_port_id, tx_queue_id;
> + uint16_t rx_id;
> + uint16_t i, n;
> + uint16_t nb_rearm = 0;
> +
> + rxdp = rxq->rx_ring + rxq->rxrearm_start;
> + rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> + tx_port_id = rxq->direct_rxrearm_port;
> + tx_queue_id = rxq->direct_rxrearm_queue;
> + dev = &rte_eth_devices[tx_port_id];
> + txq = dev->data->tx_queues[tx_queue_id];
> +
> + /* check Rx queue is able to take in the whole
> + * batch of free mbufs from Tx queue
> + */
> + if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
> + /* check DD bits on threshold descriptor */
> + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> + rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
> + goto mempool_bulk;
> + }
> +
> + if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
> + goto mempool_bulk;
I think all these checks (whether this mode can be enabled) should be done at
the config phase, not in the data-path.
> +
> + n = txq->tx_rs_thresh;
> +
> + /* first buffer to free from S/W ring is at index
> + * tx_next_dd - (tx_rs_thresh-1)
> + */
> + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
It really looks bad that the RX function accesses and modifies TXQ data
directly. It would be much better to hide the TXD checking/manipulation in a
separate TXQ function (txq_mbuf() or so) that the RX path can invoke.
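Just to illustrate the idea - a rough sketch of such a helper (the name and
exact signature are invented here), built only from fields this patch already
touches; the RX path would then see nothing but an array of mbuf pointers:

static inline uint16_t
i40e_txq_take_completed_bufs(struct i40e_tx_queue *txq,
		struct rte_mbuf **pkts, uint16_t n)
{
	struct i40e_tx_entry *txep;
	uint16_t i, nb;

	if (n != txq->tx_rs_thresh)
		return 0;

	/* check DD bit on the threshold descriptor */
	if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
			rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
			rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
		return 0;

	/* first buffer to free from S/W ring is at index tx_next_dd - (n - 1) */
	txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];

	if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
		for (i = 0; i < n; i++)
			pkts[i] = txep[i].mbuf;
		nb = n;
	} else {
		for (i = 0, nb = 0; i < n; i++) {
			pkts[nb] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
			if (pkts[nb] != NULL)
				nb++;
		}
	}

	/* TXQ counters are updated here, inside TX code, not in the RX path */
	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
	txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
	if (txq->tx_next_dd >= txq->nb_tx_desc)
		txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);

	/* the caller decides what to do if nb < n */
	return nb;
}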
> +
> + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> + /* directly put mbufs from Tx to Rx,
> + * and initialize the mbufs in vector
> + */
> + for (i = 0; i < n; i++)
> + rxep[i].mbuf = txep[i].mbuf;
> + } else {
> + for (i = 0; i < n; i++) {
> + m[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> + /* ensure each Tx freed buffer is valid */
> + if (m[i] != NULL)
> + nb_rearm++;
> + }
> +
> + if (nb_rearm != n) {
> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> + if (txq->tx_next_dd >= txq->nb_tx_desc)
> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
So if nb_rearm != n here, what happens to the mbufs already collected in m[]?
Are you just dropping/forgetting them?
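If they are indeed just dropped, that's an mbuf leak; something along these
lines (sketch) would at least give them back before falling back to the
mempool path:

	if (nb_rearm != n) {
		/* return the mbufs we already took ownership of */
		for (i = 0; i < nb_rearm; i++)
			rte_mempool_put(m[i]->pool, m[i]);
		/* then update the TX counters and goto mempool_bulk,
		 * as the patch does today
		 */
	}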
> +
> + goto mempool_bulk;
> + } else {
> + for (i = 0; i < n; i++)
> + rxep[i].mbuf = m[i];
> + }
> + }
> +
> + /* update counters for Tx */
> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> + if (txq->tx_next_dd >= txq->nb_tx_desc)
> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> + } else {
I suppose the chunk of code below is just a copy&paste of the existing
i40e_rxq_rearm_common()?
If so, there is no point in duplicating it - better to just invoke it here
(I presume a bit of re-factoring would be needed for that).
Pretty much the same thoughts for the other rearm functions below.
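I.e. something along these lines (sketch):

	/* direct re-arm not possible this time - fall back to the existing
	 * mempool-based rearm instead of duplicating its body here
	 */
	i40e_rxq_rearm_common(rxq, avx512);
	return;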
> +mempool_bulk:
> + /* if TX did not free bufs into Rx sw-ring,
> + * get new bufs from mempool
> + */
> + n = RTE_I40E_RXQ_REARM_THRESH;
> +
> + /* Pull 'n' more MBUFs into the software ring */
> + if (rte_mempool_get_bulk(rxq->mp,
> + (void *)rxep,
> + RTE_I40E_RXQ_REARM_THRESH) < 0) {
> + if (rxq->rxrearm_nb + RTE_I40E_RXQ_REARM_THRESH >=
> + rxq->nb_rx_desc) {
> + __m128i dma_addr0;
> + dma_addr0 = _mm_setzero_si128();
> + for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
> + rxep[i].mbuf = &rxq->fake_mbuf;
> + _mm_store_si128((__m128i *)&rxdp[i].read,
> + dma_addr0);
> + }
> + }
> + rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
> + RTE_I40E_RXQ_REARM_THRESH;
> + return;
> + }
> + }
> +
> +#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> + struct rte_mbuf *mb0, *mb1;
> + __m128i dma_addr0, dma_addr1;
> + __m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
> + RTE_PKTMBUF_HEADROOM);
> + /* Initialize the mbufs in vector, process 2 mbufs in one loop */
> + for (i = 0; i < n; i += 2, rxep += 2) {
> + __m128i vaddr0, vaddr1;
> +
> + mb0 = rxep[0].mbuf;
> + mb1 = rxep[1].mbuf;
> +
> + /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> + offsetof(struct rte_mbuf, buf_addr) + 8);
> + vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> + vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> +
> + /* convert pa to dma_addr hdr/data */
> + dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> + dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
> +
> + /* add headroom to pa values */
> + dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> + dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
> +
> + /* flush desc with pa dma_addr */
> + _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> + _mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
> + }
> +#else
> +#ifdef __AVX512VL__
> + if (avx512) {
> + struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
> + struct rte_mbuf *mb4, *mb5, *mb6, *mb7;
> + __m512i dma_addr0_3, dma_addr4_7;
> + __m512i hdr_room = _mm512_set1_epi64(RTE_PKTMBUF_HEADROOM);
> + /* Initialize the mbufs in vector, process 8 mbufs in one loop */
> + for (i = 0; i < n; i += 8, rxep += 8, rxdp += 8) {
> + __m128i vaddr0, vaddr1, vaddr2, vaddr3;
> + __m128i vaddr4, vaddr5, vaddr6, vaddr7;
> + __m256i vaddr0_1, vaddr2_3;
> + __m256i vaddr4_5, vaddr6_7;
> + __m512i vaddr0_3, vaddr4_7;
> +
> + mb0 = rxep[0].mbuf;
> + mb1 = rxep[1].mbuf;
> + mb2 = rxep[2].mbuf;
> + mb3 = rxep[3].mbuf;
> + mb4 = rxep[4].mbuf;
> + mb5 = rxep[5].mbuf;
> + mb6 = rxep[6].mbuf;
> + mb7 = rxep[7].mbuf;
> +
> + /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> + offsetof(struct rte_mbuf, buf_addr) + 8);
> + vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> + vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> + vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
> + vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
> + vaddr4 = _mm_loadu_si128((__m128i *)&mb4->buf_addr);
> + vaddr5 = _mm_loadu_si128((__m128i *)&mb5->buf_addr);
> + vaddr6 = _mm_loadu_si128((__m128i *)&mb6->buf_addr);
> + vaddr7 = _mm_loadu_si128((__m128i *)&mb7->buf_addr);
> +
> + /**
> + * merge 0 & 1, by casting 0 to 256-bit and inserting 1
> + * into the high lanes. Similarly for 2 & 3, and so on.
> + */
> + vaddr0_1 =
> + _mm256_inserti128_si256(_mm256_castsi128_si256(vaddr0),
> + vaddr1, 1);
> + vaddr2_3 =
> + _mm256_inserti128_si256(_mm256_castsi128_si256(vaddr2),
> + vaddr3, 1);
> + vaddr4_5 =
> + _mm256_inserti128_si256(_mm256_castsi128_si256(vaddr4),
> + vaddr5, 1);
> + vaddr6_7 =
> + _mm256_inserti128_si256(_mm256_castsi128_si256(vaddr6),
> + vaddr7, 1);
> + vaddr0_3 =
> + _mm512_inserti64x4(_mm512_castsi256_si512(vaddr0_1),
> + vaddr2_3, 1);
> + vaddr4_7 =
> + _mm512_inserti64x4(_mm512_castsi256_si512(vaddr4_5),
> + vaddr6_7, 1);
> +
> + /* convert pa to dma_addr hdr/data */
> + dma_addr0_3 = _mm512_unpackhi_epi64(vaddr0_3, vaddr0_3);
> + dma_addr4_7 = _mm512_unpackhi_epi64(vaddr4_7, vaddr4_7);
> +
> + /* add headroom to pa values */
> + dma_addr0_3 = _mm512_add_epi64(dma_addr0_3, hdr_room);
> + dma_addr4_7 = _mm512_add_epi64(dma_addr4_7, hdr_room);
> +
> + /* flush desc with pa dma_addr */
> + _mm512_store_si512((__m512i *)&rxdp->read, dma_addr0_3);
> + _mm512_store_si512((__m512i *)&(rxdp + 4)->read, dma_addr4_7);
> + }
> + } else {
> +#endif /* __AVX512VL__*/
> + struct rte_mbuf *mb0, *mb1, *mb2, *mb3;
> + __m256i dma_addr0_1, dma_addr2_3;
> + __m256i hdr_room = _mm256_set1_epi64x(RTE_PKTMBUF_HEADROOM);
> + /* Initialize the mbufs in vector, process 4 mbufs in one loop */
> + for (i = 0; i < n; i += 4, rxep += 4, rxdp += 4) {
> + __m128i vaddr0, vaddr1, vaddr2, vaddr3;
> + __m256i vaddr0_1, vaddr2_3;
> +
> + mb0 = rxep[0].mbuf;
> + mb1 = rxep[1].mbuf;
> + mb2 = rxep[2].mbuf;
> + mb3 = rxep[3].mbuf;
> +
> + /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> + offsetof(struct rte_mbuf, buf_addr) + 8);
> + vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> + vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> + vaddr2 = _mm_loadu_si128((__m128i *)&mb2->buf_addr);
> + vaddr3 = _mm_loadu_si128((__m128i *)&mb3->buf_addr);
> +
> + /**
> + * merge 0 & 1, by casting 0 to 256-bit and inserting 1
> + * into the high lanes. Similarly for 2 & 3
> + */
> + vaddr0_1 = _mm256_inserti128_si256
> + (_mm256_castsi128_si256(vaddr0), vaddr1, 1);
> + vaddr2_3 = _mm256_inserti128_si256
> + (_mm256_castsi128_si256(vaddr2), vaddr3, 1);
> +
> + /* convert pa to dma_addr hdr/data */
> + dma_addr0_1 = _mm256_unpackhi_epi64(vaddr0_1, vaddr0_1);
> + dma_addr2_3 = _mm256_unpackhi_epi64(vaddr2_3, vaddr2_3);
> +
> + /* add headroom to pa values */
> + dma_addr0_1 = _mm256_add_epi64(dma_addr0_1, hdr_room);
> + dma_addr2_3 = _mm256_add_epi64(dma_addr2_3, hdr_room);
> +
> + /* flush desc with pa dma_addr */
> + _mm256_store_si256((__m256i *)&rxdp->read, dma_addr0_1);
> + _mm256_store_si256((__m256i *)&(rxdp + 2)->read, dma_addr2_3);
> + }
> + }
> +
> +#endif
> +
> + /* Update the descriptor initializer index */
> + rxq->rxrearm_start += n;
> + rx_id = rxq->rxrearm_start - 1;
> +
> + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> + rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> + if (!rxq->rxrearm_start)
> + rx_id = rxq->nb_rx_desc - 1;
> + else
> + rx_id = rxq->rxrearm_start - 1;
> + }
> +
> + rxq->rxrearm_nb -= n;
> +
> + /* Update the tail pointer on the NIC */
> + I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
> +}
> #endif /* __AVX2__*/
>
> #endif /*_I40E_RXTX_COMMON_AVX_H_*/
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx2.c b/drivers/net/i40e/i40e_rxtx_vec_avx2.c
> index c73b2a321b..fcb7ba0273 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_avx2.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_avx2.c
> @@ -25,6 +25,12 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
> return i40e_rxq_rearm_common(rxq, false);
> }
>
> +static __rte_always_inline void
> +i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
> +{
> + return i40e_rxq_direct_rearm_common(rxq, false);
> +}
> +
> #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> /* Handles 32B descriptor FDIR ID processing:
> * rxdp: receive descriptor ring, required to load 2nd 16B half of each desc
> @@ -128,8 +134,12 @@ _recv_raw_pkts_vec_avx2(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
> /* See if we need to rearm the RX queue - gives the prefetch a bit
> * of time to act
> */
> - if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
> - i40e_rxq_rearm(rxq);
> + if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
> + if (rxq->direct_rxrearm_enable)
> + i40e_rxq_direct_rearm(rxq);
> + else
> + i40e_rxq_rearm(rxq);
> + }
>
> /* Before we start moving massive data around, check to see if
> * there is actually a packet available
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx512.c b/drivers/net/i40e/i40e_rxtx_vec_avx512.c
> index 2e8a3f0df6..d967095edc 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_avx512.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_avx512.c
> @@ -21,6 +21,12 @@
>
> #define RTE_I40E_DESCS_PER_LOOP_AVX 8
>
> +enum i40e_direct_rearm_type_value {
> + I40E_DIRECT_REARM_TYPE_NORMAL = 0x0,
> + I40E_DIRECT_REARM_TYPE_FAST_FREE = 0x1,
> + I40E_DIRECT_REARM_TYPE_PRE_FREE = 0x2,
> +};
> +
> static __rte_always_inline void
> i40e_rxq_rearm(struct i40e_rx_queue *rxq)
> {
> @@ -150,6 +156,241 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
> I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
> }
>
> +static __rte_always_inline void
> +i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
> +{
> + struct rte_eth_dev *dev;
> + struct i40e_tx_queue *txq;
> + volatile union i40e_rx_desc *rxdp;
> + struct i40e_vec_tx_entry *txep;
> + struct i40e_rx_entry *rxep;
> + struct rte_mbuf *m[RTE_I40E_RXQ_REARM_THRESH];
> + uint16_t tx_port_id, tx_queue_id;
> + uint16_t rx_id;
> + uint16_t i, n;
> + uint16_t j = 0;
> + uint16_t nb_rearm = 0;
> + enum i40e_direct_rearm_type_value type;
> + struct rte_mempool_cache *cache = NULL;
> +
> + rxdp = rxq->rx_ring + rxq->rxrearm_start;
> + rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> + tx_port_id = rxq->direct_rxrearm_port;
> + tx_queue_id = rxq->direct_rxrearm_queue;
> + dev = &rte_eth_devices[tx_port_id];
> + txq = dev->data->tx_queues[tx_queue_id];
> +
> + /* check Rx queue is able to take in the whole
> + * batch of free mbufs from Tx queue
> + */
> + if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
> + /* check DD bits on threshold descriptor */
> + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> + rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
> + goto mempool_bulk;
> + }
> +
> + if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
> + goto mempool_bulk;
> +
> + n = txq->tx_rs_thresh;
> +
> + /* first buffer to free from S/W ring is at index
> + * tx_next_dd - (tx_rs_thresh-1)
> + */
> + txep = (void *)txq->sw_ring;
> + txep += txq->tx_next_dd - (n - 1);
> +
> + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> + /* directly put mbufs from Tx to Rx */
> + uint32_t copied = 0;
> + /* n is multiple of 32 */
> + while (copied < n) {
> + const __m512i a = _mm512_load_si512(&txep[copied]);
> + const __m512i b = _mm512_load_si512(&txep[copied + 8]);
> + const __m512i c = _mm512_load_si512(&txep[copied + 16]);
> + const __m512i d = _mm512_load_si512(&txep[copied + 24]);
> +
> + _mm512_storeu_si512(&rxep[copied], a);
> + _mm512_storeu_si512(&rxep[copied + 8], b);
> + _mm512_storeu_si512(&rxep[copied + 16], c);
> + _mm512_storeu_si512(&rxep[copied + 24], d);
> + copied += 32;
> + }
> + type = I40E_DIRECT_REARM_TYPE_FAST_FREE;
> + } else {
> + for (i = 0; i < n; i++) {
> + m[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> + /* ensure each Tx freed buffer is valid */
> + if (m[i] != NULL)
> + nb_rearm++;
> + }
> +
> + if (nb_rearm != n) {
> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> + if (txq->tx_next_dd >= txq->nb_tx_desc)
> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> +
> + goto mempool_bulk;
> + } else {
> + type = I40E_DIRECT_REARM_TYPE_PRE_FREE;
> + }
> + }
> +
> + /* update counters for Tx */
> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> + if (txq->tx_next_dd >= txq->nb_tx_desc)
> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> + } else {
> +mempool_bulk:
> + cache = rte_mempool_default_cache(rxq->mp, rte_lcore_id());
> +
> + if (unlikely(!cache))
> + return i40e_rxq_rearm_common(rxq, true);
> +
> + n = RTE_I40E_RXQ_REARM_THRESH;
> +
> + /* We need to pull 'n' more MBUFs into the software ring from mempool
> + * We inline the mempool function here, so we can vectorize the copy
> + * from the cache into the shadow ring.
> + */
> +
> + if (cache->len < RTE_I40E_RXQ_REARM_THRESH) {
> + /* No. Backfill the cache first, and then fill from it */
> + uint32_t req = RTE_I40E_RXQ_REARM_THRESH + (cache->size -
> + cache->len);
> +
> + /* How many do we require
> + * i.e. number to fill the cache + the request
> + */
> + int ret = rte_mempool_ops_dequeue_bulk(rxq->mp,
> + &cache->objs[cache->len], req);
> + if (ret == 0) {
> + cache->len += req;
> + } else {
> + if (rxq->rxrearm_nb + RTE_I40E_RXQ_REARM_THRESH >=
> + rxq->nb_rx_desc) {
> + __m128i dma_addr0;
> +
> + dma_addr0 = _mm_setzero_si128();
> + for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
> + rxep[i].mbuf = &rxq->fake_mbuf;
> + _mm_store_si128
> + ((__m128i *)&rxdp[i].read,
> + dma_addr0);
> + }
> + }
> + rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
> + RTE_I40E_RXQ_REARM_THRESH;
> + return;
> + }
> + }
> +
> + type = I40E_DIRECT_REARM_TYPE_NORMAL;
> + }
> +
> + const __m512i iova_offsets = _mm512_set1_epi64
> + (offsetof(struct rte_mbuf, buf_iova));
> + const __m512i headroom = _mm512_set1_epi64(RTE_PKTMBUF_HEADROOM);
> +
> +#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> + /* to shuffle the addresses to correct slots. Values 4-7 will contain
> + * zeros, so use 7 for a zero-value.
> + */
> + const __m512i permute_idx = _mm512_set_epi64(7, 7, 3, 1, 7, 7, 2, 0);
> +#else
> + const __m512i permute_idx = _mm512_set_epi64(7, 3, 6, 2, 5, 1, 4, 0);
> +#endif
> +
> + __m512i mbuf_ptrs;
> +
> + /* Initialize the mbufs in vector, process 8 mbufs in one loop, taking
> + * from mempool cache and populating both shadow and HW rings
> + */
> + for (i = 0; i < RTE_I40E_RXQ_REARM_THRESH / 8; i++) {
> + switch (type) {
> + case I40E_DIRECT_REARM_TYPE_FAST_FREE:
> + mbuf_ptrs = _mm512_loadu_si512(rxep);
> + break;
> + case I40E_DIRECT_REARM_TYPE_PRE_FREE:
> + mbuf_ptrs = _mm512_loadu_si512(&m[j]);
> + _mm512_store_si512(rxep, mbuf_ptrs);
> + j += 8;
> + break;
> + case I40E_DIRECT_REARM_TYPE_NORMAL:
> + mbuf_ptrs = _mm512_loadu_si512
> + (&cache->objs[cache->len - 8]);
> + _mm512_store_si512(rxep, mbuf_ptrs);
> + cache->len -= 8;
> + break;
> + }
> +
> + /* gather iova of mbuf0-7 into one zmm reg */
> + const __m512i iova_base_addrs = _mm512_i64gather_epi64
> + (_mm512_add_epi64(mbuf_ptrs, iova_offsets),
> + 0, /* base */
> + 1 /* scale */);
> + const __m512i iova_addrs = _mm512_add_epi64(iova_base_addrs,
> + headroom);
> +#ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> + const __m512i iovas0 = _mm512_castsi256_si512
> + (_mm512_extracti64x4_epi64(iova_addrs, 0));
> + const __m512i iovas1 = _mm512_castsi256_si512
> + (_mm512_extracti64x4_epi64(iova_addrs, 1));
> +
> + /* permute leaves desc 2-3 addresses in header address slots 0-1
> + * but these are ignored by driver since header split not
> + * enabled. Similarly for desc 4 & 5.
> + */
> + const __m512i desc_rd_0_1 = _mm512_permutexvar_epi64
> + (permute_idx, iovas0);
> + const __m512i desc_rd_2_3 = _mm512_bsrli_epi128(desc_rd_0_1, 8);
> +
> + const __m512i desc_rd_4_5 = _mm512_permutexvar_epi64
> + (permute_idx, iovas1);
> + const __m512i desc_rd_6_7 = _mm512_bsrli_epi128(desc_rd_4_5, 8);
> +
> + _mm512_store_si512((void *)rxdp, desc_rd_0_1);
> + _mm512_store_si512((void *)(rxdp + 2), desc_rd_2_3);
> + _mm512_store_si512((void *)(rxdp + 4), desc_rd_4_5);
> + _mm512_store_si512((void *)(rxdp + 6), desc_rd_6_7);
> +#else
> + /* permute leaves desc 4-7 addresses in header address slots 0-3
> + * but these are ignored by driver since header split not
> + * enabled.
> + */
> + const __m512i desc_rd_0_3 = _mm512_permutexvar_epi64
> + (permute_idx, iova_addrs);
> + const __m512i desc_rd_4_7 = _mm512_bsrli_epi128(desc_rd_0_3, 8);
> +
> + _mm512_store_si512((void *)rxdp, desc_rd_0_3);
> + _mm512_store_si512((void *)(rxdp + 4), desc_rd_4_7);
> +#endif
> + rxdp += 8, rxep += 8;
> + }
> +
> + /* Update the descriptor initializer index */
> + rxq->rxrearm_start += n;
> + rx_id = rxq->rxrearm_start - 1;
> +
> + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> + rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> + if (!rxq->rxrearm_start)
> + rx_id = rxq->nb_rx_desc - 1;
> + else
> + rx_id = rxq->rxrearm_start - 1;
> + }
> +
> + rxq->rxrearm_nb -= n;
> +
> + /* Update the tail pointer on the NIC */
> + I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
> +}
> +
> #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> /* Handles 32B descriptor FDIR ID processing:
> * rxdp: receive descriptor ring, required to load 2nd 16B half of each desc
> @@ -252,8 +493,12 @@ _recv_raw_pkts_vec_avx512(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
> /* See if we need to rearm the RX queue - gives the prefetch a bit
> * of time to act
> */
> - if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
> - i40e_rxq_rearm(rxq);
> + if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
> + if (rxq->direct_rxrearm_enable)
> + i40e_rxq_direct_rearm(rxq);
> + else
> + i40e_rxq_rearm(rxq);
> + }
>
> /* Before we start moving massive data around, check to see if
> * there is actually a packet available
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> index fa9e6582c5..dc78e3c90b 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> @@ -77,6 +77,139 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
> I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
> }
>
> +static inline void
> +i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
> +{
> + struct rte_eth_dev *dev;
> + struct i40e_tx_queue *txq;
> + volatile union i40e_rx_desc *rxdp;
> + struct i40e_tx_entry *txep;
> + struct i40e_rx_entry *rxep;
> + uint16_t tx_port_id, tx_queue_id;
> + uint16_t rx_id;
> + struct rte_mbuf *mb0, *mb1, *m;
> + uint64x2_t dma_addr0, dma_addr1;
> + uint64x2_t zero = vdupq_n_u64(0);
> + uint64_t paddr;
> + uint16_t i, n;
> + uint16_t nb_rearm = 0;
> +
> + rxdp = rxq->rx_ring + rxq->rxrearm_start;
> + rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> + tx_port_id = rxq->direct_rxrearm_port;
> + tx_queue_id = rxq->direct_rxrearm_queue;
> + dev = &rte_eth_devices[tx_port_id];
> + txq = dev->data->tx_queues[tx_queue_id];
> +
> + /* check Rx queue is able to take in the whole
> + * batch of free mbufs from Tx queue
> + */
> + if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
> + /* check DD bits on threshold descriptor */
> + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> + rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
> + goto mempool_bulk;
> + }
> +
> + n = txq->tx_rs_thresh;
> +
> + /* first buffer to free from S/W ring is at index
> + * tx_next_dd - (tx_rs_thresh-1)
> + */
> + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> +
> + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> + /* directly put mbufs from Tx to Rx,
> + * and initialize the mbufs in vector
> + */
> + for (i = 0; i < n; i++, rxep++, txep++) {
> + rxep[0].mbuf = txep[0].mbuf;
> +
> + /* Initialize rxdp descs */
> + mb0 = txep[0].mbuf;
> +
> + paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
> + dma_addr0 = vdupq_n_u64(paddr);
> + /* flush desc with pa dma_addr */
> + vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
> + }
> + } else {
> + for (i = 0; i < n; i++) {
> + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> + if (m != NULL) {
> + rxep[i].mbuf = m;
> +
> + /* Initialize rxdp descs */
> + paddr = m->buf_iova + RTE_PKTMBUF_HEADROOM;
> + dma_addr0 = vdupq_n_u64(paddr);
> + /* flush desc with pa dma_addr */
> + vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
> + nb_rearm++;
> + }
> + }
> + n = nb_rearm;
> + }
> +
> + /* update counters for Tx */
> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> + if (txq->tx_next_dd >= txq->nb_tx_desc)
> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> + } else {
> +mempool_bulk:
> + /* if TX did not free bufs into Rx sw-ring,
> + * get new bufs from mempool
> + */
> + n = RTE_I40E_RXQ_REARM_THRESH;
> + if (unlikely(rte_mempool_get_bulk(rxq->mp, (void *)rxep, n) < 0)) {
> + if (rxq->rxrearm_nb + n >= rxq->nb_rx_desc) {
> + for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
> + rxep[i].mbuf = &rxq->fake_mbuf;
> + vst1q_u64((uint64_t *)&rxdp[i].read, zero);
> + }
> + }
> + rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed += n;
> + return;
> + }
> +
> + /* Initialize the mbufs in vector, process 2 mbufs in one loop */
> + for (i = 0; i < n; i += 2, rxep += 2) {
> + mb0 = rxep[0].mbuf;
> + mb1 = rxep[1].mbuf;
> +
> + paddr = mb0->buf_iova + RTE_PKTMBUF_HEADROOM;
> + dma_addr0 = vdupq_n_u64(paddr);
> + /* flush desc with pa dma_addr */
> + vst1q_u64((uint64_t *)&rxdp++->read, dma_addr0);
> +
> + paddr = mb1->buf_iova + RTE_PKTMBUF_HEADROOM;
> + dma_addr1 = vdupq_n_u64(paddr);
> + /* flush desc with pa dma_addr */
> + vst1q_u64((uint64_t *)&rxdp++->read, dma_addr1);
> + }
> + }
> +
> + /* Update the descriptor initializer index */
> + rxq->rxrearm_start += n;
> + rx_id = rxq->rxrearm_start - 1;
> +
> + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> + rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> + if (!rxq->rxrearm_start)
> + rx_id = rxq->nb_rx_desc - 1;
> + else
> + rx_id = rxq->rxrearm_start - 1;
> + }
> +
> + rxq->rxrearm_nb -= n;
> +
> + rte_io_wmb();
> + /* Update the tail pointer on the NIC */
> + I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
> +}
> +
> #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> /* NEON version of FDIR mark extraction for 4 32B descriptors at a time */
> static inline uint32x4_t
> @@ -381,8 +514,12 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
> /* See if we need to rearm the RX queue - gives the prefetch a bit
> * of time to act
> */
> - if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
> - i40e_rxq_rearm(rxq);
> + if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
> + if (rxq->direct_rxrearm_enable)
> + i40e_rxq_direct_rearm(rxq);
> + else
> + i40e_rxq_rearm(rxq);
> + }
>
> /* Before we start moving massive data around, check to see if
> * there is actually a packet available
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_sse.c b/drivers/net/i40e/i40e_rxtx_vec_sse.c
> index 3782e8052f..b2f1ab2c8d 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_sse.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_sse.c
> @@ -89,6 +89,168 @@ i40e_rxq_rearm(struct i40e_rx_queue *rxq)
> I40E_PCI_REG_WC_WRITE(rxq->qrx_tail, rx_id);
> }
>
> +static inline void
> +i40e_rxq_direct_rearm(struct i40e_rx_queue *rxq)
> +{
> + struct rte_eth_dev *dev;
> + struct i40e_tx_queue *txq;
> + volatile union i40e_rx_desc *rxdp;
> + struct i40e_tx_entry *txep;
> + struct i40e_rx_entry *rxep;
> + uint16_t tx_port_id, tx_queue_id;
> + uint16_t rx_id;
> + struct rte_mbuf *mb0, *mb1, *m;
> + __m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
> + RTE_PKTMBUF_HEADROOM);
> + __m128i dma_addr0, dma_addr1;
> + __m128i vaddr0, vaddr1;
> + uint16_t i, n;
> + uint16_t nb_rearm = 0;
> +
> + rxdp = rxq->rx_ring + rxq->rxrearm_start;
> + rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> + tx_port_id = rxq->direct_rxrearm_port;
> + tx_queue_id = rxq->direct_rxrearm_queue;
> + dev = &rte_eth_devices[tx_port_id];
> + txq = dev->data->tx_queues[tx_queue_id];
> +
> + /* check Rx queue is able to take in the whole
> + * batch of free mbufs from Tx queue
> + */
> + if (rxq->rxrearm_nb > txq->tx_rs_thresh) {
> + /* check DD bits on threshold descriptor */
> + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> + rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
> + goto mempool_bulk;
> + }
> +
> + n = txq->tx_rs_thresh;
> +
> + /* first buffer to free from S/W ring is at index
> + * tx_next_dd - (tx_rs_thresh-1)
> + */
> + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> +
> + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> + /* directly put mbufs from Tx to Rx,
> + * and initialize the mbufs in vector
> + */
> + for (i = 0; i < n; i++, rxep++, txep++) {
> + rxep[0].mbuf = txep[0].mbuf;
> +
> + /* Initialize rxdp descs */
> + mb0 = txep[0].mbuf;
> +
> + /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> + offsetof(struct rte_mbuf, buf_addr) + 8);
> + vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> +
> + /* convert pa to dma_addr hdr/data */
> + dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> +
> + /* add headroom to pa values */
> + dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> +
> + /* flush desc with pa dma_addr */
> + _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> + }
> + } else {
> + for (i = 0; i < n; i++) {
> + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> + if (m != NULL) {
> + rxep[i].mbuf = m;
> +
> + /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> + offsetof(struct rte_mbuf, buf_addr) + 8);
> + vaddr0 = _mm_loadu_si128((__m128i *)&m->buf_addr);
> +
> + /* convert pa to dma_addr hdr/data */
> + dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> +
> + /* add headroom to pa values */
> + dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> +
> + /* flush desc with pa dma_addr */
> + _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> + nb_rearm++;
> + }
> + }
> + n = nb_rearm;
> + }
> +
> + /* update counters for Tx */
> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> + if (txq->tx_next_dd >= txq->nb_tx_desc)
> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> + } else {
> +mempool_bulk:
> + /* if TX did not free bufs into Rx sw-ring,
> + * get new bufs from mempool
> + */
> + n = RTE_I40E_RXQ_REARM_THRESH;
> + /* Pull 'n' more MBUFs into the software ring */
> + if (rte_mempool_get_bulk(rxq->mp, (void *)rxep, n) < 0) {
> + if (rxq->rxrearm_nb + n >= rxq->nb_rx_desc) {
> + dma_addr0 = _mm_setzero_si128();
> + for (i = 0; i < RTE_I40E_DESCS_PER_LOOP; i++) {
> + rxep[i].mbuf = &rxq->fake_mbuf;
> + _mm_store_si128((__m128i *)&rxdp[i].read,
> + dma_addr0);
> + }
> + }
> + rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed +=
> + RTE_I40E_RXQ_REARM_THRESH;
> + return;
> + }
> +
> + /* Initialize the mbufs in vector, process 2 mbufs in one loop */
> + for (i = 0; i < RTE_I40E_RXQ_REARM_THRESH; i += 2, rxep += 2) {
> + mb0 = rxep[0].mbuf;
> + mb1 = rxep[1].mbuf;
> +
> + /* load buf_addr(lo 64bit) and buf_iova(hi 64bit) */
> + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, buf_iova) !=
> + offsetof(struct rte_mbuf, buf_addr) + 8);
> + vaddr0 = _mm_loadu_si128((__m128i *)&mb0->buf_addr);
> + vaddr1 = _mm_loadu_si128((__m128i *)&mb1->buf_addr);
> +
> + /* convert pa to dma_addr hdr/data */
> + dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
> + dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
> +
> + /* add headroom to pa values */
> + dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
> + dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
> +
> + /* flush desc with pa dma_addr */
> + _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
> + _mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
> + }
> + }
> +
> + /* Update the descriptor initializer index */
> + rxq->rxrearm_start += n;
> + rx_id = rxq->rxrearm_start - 1;
> +
> + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> + rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> + if (!rxq->rxrearm_start)
> + rx_id = rxq->nb_rx_desc - 1;
> + else
> + rx_id = rxq->rxrearm_start - 1;
> + }
> +
> + rxq->rxrearm_nb -= n;
> +
> + /* Update the tail pointer on the NIC */
> + I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
> +}
> +
> #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC
> /* SSE version of FDIR mark extraction for 4 32B descriptors at a time */
> static inline __m128i
> @@ -394,8 +556,12 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
> /* See if we need to rearm the RX queue - gives the prefetch a bit
> * of time to act
> */
> - if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH)
> - i40e_rxq_rearm(rxq);
> + if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) {
> + if (rxq->direct_rxrearm_enable)
> + i40e_rxq_direct_rearm(rxq);
> + else
> + i40e_rxq_rearm(rxq);
> + }
>
> /* Before we start moving massive data around, check to see if
> * there is actually a packet available
* Re: [PATCH v1 4/5] net/i40e: add direct rearm mode internal API
2022-04-20 8:16 ` [PATCH v1 4/5] net/i40e: add direct rearm mode internal API Feifei Wang
@ 2022-05-11 22:31 ` Konstantin Ananyev
0 siblings, 0 replies; 145+ messages in thread
From: Konstantin Ananyev @ 2022-05-11 22:31 UTC (permalink / raw)
To: Feifei Wang, Beilei Xing; +Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang
20/04/2022 09:16, Feifei Wang wrote:
> For direct rearm mode, add two internal functions.
>
> One is to enable direct rearm mode in Rx queue.
>
> The other is to map Tx queue with Rx queue to make Rx queue take
> buffers from the specific Tx queue.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> drivers/net/i40e/i40e_ethdev.c | 34 ++++++++++++++++++++++++++++++++++
> 1 file changed, 34 insertions(+)
>
> diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
> index 755786dc10..9e1a523bcc 100644
> --- a/drivers/net/i40e/i40e_ethdev.c
> +++ b/drivers/net/i40e/i40e_ethdev.c
> @@ -369,6 +369,13 @@ static int i40e_dev_rx_queue_intr_enable(struct rte_eth_dev *dev,
> static int i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev,
> uint16_t queue_id);
>
> +static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
> + uint16_t queue_id);
> +static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
> + uint16_t rx_queue_id,
> + uint16_t tx_port_id,
> + uint16_t tx_queue_id);
> +
> static int i40e_get_regs(struct rte_eth_dev *dev,
> struct rte_dev_reg_info *regs);
>
> @@ -477,6 +484,8 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
> .rx_queue_setup = i40e_dev_rx_queue_setup,
> .rx_queue_intr_enable = i40e_dev_rx_queue_intr_enable,
> .rx_queue_intr_disable = i40e_dev_rx_queue_intr_disable,
> + .rx_queue_direct_rearm_enable = i40e_dev_rx_queue_direct_rearm_enable,
> + .rx_queue_direct_rearm_map = i40e_dev_rx_queue_direct_rearm_map,
> .rx_queue_release = i40e_dev_rx_queue_release,
> .tx_queue_setup = i40e_dev_tx_queue_setup,
> .tx_queue_release = i40e_dev_tx_queue_release,
> @@ -11108,6 +11117,31 @@ i40e_dev_rx_queue_intr_disable(struct rte_eth_dev *dev, uint16_t queue_id)
> return 0;
> }
>
> +static int i40e_dev_rx_queue_direct_rearm_enable(struct rte_eth_dev *dev,
> + uint16_t queue_id)
> +{
> + struct i40e_rx_queue *rxq;
> +
> + rxq = dev->data->rx_queues[queue_id];
> + rxq->direct_rxrearm_enable = 1;
> +
> + return 0;
> +}
> +
> +static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
> + uint16_t rx_queue_id, uint16_t tx_port_id,
> + uint16_t tx_queue_id)
> +{
> + struct i40e_rx_queue *rxq;
> +
> + rxq = dev->data->rx_queues[rx_queue_id];
> +
> + rxq->direct_rxrearm_port = tx_port_id;
> + rxq->direct_rxrearm_queue = tx_queue_id;
I don't think this function should enable that mode blindly.
Instead, it needs to check first that all pre-conditions are met
(tx/rx threshold values are equal, etc.).
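E.g. a rough sketch of what I would expect here (the exact set of
pre-conditions probably needs more thought - the "same mempool for RX and TX
buffers" constraint from the cover letter is another candidate):

static int i40e_dev_rx_queue_direct_rearm_map(struct rte_eth_dev *dev,
		uint16_t rx_queue_id, uint16_t tx_port_id,
		uint16_t tx_queue_id)
{
	struct i40e_rx_queue *rxq = dev->data->rx_queues[rx_queue_id];
	struct rte_eth_dev *tx_dev = &rte_eth_devices[tx_port_id];
	struct i40e_tx_queue *txq = tx_dev->data->tx_queues[tx_queue_id];

	if (rxq == NULL || txq == NULL)
		return -EINVAL;

	/* free/rearm batch sizes have to match, otherwise tx_next_dd
	 * cannot be maintained correctly in direct-rearm mode
	 */
	if (txq->tx_rs_thresh != RTE_I40E_RXQ_REARM_THRESH)
		return -ENOTSUP;

	rxq->direct_rxrearm_port = tx_port_id;
	rxq->direct_rxrearm_queue = tx_queue_id;

	return 0;
}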
> +
> + return 0;
> +}
> +
> /**
> * This function is used to check if the register is valid.
> * Below is the valid registers list for X722 only:
* Re: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-04-20 8:16 ` [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode Feifei Wang
2022-04-20 10:10 ` Morten Brørup
@ 2022-05-11 22:33 ` Konstantin Ananyev
2022-05-27 11:28 ` Konstantin Ananyev
1 sibling, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2022-05-11 22:33 UTC (permalink / raw)
To: dev
20/04/2022 09:16, Feifei Wang wrote:
> Enable direct rearm mode. The mapping is decided in the data plane based
> on the first packet received.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> index bec22c44cd..38ffdf4636 100644
> --- a/examples/l3fwd/l3fwd_lpm.c
> +++ b/examples/l3fwd/l3fwd_lpm.c
> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
> unsigned lcore_id;
> uint64_t prev_tsc, diff_tsc, cur_tsc;
> int i, nb_rx;
> - uint16_t portid;
> + uint16_t portid, tx_portid;
> uint8_t queueid;
> struct lcore_conf *qconf;
> const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
>
> const uint16_t n_rx_q = qconf->n_rx_queue;
> const uint16_t n_tx_p = qconf->n_tx_port;
> + int direct_rearm_map[n_rx_q];
> +
> if (n_rx_q == 0) {
> RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n", lcore_id);
> return 0;
> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
>
> portid = qconf->rx_queue_list[i].port_id;
> queueid = qconf->rx_queue_list[i].queue_id;
> + direct_rearm_map[i] = 0;
> RTE_LOG(INFO, L3FWD,
> " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
> lcore_id, portid, queueid);
> @@ -209,6 +212,17 @@ lpm_main_loop(__rte_unused void *dummy)
> if (nb_rx == 0)
> continue;
>
> + /* Determine the direct rearm mapping based on the first
> + * packet received on the rx queue
> + */
> + if (direct_rearm_map[i] == 0) {
> + tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
> + portid);
> + rte_eth_direct_rxrearm_map(portid, queueid,
> + tx_portid, queueid);
> + direct_rearm_map[i] = 1;
> + }
> +
That just doesn't look right to me: why make the decision based on the
first packet?
What would happen if the second and all the other packets have to be routed
to different ports?
In fact, this direct-rearm mode seems suitable only for hard-coded
one-to-one mapped forwarding (examples/l2fwd, testpmd).
For l3fwd it can be used safely only when we have one port in use.
Also, I think it should be selected at init-time and
it shouldn't be on by default.
To summarize, my opinion:
a special cmd-line parameter to enable it,
allowed only when l3fwd runs over one port.
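Roughly what I have in mind (sketch only; "enable_direct_rearm" would be a new
l3fwd command-line option, and the call would be done per RX queue during
setup rather than in the main loop):

	if (enable_direct_rearm) {
		if (rte_eth_dev_count_avail() != 1)
			rte_exit(EXIT_FAILURE,
				"direct rearm requires a single port in use\n");
		/* fixed 1:1 port/queue mapping, known before any traffic */
		rte_eth_direct_rxrearm_map(portid, queueid, portid, queueid);
	}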
> #if defined RTE_ARCH_X86 || defined __ARM_NEON \
> || defined RTE_ARCH_PPC_64
> l3fwd_lpm_send_packets(nb_rx, pkts_burst,
* Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
` (4 preceding siblings ...)
2022-04-20 8:16 ` [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode Feifei Wang
@ 2022-05-11 23:00 ` Konstantin Ananyev
[not found] ` <20220516061012.618787-1-feifei.wang2@arm.com>
` (7 subsequent siblings)
13 siblings, 0 replies; 145+ messages in thread
From: Konstantin Ananyev @ 2022-05-11 23:00 UTC (permalink / raw)
To: Feifei Wang; +Cc: dev, nd
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
>
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into it's software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache.
>
> However, this solution poses several constraints:
>
> 1)The receive queue needs to know which transmit queue it should take
> the buffers from. The application logic decides which transmit port to
> use to send out the packets. In many use cases the NIC might have a
> single port ([1], [2], [3]), in which case a given transmit queue is
> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> is easy to configure.
>
> If the NIC has 2 ports (there are several references), then we will have
> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> However, if this is generalized to 'N' ports, the configuration can be
> long. More over the PMD would have to scan a list of transmit queues to
> pull the buffers from.
Just to re-iterate some generic concerns about this proposal:
- We effectively link RX and TX queues - when this feature is enabled,
the user can't stop a TX queue without stopping the linked RX queue first.
Right now the user is free to start/stop any queues at will.
If that feature allows linking queues from different ports,
then even the ports will become dependent and the user will have to take
extra care when managing such ports.
- Very limited usage scenario - it will have a positive effect only
when we have a fixed forwarding mapping: all (or nearly all) packets
from the RX queue are forwarded into the same TX queue.
I wonder, did you have a chance to consider a mempool-cache ZC API,
similar to the one we have for the ring?
It would allow us on the TX free path to avoid copying mbufs to a
temporary array on the stack.
Instead we could put them straight from the TX SW ring into the mempool cache.
That should save an extra store/load per mbuf and might help to achieve
some performance gain without by-passing the mempool.
It probably wouldn't be as fast as what you are proposing,
but it might be fast enough to consider as an alternative.
Again, it would be a generic solution, so we can avoid all
these implications and limitations.
> 2)The other factor that needs to be considered is 'run-to-completion' vs
> 'pipeline' models. In the run-to-completion model, the receive side and
> the transmit side are running on the same lcore serially. In the pipeline
> model. The receive side and transmit side might be running on different
> lcores in parallel. This requires locking. This is not supported at this
> point.
>
> 3)Tx and Rx buffers must be from the same mempool. And we also must
> ensure Tx buffer free number is equal to Rx buffer free number:
> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
> Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> is due to tx_next_dd is a variable to compute tx sw-ring free location.
> Its value will be one more round than the position where next time free
> starts.
>
> Current status in this RFC:
> 1)An API is added to allow for mapping a TX queue to a RX queue.
> Currently it supports 1:1 mapping.
> 2)The i40e driver is changed to do the direct re-arm of the receive
> side.
> 3)L3fwd application is modified to do the direct rearm mapping
> automatically without user config. This follows the rules that the
> thread can map TX queue to a RX queue based on the first received
> package destination port.
>
> Testing status:
> 1.The testing results for L3fwd are as follows:
> -------------------------------------------------------------------
> enabled direct rearm
> -------------------------------------------------------------------
> Arm:
> N1SDP(neon path):
> without fast-free mode with fast-free mode
> +14.1% +7.0%
>
> Ampere Altra(neon path):
> without fast-free mode with fast-free mode
> +17.1 +14.0%
>
> X86:
> Dell-8268(limit frequency):
> sse path:
> without fast-free mode with fast-free mode
> +6.96% +2.02%
> avx2 path:
> without fast-free mode with fast-free mode
> +9.04% +7.75%
> avx512 path:
> without fast-free mode with fast-free mode
> +5.43% +1.57%
> -------------------------------------------------------------------
> This patch can not affect base performance of normal mode.
> Furthermore, the reason for that limiting the CPU frequency is
> that dell-8268 can encounter i40e NIC bottleneck with maximum
> frequency.
>
> 2.The testing results for VPP-L3fwd are as follows:
> -------------------------------------------------------------------
> Arm:
> N1SDP(neon path):
> with direct re-arm mode enabled
> +7.0%
> -------------------------------------------------------------------
> For Ampere Altra and X86,VPP-L3fwd test has not been done.
>
> Reference:
> [1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
> [3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g
>
> Feifei Wang (5):
> net/i40e: remove redundant Dtype initialization
> net/i40e: enable direct rearm mode
> ethdev: add API for direct rearm mode
> net/i40e: add direct rearm mode internal API
> examples/l3fwd: enable direct rearm mode
>
> drivers/net/i40e/i40e_ethdev.c | 34 +++
> drivers/net/i40e/i40e_rxtx.c | 4 -
> drivers/net/i40e/i40e_rxtx.h | 4 +
> drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
> drivers/net/i40e/i40e_rxtx_vec_avx2.c | 14 +-
> drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
> drivers/net/i40e/i40e_rxtx_vec_neon.c | 141 ++++++++++++-
> drivers/net/i40e/i40e_rxtx_vec_sse.c | 170 ++++++++++++++-
> examples/l3fwd/l3fwd_lpm.c | 16 +-
> lib/ethdev/ethdev_driver.h | 15 ++
> lib/ethdev/rte_ethdev.c | 14 ++
> lib/ethdev/rte_ethdev.h | 31 +++
> lib/ethdev/version.map | 1 +
> 13 files changed, 949 insertions(+), 13 deletions(-)
>
* Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
[not found] ` <20220516061012.618787-1-feifei.wang2@arm.com>
@ 2022-05-24 1:25 ` Konstantin Ananyev
2022-05-24 12:40 ` Morten Brørup
` (2 more replies)
0 siblings, 3 replies; 145+ messages in thread
From: Konstantin Ananyev @ 2022-05-24 1:25 UTC (permalink / raw)
To: Feifei Wang; +Cc: nd, dev, ruifeng.wang, honnappa.nagarahalli
16/05/2022 07:10, Feifei Wang wrote:
>
>>> Currently, the transmit side frees the buffers into the lcore cache and
>>> the receive side allocates buffers from the lcore cache. The transmit
>>> side typically frees 32 buffers resulting in 32*8=256B of stores to
>>> lcore cache. The receive side allocates 32 buffers and stores them in
>>> the receive side software ring, resulting in 32*8=256B of stores and
>>> 256B of load from the lcore cache.
>>>
>>> This patch proposes a mechanism to avoid freeing to/allocating from
>>> the lcore cache. i.e. the receive side will free the buffers from
>>> transmit side directly into it's software ring. This will avoid the 256B
>>> of loads and stores introduced by the lcore cache. It also frees up the
>>> cache lines used by the lcore cache.
>>>
>>> However, this solution poses several constraints:
>>>
>>> 1)The receive queue needs to know which transmit queue it should take
>>> the buffers from. The application logic decides which transmit port to
>>> use to send out the packets. In many use cases the NIC might have a
>>> single port ([1], [2], [3]), in which case a given transmit queue is
>>> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
>>> is easy to configure.
>>>
>>> If the NIC has 2 ports (there are several references), then we will have
>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
>>> However, if this is generalized to 'N' ports, the configuration can be
>>> long. More over the PMD would have to scan a list of transmit queues to
>>> pull the buffers from.
>
>> Just to re-iterate some generic concerns about this proposal:
>> - We effectively link RX and TX queues - when this feature is enabled,
>> user can't stop TX queue without stopping linked RX queue first.
>> Right now user is free to start/stop any queues at his will.
>> If that feature will allow to link queues from different ports,
>> then even ports will become dependent and user will have to pay extra
>> care when managing such ports.
>
> [Feifei] When direct rearm is enabled, there are two paths for the thread to
> choose. If there are enough Tx freed buffers, Rx can take buffers from Tx.
> Otherwise, Rx will take buffers from the mempool as usual. Thus, users do not
> need to pay much attention to managing ports.
What I am talking about: right now different ports, or different queues of
the same port, can be treated as independent entities:
in general the user is free to start/stop (and even reconfigure in some
cases) one entity without the need to stop the other.
I.e. the user can stop and re-configure a TX queue while continuing to
receive packets from the RX queue.
With direct re-arm enabled, I think that wouldn't be possible any more:
before stopping/reconfiguring a TX queue the user would have to make sure that
the corresponding RX queue wouldn't be used by the datapath.
>
>> - very limited usage scenario - it will have a positive effect only
>> when we have a fixed forwarding mapping: all (or nearly all) packets
>> from the RX queue are forwarded into the same TX queue.
>
> [Feifei] Although the usage scenario is limited, this usage scenario has a
> wide range of applications, such as a NIC with one port.
Yes, there are NICs with one port, but there is no guarantee there won't be
several such NICs within the system.
> Furthermore, I think this is a tradeoff between performance and flexibility.
> Our goal is to achieve the best performance; this means we need to give up
> some flexibility decisively. Take 'FAST_FREE mode' as an example: it deletes
> most of the buffer checks (refcnt > 1, external buffer, chain buffer),
> chooses the shortest path, and then achieves a significant performance
> improvement.
>> Wonder did you had a chance to consider mempool-cache ZC API,
>> similar to one we have for the ring?
>> It would allow us on TX free path to avoid copying mbufs to
>> temporary array on the stack.
>> Instead we can put them straight from TX SW ring to the mempool cache.
>> That should save extra store/load for mbuf and might help to achieve
>> some performance gain without by-passing mempool.
>> It probably wouldn't be as fast as what you proposing,
>> but might be fast enough to consider as alternative.
>> Again, it would be a generic one, so we can avoid all
>> these implications and limitations.
>
> [Feifei] I think this is a good try. However, the most important thing
> is whether we can bypass the mempool decisively to pursue the
> significant performance gains.
I understand the intention, and I personally think this is a wrong
and dangerous attitude.
We have the mempool abstraction in place for a very good reason.
So we need to try to improve mempool performance (and the API if necessary)
in the first place, not to avoid it and break our own rules and recommendations.
> For ZC, there may be a problem with it in i40e. The reason for putting Tx
> buffers into a temporary array is that i40e_tx_entry includes both the
> buffer pointer and an index.
> Thus we cannot put a Tx SW-ring entry into the mempool directly; we need to
> first extract the mbuf pointer. Finally, though we use ZC, we still can't
> avoid using a temporary stack to extract the Tx buffer pointers.
When talking about ZC API for mempool cache I meant something like:
void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
uint32_t *nb_elem, uint32_t flags);
void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc, uint32_t
nb_elem);
i.e. _start_ will return to the user a pointer inside the mp-cache where to
put the free elems, and the max number of slots that can be safely filled.
_finish_ will update mc->len.
As an example:
/* expect to free N mbufs */
uint32_t n = N;
void **p = mempool_cache_put_zc_start(mc, &n, ...);
/* free up to n elems */
for (i = 0; i != n; i++) {
/* get next free mbuf from somewhere */
mb = extract_and_prefree_mbuf(...);
/* no more free mbufs for now */
if (mb == NULL)
break;
p[i] = mb;
}
/* finalize ZC put, with _i_ freed elems */
mempool_cache_put_zc_finish(mc, i);
That way, I think we can overcome the issue with i40e_tx_entry
you mentioned above. Plus it might be useful in other similar places.
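To give a better idea, a very rough sketch of how these two could look
(flags ignored; I pass the mempool into _finish_ here just to be able to
flush, and assume cache->len < flushthresh on entry, as the regular put path
guarantees):

static inline void **
rte_mempool_cache_put_zc_start(struct rte_mempool_cache *cache,
		uint32_t *nb_elem)
{
	/* offer only as many slots as fit below the flush threshold */
	if (cache->len + *nb_elem > cache->flushthresh)
		*nb_elem = cache->flushthresh - cache->len;
	return &cache->objs[cache->len];
}

static inline void
rte_mempool_cache_put_zc_finish(struct rte_mempool *mp,
		struct rte_mempool_cache *cache, uint32_t nb_elem)
{
	cache->len += nb_elem;
	/* flush the excess back to the backing pool, as the normal put does */
	if (cache->len >= cache->flushthresh) {
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
				cache->len - cache->size);
		cache->len = cache->size;
	}
}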
Another alternative is obviously to split i40e_tx_entry into two structs
(one for mbuf, second for its metadata) and have a separate array for
each of them.
Though with that approach we need to make sure no perf drops will be
introduced, plus probably more code changes will be required.
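For reference, the split I am thinking of would be roughly (sketch; today's
i40e_tx_entry keeps the mbuf pointer and the next/last ids in one struct):

/* metadata only, no mbuf pointer */
struct i40e_tx_entry_meta {
	uint16_t next_id;
	uint16_t last_id;
};

struct i40e_tx_sw_ring_split {
	/* two parallel arrays instead of one array of i40e_tx_entry, so the
	 * mbuf pointers can be handed to the mempool (or to the RX side)
	 * as a plain rte_mbuf *[] without any copying
	 */
	struct rte_mbuf **mbufs;		/* one slot per TX descriptor */
	struct i40e_tx_entry_meta *meta;	/* one slot per TX descriptor */
};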
* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-05-24 1:25 ` Konstantin Ananyev
@ 2022-05-24 12:40 ` Morten Brørup
2022-05-24 20:14 ` Honnappa Nagarahalli
2022-06-13 5:55 ` 回复: " Feifei Wang
2 siblings, 0 replies; 145+ messages in thread
From: Morten Brørup @ 2022-05-24 12:40 UTC (permalink / raw)
To: Konstantin Ananyev, Feifei Wang
Cc: nd, dev, ruifeng.wang, honnappa.nagarahalli
> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> Sent: Tuesday, 24 May 2022 03.26
> > Furtrhermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve best performance, this means we need to give
> up some
> > flexibility decisively. For example of 'FAST_FREE Mode', it deletes
> most
> > of the buffer check (refcnt > 1, external buffer, chain buffer),
> chooses a
> > shorest path, and then achieve significant performance improvement.
> >> Wonder did you had a chance to consider mempool-cache ZC API,
> >> similar to one we have for the ring?
> >> It would allow us on TX free path to avoid copying mbufs to
> >> temporary array on the stack.
> >> Instead we can put them straight from TX SW ring to the mempool
> cache.
> >> That should save extra store/load for mbuf and might help to achieve
> >> some performance gain without by-passing mempool.
> >> It probably wouldn't be as fast as what you proposing,
> >> but might be fast enough to consider as alternative.
> >> Again, it would be a generic one, so we can avoid all
> >> these implications and limitations.
> >
> > [Feifei] I think this is a good try. However, the most important
> thing
> > is that if we can bypass the mempool decisively to pursue the
> > significant performance gains.
>
> I understand the intention, and I personally think this is wrong
> and dangerous attitude.
> We have mempool abstraction in place for very good reason.
Yes, but the abstraction is being violated grossly elsewhere, and mempool code is copy-pasted elsewhere too.
A good example of the current situation is [1]. The cache multiplier (a definition private to the mempool library) is required for some copy-pasted code, and the solution is to expose the private definition and make it part of the public API.
[1] http://inbox.dpdk.org/dev/DM4PR12MB53893BF4C7861068FE8A943BDFFE9@DM4PR12MB5389.namprd12.prod.outlook.com/
The game of abstraction has already been lost. Performance won. :-(
Since we allow bypassing the mbuf/mempool library for other features, it should be allowed for this feature too.
I would even say: Why are the drivers using the mempool library, and not the mbuf library, when freeing and allocating mbufs? This behavior bypasses all the debug assertions in the mbuf library.
As you can probably see, I'm certainly not happy about the abstraction violations in DPDK. But they have been allowed for similar features, so they should be allowed here too.
> So we need to try to improve mempool performance (and API if necessary)
> at first place, not to avoid it and break our own rules and
> recommendations.
>
>
> > For ZC, there maybe a problem for it in i40e. The reason for that put
> Tx
> > buffers
> > into temporary is that i40e_tx_entry includes buffer pointer and
> index.
> > Thus we cannot put Tx SW_ring entry into mempool directly, we need to
> > firstlt extract mbuf pointer. Finally, though we use ZC, we still
> can't
> > avoid
> > using a temporary stack to extract Tx buffer pointers.
>
> When talking about ZC API for mempool cache I meant something like:
> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> uint32_t *nb_elem, uint32_t flags);
> void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc, uint32_t
> nb_elem);
> i.e. _start_ will return user a pointer inside mp-cache where to put
> free elems and max number of slots that can be safely filled.
> _finish_ will update mc->len.
> As an example:
>
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
>
> /* free up to n elems */
> for (i = 0; i != n; i++) {
>
> /* get next free mbuf from somewhere */
> mb = extract_and_prefree_mbuf(...);
>
> /* no more free mbufs for now */
> if (mb == NULL)
> break;
>
> p[i] = mb;
> }
>
> /* finalize ZC put, with _i_ freed elems */
> mempool_cache_put_zc_finish(mc, i);
>
> That way, I think we can overcome the issue with i40e_tx_entry
> you mentioned above. Plus it might be useful in other similar places.
Great example. This would fit perfectly into the i40e driver, if it didn't already implement the exact same thing by accessing the mempool cache structure directly. :-(
BTW: I tried patching the mempool library to fix some asymmetry bugs [2], but couldn't get any ACKs for it. It seems to me that the community is too risk averse to dare to modify such a core library.
[2] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86FBB@smartserver.smartshare.dk/
However, adding your mempool ZC feature is not a modification, but an addition, so it should be able to gather support.
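For reference, a rough sketch of how a Tx free routine could use such a ZC put API once it exists. mempool_cache_put_zc_start()/_finish() are hypothetical here, following the signatures proposed above; i40e_tx_entry is the driver-internal type from i40e_rxtx.h, and the sketch assumes all mbufs in the ring come from the mempool that owns the cache:

/* Hypothetical: return up to nb_free mbufs from the Tx SW ring straight
 * into the mempool cache, without a temporary array on the stack. */
static inline void
tx_free_bufs_zc(struct rte_mempool_cache *mc, struct i40e_tx_entry *txep,
		uint16_t nb_free)
{
	uint32_t n = nb_free;
	uint32_t i;
	void **slots = mempool_cache_put_zc_start(mc, &n, 0);	/* proposed API */

	for (i = 0; i != n; i++) {
		struct rte_mbuf *mb = rte_pktmbuf_prefree_seg(txep[i].mbuf);

		if (mb == NULL)
			break;		/* refcnt > 1 etc.: leave the fast path */
		slots[i] = mb;		/* store directly into the cache slot */
		txep[i].mbuf = NULL;
	}

	mempool_cache_put_zc_finish(mc, i);	/* proposed API */
}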
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-05-24 1:25 ` Konstantin Ananyev
2022-05-24 12:40 ` Morten Brørup
@ 2022-05-24 20:14 ` Honnappa Nagarahalli
2022-05-28 12:22 ` Konstantin Ananyev
2022-06-13 5:55 ` 回复: " Feifei Wang
2 siblings, 1 reply; 145+ messages in thread
From: Honnappa Nagarahalli @ 2022-05-24 20:14 UTC (permalink / raw)
To: Konstantin Ananyev, Feifei Wang
Cc: nd, dev, Ruifeng Wang, honnappanagarahalli, nd
<snip>
>
>
> 16/05/2022 07:10, Feifei Wang wrote:
> >
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers resulting in 32*8=256B of
> >>> stores to lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of load from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache. i.e. the receive side will free the buffers from
> >>> transmit side directly into it's software ring. This will avoid the
> >>> 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
> >>>
> >>> However, this solution poses several constraints:
> >>>
> >>> 1)The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have
> >>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>> However, if this is generalized to 'N' ports, the configuration can
> >>> be long. More over the PMD would have to scan a list of transmit
> >>> queues to pull the buffers from.
> >
> >> Just to re-iterate some generic concerns about this proposal:
> >> - We effectively link RX and TX queues - when this feature is enabled,
> >> user can't stop TX queue without stopping linked RX queue first.
> >> Right now user is free to start/stop any queues at his will.
> >> If that feature will allow to link queues from different ports,
> >> then even ports will become dependent and user will have to pay extra
> >> care when managing such ports.
> >
> > [Feifei] When direct rearm enabled, there are two path for thread to
> > choose. If there are enough Tx freed buffers, Rx can put buffers from
> > Tx.
> > Otherwise, Rx will put buffers from mempool as usual. Thus, users do
> > not need to pay much attention managing ports.
>
> What I am talking about: right now different port or different queues of the
> same port can be treated as independent entities:
> in general user is free to start/stop (and even reconfigure in some
> cases) one entity without need to stop other entity.
> I.E user can stop and re-configure TX queue while keep receiving packets from
> RX queue.
> With direct re-arm enabled, I think it wouldn't be possible any more:
> before stopping/reconfiguring TX queue user would have make sure that
> corresponding RX queue wouldn't be used by datapath.
I am trying to understand the problem better. For the TX queue to be stopped, the user must have blocked the data plane from accessing the TX queue. Like Feifei says, the RX side has the normal packet allocation path still available.
Also, this sounds like a corner case to me; we can handle it through checks in the queue_stop API.
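Something along these lines could serve as a guard in the Tx queue stop path. This is only a sketch: direct_rearm_rx_queue_of() and the mapping it reads are hypothetical, while rx_queue_state[] and RTE_ETH_QUEUE_STATE_STARTED are existing ethdev driver-level fields:

/* Hypothetical guard: refuse to stop a Tx queue that is still mapped as
 * a direct-rearm source of a started Rx queue on the same port.
 * direct_rearm_rx_queue_of() is illustrative only. */
static int
tx_queue_stop_direct_rearm_check(struct rte_eth_dev *dev, uint16_t tx_queue_id)
{
	uint16_t rxq = direct_rearm_rx_queue_of(dev, tx_queue_id);

	if (rxq != RTE_MAX_QUEUES_PER_PORT &&
	    dev->data->rx_queue_state[rxq] == RTE_ETH_QUEUE_STATE_STARTED)
		return -EBUSY;	/* Rx may still pull buffers from this Tx queue */

	return 0;
}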
>
> >
> >> - very limited usage scenario - it will have a positive effect only
> >> when we have a fixed forwarding mapping: all (or nearly all) packets
> >> from the RX queue are forwarded into the same TX queue.
> >
> > [Feifei] Although the usage scenario is limited, this usage scenario
> > has a wide range of applications, such as NIC with one port.
>
> yes, there are NICs with one port, but no guarantee there wouldn't be several
> such NICs within the system.
What I see in my interactions is that a single NIC/DPU is underutilized in a 2-socket system. Some are adding more sockets to the system to better utilize the DPU. The NIC bandwidth continues to grow significantly. I do not think there will be a multi-DPU per server scenario.
>
> > Furtrhermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve best performance, this means we need to give up
> > some flexibility decisively. For example of 'FAST_FREE Mode', it
> > deletes most of the buffer check (refcnt > 1, external buffer, chain
> > buffer), chooses a shorest path, and then achieve significant performance
> improvement.
> >> Wonder did you had a chance to consider mempool-cache ZC API, similar
> >> to one we have for the ring?
> >> It would allow us on TX free path to avoid copying mbufs to temporary
> >> array on the stack.
> >> Instead we can put them straight from TX SW ring to the mempool cache.
> >> That should save extra store/load for mbuf and might help to achieve
> >> some performance gain without by-passing mempool.
> >> It probably wouldn't be as fast as what you proposing, but might be
> >> fast enough to consider as alternative.
> >> Again, it would be a generic one, so we can avoid all these
> >> implications and limitations.
> >
> > [Feifei] I think this is a good try. However, the most important thing
> > is that if we can bypass the mempool decisively to pursue the
> > significant performance gains.
>
> I understand the intention, and I personally think this is wrong and dangerous
> attitude.
> We have mempool abstraction in place for very good reason.
> So we need to try to improve mempool performance (and API if necessary) at
> first place, not to avoid it and break our own rules and recommendations.
The abstraction can be thought of at a higher level, i.e. the driver manages the buffer allocation/free and this is hidden from the application. The application does not need to be aware of how these changes are implemented.
>
>
> > For ZC, there maybe a problem for it in i40e. The reason for that put
> > Tx buffers into temporary is that i40e_tx_entry includes buffer
> > pointer and index.
> > Thus we cannot put Tx SW_ring entry into mempool directly, we need to
> > firstlt extract mbuf pointer. Finally, though we use ZC, we still
> > can't avoid using a temporary stack to extract Tx buffer pointers.
>
> When talking about ZC API for mempool cache I meant something like:
> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> uint32_t *nb_elem, uint32_t flags); void mempool_cache_put_zc_finish(struct
> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return user a
> pointer inside mp-cache where to put free elems and max number of slots
> that can be safely filled.
> _finish_ will update mc->len.
> As an example:
>
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
>
> /* free up to n elems */
> for (i = 0; i != n; i++) {
>
> /* get next free mbuf from somewhere */
> mb = extract_and_prefree_mbuf(...);
>
> /* no more free mbufs for now */
> if (mb == NULL)
> break;
>
> p[i] = mb;
> }
>
> /* finalize ZC put, with _i_ freed elems */ mempool_cache_put_zc_finish(mc,
> i);
>
> That way, I think we can overcome the issue with i40e_tx_entry you
> mentioned above. Plus it might be useful in other similar places.
>
> Another alternative is obviously to split i40e_tx_entry into two structs (one
> for mbuf, second for its metadata) and have a separate array for each of
> them.
> Though with that approach we need to make sure no perf drops will be
> introduced, plus probably more code changes will be required.
Commit '5171b4ee6b6' already does this (in a different way), but just for AVX512. Unfortunately, it does not record any performance improvements. We could port this to Arm NEON and look at the performance.
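For clarity, a minimal sketch of what such a split could look like. The names are illustrative, not the actual i40e definitions; the point is only that the mbuf pointers become a plain, contiguous array:

/* Keep the mbuf pointers in their own array so the Tx free path can hand
 * a contiguous slice of it to the mempool (or a ZC put) without first
 * copying pointers to a temporary array on the stack. */
struct tx_entry_mbuf {
	struct rte_mbuf *mbuf;		/* just the buffer pointer */
};

struct tx_entry_meta {
	uint16_t next_id;		/* index of the next entry in the ring */
	uint16_t last_id;		/* index of the packet's last descriptor */
};

struct tx_sw_ring {
	struct tx_entry_mbuf *mbufs;	/* nb_tx_desc entries */
	struct tx_entry_meta *meta;	/* parallel array of the same length */
};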
>
>
>
>
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-05-11 22:33 ` Konstantin Ananyev
@ 2022-05-27 11:28 ` Konstantin Ananyev
2022-05-31 17:14 ` Honnappa Nagarahalli
0 siblings, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2022-05-27 11:28 UTC (permalink / raw)
To: Honnappa Nagarahalli, dev
Cc: honnappanagarahalli, feifei.wang2, ruifeng.wang, nd
25/05/2022 01:24, Honnappa Nagarahalli wrote:
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>
> 20/04/2022 09:16, Feifei Wang wrote:
>>> Enable direct rearm mode. The mapping is decided in the data plane based
>>> on the first packet received.
>>>
>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>> ---
>>> examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
>>> index bec22c44cd..38ffdf4636 100644
>>> --- a/examples/l3fwd/l3fwd_lpm.c
>>> +++ b/examples/l3fwd/l3fwd_lpm.c
>>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>> unsigned lcore_id;
>>> uint64_t prev_tsc, diff_tsc, cur_tsc;
>>> int i, nb_rx;
>>> - uint16_t portid;
>>> + uint16_t portid, tx_portid;
>>> uint8_t queueid;
>>> struct lcore_conf *qconf;
>>> const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
>>> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
>>> const uint16_t n_rx_q = qconf->n_rx_queue;
>>> const uint16_t n_tx_p = qconf->n_tx_port;
>>> + int direct_rearm_map[n_rx_q];
>>> +
>>> if (n_rx_q == 0) {
>>> RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n",
>>> lcore_id);
>>> return 0;
>>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>> portid = qconf->rx_queue_list[i].port_id;
>>> queueid = qconf->rx_queue_list[i].queue_id;
>>> + direct_rearm_map[i] = 0;
>>> RTE_LOG(INFO, L3FWD,
>>> " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
>>> lcore_id, portid, queueid);
>>> @@ -209,6 +212,17 @@ lpm_main_loop(__rte_unused void *dummy)
>>> if (nb_rx == 0)
>>> continue;
>>> + /* Determine the direct rearm mapping based on the first
>>> + * packet received on the rx queue
>>> + */
>>> + if (direct_rearm_map[i] == 0) {
>>> + tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
>>> + portid);
>>> + rte_eth_direct_rxrearm_map(portid, queueid,
>>> + tx_portid, queueid);
>>> + direct_rearm_map[i] = 1;
>>> + }
>>> +
>
>> That just doesn't look right to me: why to make decision based on the
>> first packet?
> The TX queue depends on the incoming packet. So, this method covers more
> scenarios
> than doing it in the control plane where the outgoing queue is not known.
>
>
>> What would happen if second and all other packets have to be routed
>> to different ports?
> This is an example application and it should be fine to make this
> assumption.
> More over, it does not cause any problems if packets change in between.
> When
> the packets change back, the feature works again.
>
>> In fact, this direct-rearm mode seems suitable only for hard-coded
>> one to one mapped forwarding (examples/l2fwd, testpmd).
>> For l3fwd it can be used safely only when we have one port in use.
> Can you elaborate more on the safety issue when more than one port is used?
>
>> Also I think it should be selected at init-time and
>> it shouldn't be on by default.
>> To summarize, my opinion:
>> special cmd-line parameter to enable it.
> Can you please elaborate why a command line parameter is required? Other
> similar
> features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are enabled without a
> command
> line parameter. IMO, this is how it should ber. Essentially we are
> trying to measure
> how different PMDs perform, the ones that have implemented performance
> improvement
> features would show better performance (i.e. the PMDs implementing the
> features
> should not be penalized by asking for additional user input).
From my perspective, the main purpose of the l3fwd application is to demonstrate
DPDK's ability to do packet routing based on input packet contents.
Making guesses about packet contents is a change in expected behavior.
For some cases it might improve performance; for many others it
will most likely cause a performance drop.
I think that a performance drop as the default behavior
(running with the same parameters as before) should not be allowed.
Plus, you did not provide the ability to switch off that behavior,
if undesired.
About the comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
default enablement - I don't think it is correct.
Within the l3fwd app we can safely guarantee that all
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE pre-requisites are met:
in each TX queue all mbufs will belong to the same mempool and
their refcnt will always be equal to one.
Here you are making guesses about the contents of input packets,
without any guarantee that your guess will always be valid.
BTW, what's wrong with using l2fwd to demonstrate that feature?
Seems like a natural choice to me.
>> allowable only when we run l3fwd over one port.
>
>
>>> #if defined RTE_ARCH_X86 || defined __ARM_NEON \
>>> || defined RTE_ARCH_PPC_64
>>> l3fwd_lpm_send_packets(nb_rx, pkts_burst,
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-05-24 20:14 ` Honnappa Nagarahalli
@ 2022-05-28 12:22 ` Konstantin Ananyev
2022-06-01 1:00 ` Honnappa Nagarahalli
0 siblings, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2022-05-28 12:22 UTC (permalink / raw)
To: Honnappa Nagarahalli, Feifei Wang
Cc: nd, dev, Ruifeng Wang, honnappanagarahalli
24/05/2022 21:14, Honnappa Nagarahalli wrote:
> <snip>
>
>>
>>
>> 16/05/2022 07:10, Feifei Wang wrote:
>>>
>>>>> Currently, the transmit side frees the buffers into the lcore cache
>>>>> and the receive side allocates buffers from the lcore cache. The
>>>>> transmit side typically frees 32 buffers resulting in 32*8=256B of
>>>>> stores to lcore cache. The receive side allocates 32 buffers and
>>>>> stores them in the receive side software ring, resulting in
>>>>> 32*8=256B of stores and 256B of load from the lcore cache.
>>>>>
>>>>> This patch proposes a mechanism to avoid freeing to/allocating from
>>>>> the lcore cache. i.e. the receive side will free the buffers from
>>>>> transmit side directly into it's software ring. This will avoid the
>>>>> 256B of loads and stores introduced by the lcore cache. It also
>>>>> frees up the cache lines used by the lcore cache.
>>>>>
>>>>> However, this solution poses several constraints:
>>>>>
>>>>> 1)The receive queue needs to know which transmit queue it should
>>>>> take the buffers from. The application logic decides which transmit
>>>>> port to use to send out the packets. In many use cases the NIC might
>>>>> have a single port ([1], [2], [3]), in which case a given transmit
>>>>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
>>>>> queue). This is easy to configure.
>>>>>
>>>>> If the NIC has 2 ports (there are several references), then we will
>>>>> have
>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
>>>>> However, if this is generalized to 'N' ports, the configuration can
>>>>> be long. More over the PMD would have to scan a list of transmit
>>>>> queues to pull the buffers from.
>>>
>>>> Just to re-iterate some generic concerns about this proposal:
>>>> - We effectively link RX and TX queues - when this feature is enabled,
>>>> user can't stop TX queue without stopping linked RX queue first.
>>>> Right now user is free to start/stop any queues at his will.
>>>> If that feature will allow to link queues from different ports,
>>>> then even ports will become dependent and user will have to pay extra
>>>> care when managing such ports.
>>>
>>> [Feifei] When direct rearm enabled, there are two path for thread to
>>> choose. If there are enough Tx freed buffers, Rx can put buffers from
>>> Tx.
>>> Otherwise, Rx will put buffers from mempool as usual. Thus, users do
>>> not need to pay much attention managing ports.
>>
>> What I am talking about: right now different port or different queues of the
>> same port can be treated as independent entities:
>> in general user is free to start/stop (and even reconfigure in some
>> cases) one entity without need to stop other entity.
>> I.E user can stop and re-configure TX queue while keep receiving packets from
>> RX queue.
>> With direct re-arm enabled, I think it wouldn't be possible any more:
>> before stopping/reconfiguring TX queue user would have make sure that
>> corresponding RX queue wouldn't be used by datapath.
> I am trying to understand the problem better. For the TX queue to be stopped, the user must have blocked the data plane from accessing the TX queue.
Surely it is the user's responsibility not to call tx_burst()
for a stopped/released queue.
The problem is that while TX for that queue is stopped,
RX for the related queue can still continue.
So rx_burst() will try to read/modify TX queue data
that might already be freed, or be simultaneously modified by the control path.
Again, it all can be mitigated by carefully re-designing and
modifying the control and data paths inside the user app -
by doing extra checks and synchronizations, etc.
But from a practical point of view, I presume most users would simply
avoid using this feature due to all the potential problems it might cause.
> Like Feifei says, the RX side has the normal packet allocation path still available.
> Also this sounds like a corner case to me, we can handle this through checks in the queue_stop API.
Depends.
If it is only allowed to link queues from the same port,
then yes, extra checks for queue-stop might be enough,
since right now DPDK doesn't allow the user to change the number of
queues without dev_stop() first.
Though if it is allowed to link queues from different ports,
then the situation will be much worse.
Right now ports are totally independent entities
(except some special cases like link-bonding, etc.):
while one port keeps doing RX/TX, a second one can be stopped,
re-configured, even detached, and a newly attached device might
re-use the same port number.
>>
>>>
>>>> - very limited usage scenario - it will have a positive effect only
>>>> when we have a fixed forwarding mapping: all (or nearly all) packets
>>>> from the RX queue are forwarded into the same TX queue.
>>>
>>> [Feifei] Although the usage scenario is limited, this usage scenario
>>> has a wide range of applications, such as NIC with one port.
>>
>> yes, there are NICs with one port, but no guarantee there wouldn't be several
>> such NICs within the system.
> What I see in my interactions is, a single NIC/DPU is under utilized for a 2 socket system. Some are adding more sockets to the system to better utilize the DPU. The NIC bandwidth continues to grow significantly. I do not think there will be a multi-DPU per server scenario.
Interesting... from my experience it is vice versa:
in many cases 200Gb/s is not that much these days
to saturate a modern 2-socket x86 server.
Though I suppose a lot depends on the particular HW and actual workload.
>
>>
>>> Furtrhermore, I think this is a tradeoff between performance and
>>> flexibility.
>>> Our goal is to achieve best performance, this means we need to give up
>>> some flexibility decisively. For example of 'FAST_FREE Mode', it
>>> deletes most of the buffer check (refcnt > 1, external buffer, chain
>>> buffer), chooses a shorest path, and then achieve significant performance
>> improvement.
>>>> Wonder did you had a chance to consider mempool-cache ZC API, similar
>>>> to one we have for the ring?
>>>> It would allow us on TX free path to avoid copying mbufs to temporary
>>>> array on the stack.
>>>> Instead we can put them straight from TX SW ring to the mempool cache.
>>>> That should save extra store/load for mbuf and might help to achieve
>>>> some performance gain without by-passing mempool.
>>>> It probably wouldn't be as fast as what you proposing, but might be
>>>> fast enough to consider as alternative.
>>>> Again, it would be a generic one, so we can avoid all these
>>>> implications and limitations.
>>>
>>> [Feifei] I think this is a good try. However, the most important thing
>>> is that if we can bypass the mempool decisively to pursue the
>>> significant performance gains.
>>
>> I understand the intention, and I personally think this is wrong and dangerous
>> attitude.
>> We have mempool abstraction in place for very good reason.
>> So we need to try to improve mempool performance (and API if necessary) at
>> first place, not to avoid it and break our own rules and recommendations.
> The abstraction can be thought of at a higher level. i.e. the driver manages the buffer allocation/free and is hidden from the application. The application does not need to be aware of how these changes are implemented.
>
>>
>>
>>> For ZC, there maybe a problem for it in i40e. The reason for that put
>>> Tx buffers into temporary is that i40e_tx_entry includes buffer
>>> pointer and index.
>>> Thus we cannot put Tx SW_ring entry into mempool directly, we need to
>>> firstlt extract mbuf pointer. Finally, though we use ZC, we still
>>> can't avoid using a temporary stack to extract Tx buffer pointers.
>>
>> When talking about ZC API for mempool cache I meant something like:
>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
>> uint32_t *nb_elem, uint32_t flags); void mempool_cache_put_zc_finish(struct
>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return user a
>> pointer inside mp-cache where to put free elems and max number of slots
>> that can be safely filled.
>> _finish_ will update mc->len.
>> As an example:
>>
>> /* expect to free N mbufs */
>> uint32_t n = N;
>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
>>
>> /* free up to n elems */
>> for (i = 0; i != n; i++) {
>>
>> /* get next free mbuf from somewhere */
>> mb = extract_and_prefree_mbuf(...);
>>
>> /* no more free mbufs for now */
>> if (mb == NULL)
>> break;
>>
>> p[i] = mb;
>> }
>>
>> /* finalize ZC put, with _i_ freed elems */ mempool_cache_put_zc_finish(mc,
>> i);
>>
>> That way, I think we can overcome the issue with i40e_tx_entry you
>> mentioned above. Plus it might be useful in other similar places.
>>
>> Another alternative is obviously to split i40e_tx_entry into two structs (one
>> for mbuf, second for its metadata) and have a separate array for each of
>> them.
>> Though with that approach we need to make sure no perf drops will be
>> introduced, plus probably more code changes will be required.
> Commit '5171b4ee6b6" already does this (in a different way), but just for AVX512. Unfortunately, it does not record any performance improvements. We could port this to Arm NEON and look at the performance.
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-05-27 11:28 ` Konstantin Ananyev
@ 2022-05-31 17:14 ` Honnappa Nagarahalli
2022-06-03 10:32 ` Andrew Rybchenko
2022-06-06 11:27 ` Konstantin Ananyev
0 siblings, 2 replies; 145+ messages in thread
From: Honnappa Nagarahalli @ 2022-05-31 17:14 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: honnappanagarahalli, Feifei Wang, Ruifeng Wang, nd, nd
<snip>
>
> 25/05/2022 01:24, Honnappa Nagarahalli wrote:
> > From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> >
> > 20/04/2022 09:16, Feifei Wang wrote:
> >>> Enable direct rearm mode. The mapping is decided in the data plane
> >>> based on the first packet received.
> >>>
> >>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>> ---
> >>> examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
> >>> 1 file changed, 15 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
> >>> index bec22c44cd..38ffdf4636 100644
> >>> --- a/examples/l3fwd/l3fwd_lpm.c
> >>> +++ b/examples/l3fwd/l3fwd_lpm.c
> >>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
> >>> unsigned lcore_id;
> >>> uint64_t prev_tsc, diff_tsc, cur_tsc;
> >>> int i, nb_rx;
> >>> - uint16_t portid;
> >>> + uint16_t portid, tx_portid;
> >>> uint8_t queueid;
> >>> struct lcore_conf *qconf;
> >>> const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
> >>> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
> >>> const uint16_t n_rx_q = qconf->n_rx_queue;
> >>> const uint16_t n_tx_p = qconf->n_tx_port;
> >>> + int direct_rearm_map[n_rx_q];
> >>> +
> >>> if (n_rx_q == 0) {
> >>> RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n",
> >>> lcore_id);
> >>> return 0;
> >>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
> >>> portid = qconf->rx_queue_list[i].port_id;
> >>> queueid = qconf->rx_queue_list[i].queue_id;
> >>> + direct_rearm_map[i] = 0;
> >>> RTE_LOG(INFO, L3FWD,
> >>> " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
> >>> lcore_id, portid, queueid); @@ -209,6 +212,17 @@
> >>> lpm_main_loop(__rte_unused void *dummy)
> >>> if (nb_rx == 0)
> >>> continue;
> >>> + /* Determine the direct rearm mapping based on the
> >>> +first
> >>> + * packet received on the rx queue
> >>> + */
> >>> + if (direct_rearm_map[i] == 0) {
> >>> + tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
> >>> + portid);
> >>> + rte_eth_direct_rxrearm_map(portid, queueid,
> >>> + tx_portid, queueid);
> >>> + direct_rearm_map[i] = 1;
> >>> + }
> >>> +
> >
> >> That just doesn't look right to me: why to make decision based on the
> >> first packet?
> > The TX queue depends on the incoming packet. So, this method covers
> > more scenarios than doing it in the control plane where the outgoing
> > queue is not known.
> >
> >
> >> What would happen if second and all other packets have to be routed
> >> to different ports?
> > This is an example application and it should be fine to make this
> > assumption.
> > More over, it does not cause any problems if packets change in between.
> > When
> > the packets change back, the feature works again.
> >
> >> In fact, this direct-rearm mode seems suitable only for hard-coded
> >> one to one mapped forwarding (examples/l2fwd, testpmd).
> >> For l3fwd it can be used safely only when we have one port in use.
> > Can you elaborate more on the safety issue when more than one port is
> used?
> >
> >> Also I think it should be selected at init-time and it shouldn't be
> >> on by default.
> >> To summarize, my opinion:
> >> special cmd-line parameter to enable it.
> > Can you please elaborate why a command line parameter is required?
> > Other similar features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are
> > enabled without a command line parameter. IMO, this is how it should
> > ber. Essentially we are trying to measure how different PMDs perform,
> > the ones that have implemented performance improvement features
> would
> > show better performance (i.e. the PMDs implementing the features
> > should not be penalized by asking for additional user input).
>
> From my perspective, main purpose of l3fwd application is to demonstrate
> DPDK ability to do packet routing based on input packet contents.
> Making guesses about packet contents is a change in expected behavior.
> For some cases it might improve performance, for many others - will most
> likely cause performance drop.
> I think that performance drop as default behavior (running the same
> parameters as before) should not be allowed.
> Plus you did not provided ability to switch off that behavior, if undesired.
There is no drop in L3fwd performance due to this patch.
>
> About comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE default
> enablement - I don't think it is correct.
> Within l3fwd app we can safely guarantee that all
> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE pre-requirements are met:
> in each TX queue all mbufs will belong to the same mempool and their refcnt
> will always equal to one.
> Here you are making guesses about contents of input packets, without any
> guarantee that you guess will always be valid.
This is not a guess. The code inspects the incoming traffic and configures the mapping accordingly, so it should be correct. Since it is a sample application, we do not expect the traffic to be complex. If it is complex, the performance will be the same as before or better.
>
> BTW, what's wrong with using l2fwd to demonstrate that feature?
> Seems like a natural choice to me.
The performance of the L3fwd application in DPDK has become an industry standard, hence we need to showcase the performance in the L3fwd application.
>
> >> allowable only when we run l3fwd over one port.
> >
> >
> >>> #if defined RTE_ARCH_X86 || defined __ARM_NEON \
> >>> || defined RTE_ARCH_PPC_64
> >>> l3fwd_lpm_send_packets(nb_rx, pkts_burst,
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-05-28 12:22 ` Konstantin Ananyev
@ 2022-06-01 1:00 ` Honnappa Nagarahalli
2022-06-03 23:32 ` Konstantin Ananyev
0 siblings, 1 reply; 145+ messages in thread
From: Honnappa Nagarahalli @ 2022-06-01 1:00 UTC (permalink / raw)
To: Konstantin Ananyev, Feifei Wang
Cc: nd, dev, Ruifeng Wang, honnappanagarahalli, nd
<snip>
> >
> >>
> >>
> >> 16/05/2022 07:10, Feifei Wang пишет:
> >>>
> >>>>> Currently, the transmit side frees the buffers into the lcore
> >>>>> cache and the receive side allocates buffers from the lcore cache.
> >>>>> The transmit side typically frees 32 buffers resulting in
> >>>>> 32*8=256B of stores to lcore cache. The receive side allocates 32
> >>>>> buffers and stores them in the receive side software ring,
> >>>>> resulting in 32*8=256B of stores and 256B of load from the lcore cache.
> >>>>>
> >>>>> This patch proposes a mechanism to avoid freeing to/allocating
> >>>>> from the lcore cache. i.e. the receive side will free the buffers
> >>>>> from transmit side directly into it's software ring. This will
> >>>>> avoid the 256B of loads and stores introduced by the lcore cache.
> >>>>> It also frees up the cache lines used by the lcore cache.
> >>>>>
> >>>>> However, this solution poses several constraints:
> >>>>>
> >>>>> 1)The receive queue needs to know which transmit queue it should
> >>>>> take the buffers from. The application logic decides which
> >>>>> transmit port to use to send out the packets. In many use cases
> >>>>> the NIC might have a single port ([1], [2], [3]), in which case a
> >>>>> given transmit queue is always mapped to a single receive queue
> >>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> >>>>>
> >>>>> If the NIC has 2 ports (there are several references), then we
> >>>>> will have
> >>>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>>>> However, if this is generalized to 'N' ports, the configuration
> >>>>> can be long. More over the PMD would have to scan a list of
> >>>>> transmit queues to pull the buffers from.
> >>>
> >>>> Just to re-iterate some generic concerns about this proposal:
> >>>> - We effectively link RX and TX queues - when this feature is enabled,
> >>>> user can't stop TX queue without stopping linked RX queue first.
> >>>> Right now user is free to start/stop any queues at his will.
> >>>> If that feature will allow to link queues from different ports,
> >>>> then even ports will become dependent and user will have to pay extra
> >>>> care when managing such ports.
> >>>
> >>> [Feifei] When direct rearm enabled, there are two path for thread to
> >>> choose. If there are enough Tx freed buffers, Rx can put buffers
> >>> from Tx.
> >>> Otherwise, Rx will put buffers from mempool as usual. Thus, users do
> >>> not need to pay much attention managing ports.
> >>
> >> What I am talking about: right now different port or different queues
> >> of the same port can be treated as independent entities:
> >> in general user is free to start/stop (and even reconfigure in some
> >> cases) one entity without need to stop other entity.
> >> I.E user can stop and re-configure TX queue while keep receiving
> >> packets from RX queue.
> >> With direct re-arm enabled, I think it wouldn't be possible any more:
> >> before stopping/reconfiguring TX queue user would have make sure that
> >> corresponding RX queue wouldn't be used by datapath.
> > I am trying to understand the problem better. For the TX queue to be stopped,
> the user must have blocked the data plane from accessing the TX queue.
>
> Surely it is user responsibility tnot to call tx_burst() for stopped/released queue.
> The problem is that while TX for that queue is stopped, RX for related queue still
> can continue.
> So rx_burst() will try to read/modify TX queue data, that might be already freed,
> or simultaneously modified by control path.
Understood, agree on the issue
>
> Again, it all can be mitigated by carefully re-designing and modifying control and
> data-path inside user app - by doing extra checks and synchronizations, etc.
> But from practical point - I presume most of users simply would avoid using this
> feature due all potential problems it might cause.
That is subjective; it all depends on the performance improvements users see in their application.
IMO, the performance improvement seen with this patch is worth the few changes.
>
> > Like Feifei says, the RX side has the normal packet allocation path still available.
> > Also this sounds like a corner case to me, we can handle this through checks in
> the queue_stop API.
>
> Depends.
> if it would be allowed to link queues only from the same port, then yes, extra
> checks for queue-stop might be enough.
> As right now DPDK doesn't allow user to change number of queues without
> dev_stop() first.
> Though if it would be allowed to link queues from different ports, then situation
> will be much worse.
> Right now ports are totally independent entities (except some special cases like
> link-bonding, etc.).
> As one port can keep doing RX/TX, second one can be stopped, re-confgured,
> even detached, and newly attached device might re-use same port number.
I see this as a similar restriction to the one discussed above. Do you see any issues if we enforce this with checks?
>
>
> >>
> >>>
> >>>> - very limited usage scenario - it will have a positive effect only
> >>>> when we have a fixed forwarding mapping: all (or nearly all) packets
> >>>> from the RX queue are forwarded into the same TX queue.
> >>>
> >>> [Feifei] Although the usage scenario is limited, this usage scenario
> >>> has a wide range of applications, such as NIC with one port.
> >>
> >> yes, there are NICs with one port, but no guarantee there wouldn't be
> >> several such NICs within the system.
> > What I see in my interactions is, a single NIC/DPU is under utilized for a 2
> socket system. Some are adding more sockets to the system to better utilize the
> DPU. The NIC bandwidth continues to grow significantly. I do not think there will
> be a multi-DPU per server scenario.
>
>
> Interesting... from my experience it is visa-versa:
> in many cases 200Gb/s is not that much these days to saturate modern 2 socket
> x86 server.
> Though I suppose a lot depends on particular HW and actual workload.
>
> >
> >>
> >>> Furtrhermore, I think this is a tradeoff between performance and
> >>> flexibility.
> >>> Our goal is to achieve best performance, this means we need to give
> >>> up some flexibility decisively. For example of 'FAST_FREE Mode', it
> >>> deletes most of the buffer check (refcnt > 1, external buffer, chain
> >>> buffer), chooses a shorest path, and then achieve significant
> >>> performance
> >> improvement.
> >>>> Wonder did you had a chance to consider mempool-cache ZC API,
> >>>> similar to one we have for the ring?
> >>>> It would allow us on TX free path to avoid copying mbufs to
> >>>> temporary array on the stack.
> >>>> Instead we can put them straight from TX SW ring to the mempool cache.
> >>>> That should save extra store/load for mbuf and might help to
> >>>> achieve some performance gain without by-passing mempool.
> >>>> It probably wouldn't be as fast as what you proposing, but might be
> >>>> fast enough to consider as alternative.
> >>>> Again, it would be a generic one, so we can avoid all these
> >>>> implications and limitations.
> >>>
> >>> [Feifei] I think this is a good try. However, the most important
> >>> thing is that if we can bypass the mempool decisively to pursue the
> >>> significant performance gains.
> >>
> >> I understand the intention, and I personally think this is wrong and
> >> dangerous attitude.
> >> We have mempool abstraction in place for very good reason.
> >> So we need to try to improve mempool performance (and API if
> >> necessary) at first place, not to avoid it and break our own rules and
> recommendations.
> > The abstraction can be thought of at a higher level. i.e. the driver manages the
> buffer allocation/free and is hidden from the application. The application does
> not need to be aware of how these changes are implemented.
> >
> >>
> >>
> >>> For ZC, there maybe a problem for it in i40e. The reason for that
> >>> put Tx buffers into temporary is that i40e_tx_entry includes buffer
> >>> pointer and index.
> >>> Thus we cannot put Tx SW_ring entry into mempool directly, we need
> >>> to firstlt extract mbuf pointer. Finally, though we use ZC, we still
> >>> can't avoid using a temporary stack to extract Tx buffer pointers.
> >>
> >> When talking about ZC API for mempool cache I meant something like:
> >> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> >> uint32_t *nb_elem, uint32_t flags); void
> >> mempool_cache_put_zc_finish(struct
> >> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return
> >> user a pointer inside mp-cache where to put free elems and max number
> >> of slots that can be safely filled.
> >> _finish_ will update mc->len.
> >> As an example:
> >>
> >> /* expect to free N mbufs */
> >> uint32_t n = N;
> >> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> >>
> >> /* free up to n elems */
> >> for (i = 0; i != n; i++) {
> >>
> >> /* get next free mbuf from somewhere */
> >> mb = extract_and_prefree_mbuf(...);
> >>
> >> /* no more free mbufs for now */
> >> if (mb == NULL)
> >> break;
> >>
> >> p[i] = mb;
> >> }
> >>
> >> /* finalize ZC put, with _i_ freed elems */
> >> mempool_cache_put_zc_finish(mc, i);
> >>
> >> That way, I think we can overcome the issue with i40e_tx_entry you
> >> mentioned above. Plus it might be useful in other similar places.
> >>
> >> Another alternative is obviously to split i40e_tx_entry into two
> >> structs (one for mbuf, second for its metadata) and have a separate
> >> array for each of them.
> >> Though with that approach we need to make sure no perf drops will be
> >> introduced, plus probably more code changes will be required.
> > Commit '5171b4ee6b6" already does this (in a different way), but just for
> AVX512. Unfortunately, it does not record any performance improvements. We
> could port this to Arm NEON and look at the performance.
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
2022-05-10 22:49 ` Honnappa Nagarahalli
@ 2022-06-03 10:19 ` Andrew Rybchenko
0 siblings, 0 replies; 145+ messages in thread
From: Andrew Rybchenko @ 2022-06-03 10:19 UTC (permalink / raw)
To: Honnappa Nagarahalli, Feifei Wang, Ray Kinsella
Cc: dev, nd, Ruifeng Wang, thomas, Ferruh Yigit
On 5/11/22 01:49, Honnappa Nagarahalli wrote:
>>> On 4/20/22 11:16, Feifei Wang wrote:
>>>> Add API for enabling direct rearm mode and for mapping RX and TX
>>>> queues. Currently, the API supports 1:1(txq : rxq) mapping.
>>>>
>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>> + (*dev->dev_ops->rx_queue_direct_rearm_enable)(dev,
>>> rx_queue_id);
>>>> + (*dev->dev_ops->rx_queue_direct_rearm_map)(dev, rx_queue_id,
>>>> + tx_port_id, tx_queue_id);
>>>
>>> We must check that function pointers are not NULL as usual.
>>> Return values must be checked.
>> [Feifei] I agree with this, The check for pointer and return value will be added
>>
>>> Isn't is safe to setup map and than enable.
>>> Otherwise we definitely need disable.
>> [Feifei] I will change code that map first and then set 'rxq->offload' to enable
>> direct-rearm mode.
>>
>>> Also, what should happen on Tx port unplug? How to continue if we
>>> still have Rx port up and running?
>> [Feifei] For direct rearm mode, if Tx port unplug, it means there is no buffer
>> from Tx.
>> And then, Rx will put buffer from mempool as usual for rearm.
> Andrew, when you say 'TX port unplug', do you mean the 'rte_eth_dev_tx_queue_stop' is called? Is calling 'rte_eth_dev_tx_queue_stop' allowed when the device is running?
I think that deferred start and the presence of
rte_eth_dev_tx_queue_stop() imply the possibility of stopping
a Tx queue. But, yes, the application should take care that
no traffic is still running to that Tx queue.
Anyway, I was talking about hot unplug of the entire
device used as the Tx port in the above config.
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-05-31 17:14 ` Honnappa Nagarahalli
@ 2022-06-03 10:32 ` Andrew Rybchenko
2022-06-06 11:27 ` Konstantin Ananyev
1 sibling, 0 replies; 145+ messages in thread
From: Andrew Rybchenko @ 2022-06-03 10:32 UTC (permalink / raw)
To: Honnappa Nagarahalli, Konstantin Ananyev, dev
Cc: honnappanagarahalli, Feifei Wang, Ruifeng Wang, nd
On 5/31/22 20:14, Honnappa Nagarahalli wrote:
> <snip>
>>
>> 25/05/2022 01:24, Honnappa Nagarahalli wrote:
>>> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>>>
>>> 20/04/2022 09:16, Feifei Wang wrote:
>>>>> Enable direct rearm mode. The mapping is decided in the data plane
>>>>> based on the first packet received.
>>>>>
>>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>> ---
>>>>> examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
>>>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
>>>>> index bec22c44cd..38ffdf4636 100644
>>>>> --- a/examples/l3fwd/l3fwd_lpm.c
>>>>> +++ b/examples/l3fwd/l3fwd_lpm.c
>>>>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>> unsigned lcore_id;
>>>>> uint64_t prev_tsc, diff_tsc, cur_tsc;
>>>>> int i, nb_rx;
>>>>> - uint16_t portid;
>>>>> + uint16_t portid, tx_portid;
>>>>> uint8_t queueid;
>>>>> struct lcore_conf *qconf;
>>>>> const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
>>>>> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>> const uint16_t n_rx_q = qconf->n_rx_queue;
>>>>> const uint16_t n_tx_p = qconf->n_tx_port;
>>>>> + int direct_rearm_map[n_rx_q];
>>>>> +
>>>>> if (n_rx_q == 0) {
>>>>> RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n",
>>>>> lcore_id);
>>>>> return 0;
>>>>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>> portid = qconf->rx_queue_list[i].port_id;
>>>>> queueid = qconf->rx_queue_list[i].queue_id;
>>>>> + direct_rearm_map[i] = 0;
>>>>> RTE_LOG(INFO, L3FWD,
>>>>> " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
>>>>> lcore_id, portid, queueid); @@ -209,6 +212,17 @@
>>>>> lpm_main_loop(__rte_unused void *dummy)
>>>>> if (nb_rx == 0)
>>>>> continue;
>>>>> + /* Determine the direct rearm mapping based on the
>>>>> +first
>>>>> + * packet received on the rx queue
>>>>> + */
>>>>> + if (direct_rearm_map[i] == 0) {
>>>>> + tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
>>>>> + portid);
>>>>> + rte_eth_direct_rxrearm_map(portid, queueid,
>>>>> + tx_portid, queueid);
>>>>> + direct_rearm_map[i] = 1;
>>>>> + }
>>>>> +
>>>
>>>> That just doesn't look right to me: why to make decision based on the
>>>> first packet?
>>> The TX queue depends on the incoming packet. So, this method covers
>>> more scenarios than doing it in the control plane where the outgoing
>>> queue is not known.
>>>
>>>
>>>> What would happen if second and all other packets have to be routed
>>>> to different ports?
>>> This is an example application and it should be fine to make this
>>> assumption.
>>> More over, it does not cause any problems if packets change in between.
>>> When
>>> the packets change back, the feature works again.
>>>
>>>> In fact, this direct-rearm mode seems suitable only for hard-coded
>>>> one to one mapped forwarding (examples/l2fwd, testpmd).
>>>> For l3fwd it can be used safely only when we have one port in use.
>>> Can you elaborate more on the safety issue when more than one port is
>> used?
>>>
>>>> Also I think it should be selected at init-time and it shouldn't be
>>>> on by default.
>>>> To summarize, my opinion:
>>>> special cmd-line parameter to enable it.
>>> Can you please elaborate why a command line parameter is required?
>>> Other similar features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are
>>> enabled without a command line parameter. IMO, this is how it should
>>> ber. Essentially we are trying to measure how different PMDs perform,
>>> the ones that have implemented performance improvement features
>> would
>>> show better performance (i.e. the PMDs implementing the features
>>> should not be penalized by asking for additional user input).
>>
>> From my perspective, main purpose of l3fwd application is to demonstrate
>> DPDK ability to do packet routing based on input packet contents.
>> Making guesses about packet contents is a change in expected behavior.
>> For some cases it might improve performance, for many others - will most
>> likely cause performance drop.
>> I think that performance drop as default behavior (running the same
>> parameters as before) should not be allowed.
>> Plus you did not provided ability to switch off that behavior, if undesired.
> There is no drop in L3fwd performance due to this patch.
That is a questionable statement, sorry. The patch adds code on
the data path, so I'm almost confident that it is possible to find
a case where performance drops slightly. Maybe it is always
minor and acceptable.
>>
>> About comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE default
>> enablement - I don't think it is correct.
>> Within l3fwd app we can safely guarantee that all
>> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE pre-requirements are met:
>> in each TX queue all mbufs will belong to the same mempool and their refcnt
>> will always equal to one.
>> Here you are making guesses about contents of input packets, without any
>> guarantee that you guess will always be valid.
> This is not a guess. The code understands the incoming traffic and configures accordingly. So, it should be correct. Since it is a sample application, we do not expect the traffic to be complex. If it is complex, the performance will be the same as before or better.
>
>>
>> BTW, what's wrong with using l2fwd to demonstrate that feature?
>> Seems like a natural choice to me.
> The performance of L3fwd application in DPDK has become a industry standard, hence we need to showcase the performance in L3fwd application.
>
>>
>>>> allowable only when we run l3fwd over one port.
>>>
>>>
>>>>> #if defined RTE_ARCH_X86 || defined __ARM_NEON \
>>>>> || defined RTE_ARCH_PPC_64
>>>>> l3fwd_lpm_send_packets(nb_rx, pkts_burst,
>
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-06-01 1:00 ` Honnappa Nagarahalli
@ 2022-06-03 23:32 ` Konstantin Ananyev
2022-06-04 8:07 ` Morten Brørup
0 siblings, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2022-06-03 23:32 UTC (permalink / raw)
To: Honnappa Nagarahalli, Feifei Wang
Cc: nd, dev, Ruifeng Wang, honnappanagarahalli
> <snip>
>>>
>>>>
>>>>
>>>> 16/05/2022 07:10, Feifei Wang wrote:
>>>>>
>>>>>>> Currently, the transmit side frees the buffers into the lcore
>>>>>>> cache and the receive side allocates buffers from the lcore cache.
>>>>>>> The transmit side typically frees 32 buffers resulting in
>>>>>>> 32*8=256B of stores to lcore cache. The receive side allocates 32
>>>>>>> buffers and stores them in the receive side software ring,
>>>>>>> resulting in 32*8=256B of stores and 256B of load from the lcore cache.
>>>>>>>
>>>>>>> This patch proposes a mechanism to avoid freeing to/allocating
>>>>>>> from the lcore cache. i.e. the receive side will free the buffers
>>>>>>> from transmit side directly into it's software ring. This will
>>>>>>> avoid the 256B of loads and stores introduced by the lcore cache.
>>>>>>> It also frees up the cache lines used by the lcore cache.
>>>>>>>
>>>>>>> However, this solution poses several constraints:
>>>>>>>
>>>>>>> 1)The receive queue needs to know which transmit queue it should
>>>>>>> take the buffers from. The application logic decides which
>>>>>>> transmit port to use to send out the packets. In many use cases
>>>>>>> the NIC might have a single port ([1], [2], [3]), in which case a
>>>>>>> given transmit queue is always mapped to a single receive queue
>>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
>>>>>>>
>>>>>>> If the NIC has 2 ports (there are several references), then we
>>>>>>> will have
>>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
>>>>>>> However, if this is generalized to 'N' ports, the configuration
>>>>>>> can be long. More over the PMD would have to scan a list of
>>>>>>> transmit queues to pull the buffers from.
>>>>>
>>>>>> Just to re-iterate some generic concerns about this proposal:
>>>>>> - We effectively link RX and TX queues - when this feature is enabled,
>>>>>> user can't stop TX queue without stopping linked RX queue first.
>>>>>> Right now user is free to start/stop any queues at his will.
>>>>>> If that feature will allow to link queues from different ports,
>>>>>> then even ports will become dependent and user will have to pay extra
>>>>>> care when managing such ports.
>>>>>
>>>>> [Feifei] When direct rearm enabled, there are two path for thread to
>>>>> choose. If there are enough Tx freed buffers, Rx can put buffers
>>>>> from Tx.
>>>>> Otherwise, Rx will put buffers from mempool as usual. Thus, users do
>>>>> not need to pay much attention managing ports.
>>>>
>>>> What I am talking about: right now different port or different queues
>>>> of the same port can be treated as independent entities:
>>>> in general user is free to start/stop (and even reconfigure in some
>>>> cases) one entity without need to stop other entity.
>>>> I.E user can stop and re-configure TX queue while keep receiving
>>>> packets from RX queue.
>>>> With direct re-arm enabled, I think it wouldn't be possible any more:
>>>> before stopping/reconfiguring TX queue user would have make sure that
>>>> corresponding RX queue wouldn't be used by datapath.
>>> I am trying to understand the problem better. For the TX queue to be stopped,
>> the user must have blocked the data plane from accessing the TX queue.
>>
>> Surely it is user responsibility tnot to call tx_burst() for stopped/released queue.
>> The problem is that while TX for that queue is stopped, RX for related queue still
>> can continue.
>> So rx_burst() will try to read/modify TX queue data, that might be already freed,
>> or simultaneously modified by control path.
> Understood, agree on the issue
>
>>
>> Again, it all can be mitigated by carefully re-designing and modifying control and
>> data-path inside user app - by doing extra checks and synchronizations, etc.
>> But from practical point - I presume most of users simply would avoid using this
>> feature due all potential problems it might cause.
> That is subjective, it all depends on the performance improvements users see in their application.
> IMO, the performance improvement seen with this patch is worth few changes.
Yes, it is subjective to some extent, though my feeling is
that it might end up being a sort of synthetic improvement used only
by some show-case benchmarks.
From my perspective, it would be much more plausible
if we could introduce some sort of generic improvement that doesn't
impose all these extra constraints and implications,
like the ZC mempool approach discussed below in this thread.
>>
>>> Like Feifei says, the RX side has the normal packet allocation path still available.
>>> Also this sounds like a corner case to me, we can handle this through checks in
>> the queue_stop API.
>>
>> Depends.
>> if it would be allowed to link queues only from the same port, then yes, extra
>> checks for queue-stop might be enough.
>> As right now DPDK doesn't allow user to change number of queues without
>> dev_stop() first.
>> Though if it would be allowed to link queues from different ports, then situation
>> will be much worse.
>> Right now ports are totally independent entities (except some special cases like
>> link-bonding, etc.).
>> As one port can keep doing RX/TX, second one can be stopped, re-confgured,
>> even detached, and newly attached device might re-use same port number.
> I see this as a similar restriction to the one discussed above.
Yes, they are similar in principle, though I think that the case
with queues from different ports would make things much more complex.
> Do you see any issues if we enforce this with checks?
Hard to tell straight away; a lot will depend on how smart
such an implementation would be.
Usually DPDK tends to avoid heavy
synchronization within its data-path functions.
>>
>>
>>>>
>>>>>
>>>>>> - very limited usage scenario - it will have a positive effect only
>>>>>> when we have a fixed forwarding mapping: all (or nearly all) packets
>>>>>> from the RX queue are forwarded into the same TX queue.
>>>>>
>>>>> [Feifei] Although the usage scenario is limited, this usage scenario
>>>>> has a wide range of applications, such as NIC with one port.
>>>>
>>>> yes, there are NICs with one port, but no guarantee there wouldn't be
>>>> several such NICs within the system.
>>> What I see in my interactions is, a single NIC/DPU is under utilized for a 2
>> socket system. Some are adding more sockets to the system to better utilize the
>> DPU. The NIC bandwidth continues to grow significantly. I do not think there will
>> be a multi-DPU per server scenario.
>>
>>
>> Interesting... from my experience it is visa-versa:
>> in many cases 200Gb/s is not that much these days to saturate modern 2 socket
>> x86 server.
>> Though I suppose a lot depends on particular HW and actual workload.
>>
>>>
>>>>
>>>>> Furtrhermore, I think this is a tradeoff between performance and
>>>>> flexibility.
>>>>> Our goal is to achieve best performance, this means we need to give
>>>>> up some flexibility decisively. For example of 'FAST_FREE Mode', it
>>>>> deletes most of the buffer check (refcnt > 1, external buffer, chain
>>>>> buffer), chooses a shorest path, and then achieve significant
>>>>> performance
>>>> improvement.
>>>>>> Wonder did you had a chance to consider mempool-cache ZC API,
>>>>>> similar to one we have for the ring?
>>>>>> It would allow us on TX free path to avoid copying mbufs to
>>>>>> temporary array on the stack.
>>>>>> Instead we can put them straight from TX SW ring to the mempool cache.
>>>>>> That should save extra store/load for mbuf and might help to
>>>>>> achieve some performance gain without by-passing mempool.
>>>>>> It probably wouldn't be as fast as what you proposing, but might be
>>>>>> fast enough to consider as alternative.
>>>>>> Again, it would be a generic one, so we can avoid all these
>>>>>> implications and limitations.
>>>>>
>>>>> [Feifei] I think this is a good try. However, the most important
>>>>> thing is that if we can bypass the mempool decisively to pursue the
>>>>> significant performance gains.
>>>>
>>>> I understand the intention, and I personally think this is wrong and
>>>> dangerous attitude.
>>>> We have mempool abstraction in place for very good reason.
>>>> So we need to try to improve mempool performance (and API if
>>>> necessary) at first place, not to avoid it and break our own rules and
>> recommendations.
>>> The abstraction can be thought of at a higher level. i.e. the driver manages the
>> buffer allocation/free and is hidden from the application. The application does
>> not need to be aware of how these changes are implemented.
>>>
>>>>
>>>>
>>>>> For ZC, there maybe a problem for it in i40e. The reason for that
>>>>> put Tx buffers into temporary is that i40e_tx_entry includes buffer
>>>>> pointer and index.
>>>>> Thus we cannot put Tx SW_ring entry into mempool directly, we need
>>>>> to firstlt extract mbuf pointer. Finally, though we use ZC, we still
>>>>> can't avoid using a temporary stack to extract Tx buffer pointers.
>>>>
>>>> When talking about ZC API for mempool cache I meant something like:
>>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
>>>> uint32_t *nb_elem, uint32_t flags); void
>>>> mempool_cache_put_zc_finish(struct
>>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return
>>>> user a pointer inside mp-cache where to put free elems and max number
>>>> of slots that can be safely filled.
>>>> _finish_ will update mc->len.
>>>> As an example:
>>>>
>>>> /* expect to free N mbufs */
>>>> uint32_t n = N;
>>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
>>>>
>>>> /* free up to n elems */
>>>> for (i = 0; i != n; i++) {
>>>>
>>>> /* get next free mbuf from somewhere */
>>>> mb = extract_and_prefree_mbuf(...);
>>>>
>>>> /* no more free mbufs for now */
>>>> if (mb == NULL)
>>>> break;
>>>>
>>>> p[i] = mb;
>>>> }
>>>>
>>>> /* finalize ZC put, with _i_ freed elems */
>>>> mempool_cache_put_zc_finish(mc, i);
>>>>
>>>> That way, I think we can overcome the issue with i40e_tx_entry you
>>>> mentioned above. Plus it might be useful in other similar places.
>>>>
>>>> Another alternative is obviously to split i40e_tx_entry into two
>>>> structs (one for mbuf, second for its metadata) and have a separate
>>>> array for each of them.
>>>> Though with that approach we need to make sure no perf drops will be
>>>> introduced, plus probably more code changes will be required.
>>> Commit '5171b4ee6b6" already does this (in a different way), but just for
>> AVX512. Unfortunately, it does not record any performance improvements. We
>> could port this to Arm NEON and look at the performance.
>
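For reference, a rough sketch of the "split i40e_tx_entry into two structs" alternative discussed above; the names are illustrative and do not necessarily match commit 5171b4ee6b6, which reworked the AVX512 path in a different way:

#include <stdint.h>

/* Illustrative only: keep the mbuf pointers in their own parallel array so
 * the TX free path can hand a contiguous run of mbuf pointers straight to
 * the mempool (or a ZC put) without first extracting them from
 * i40e_tx_entry. */
struct i40e_tx_entry_mbuf {
	struct rte_mbuf *mbuf;
};

struct i40e_tx_entry_meta {
	uint16_t next_id; /* index of next descriptor in ring */
	uint16_t last_id; /* index of last scattered descriptor */
};

struct i40e_tx_sw_rings {
	struct i40e_tx_entry_mbuf *sw_ring_mbuf; /* parallel array #1 */
	struct i40e_tx_entry_meta *sw_ring_meta; /* parallel array #2 */
};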
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-06-03 23:32 ` Konstantin Ananyev
@ 2022-06-04 8:07 ` Morten Brørup
2022-06-29 21:58 ` Honnappa Nagarahalli
0 siblings, 1 reply; 145+ messages in thread
From: Morten Brørup @ 2022-06-04 8:07 UTC (permalink / raw)
To: Konstantin Ananyev, Honnappa Nagarahalli, Feifei Wang
Cc: nd, dev, Ruifeng Wang, honnappanagarahalli
> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> Sent: Saturday, 4 June 2022 01.32
>
> > <snip>
> >>>
> >>>>
> >>>>
> >>>> 16/05/2022 07:10, Feifei Wang wrote:
> >>>>>
> >>>>>>> Currently, the transmit side frees the buffers into the lcore
> >>>>>>> cache and the receive side allocates buffers from the lcore
> cache.
> >>>>>>> The transmit side typically frees 32 buffers resulting in
> >>>>>>> 32*8=256B of stores to lcore cache. The receive side allocates
> 32
> >>>>>>> buffers and stores them in the receive side software ring,
> >>>>>>> resulting in 32*8=256B of stores and 256B of load from the
> lcore cache.
> >>>>>>>
> >>>>>>> This patch proposes a mechanism to avoid freeing to/allocating
> >>>>>>> from the lcore cache. i.e. the receive side will free the
> buffers
> >>>>>>> from transmit side directly into it's software ring. This will
> >>>>>>> avoid the 256B of loads and stores introduced by the lcore
> cache.
> >>>>>>> It also frees up the cache lines used by the lcore cache.
> >>>>>>>
> >>>>>>> However, this solution poses several constraints:
> >>>>>>>
> >>>>>>> 1)The receive queue needs to know which transmit queue it
> should
> >>>>>>> take the buffers from. The application logic decides which
> >>>>>>> transmit port to use to send out the packets. In many use cases
> >>>>>>> the NIC might have a single port ([1], [2], [3]), in which case
> a
> >>>>>>> given transmit queue is always mapped to a single receive queue
> >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> >>>>>>>
> >>>>>>> If the NIC has 2 ports (there are several references), then we
> >>>>>>> will have
> >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> configure.
> >>>>>>> However, if this is generalized to 'N' ports, the configuration
> >>>>>>> can be long. More over the PMD would have to scan a list of
> >>>>>>> transmit queues to pull the buffers from.
> >>>>>
> >>>>>> Just to re-iterate some generic concerns about this proposal:
> >>>>>> - We effectively link RX and TX queues - when this feature
> is enabled,
> >>>>>> user can't stop TX queue without stopping linked RX queue
> first.
> >>>>>> Right now user is free to start/stop any queues at his
> will.
> >>>>>> If that feature will allow to link queues from different
> ports,
> >>>>>> then even ports will become dependent and user will have
> to pay extra
> >>>>>> care when managing such ports.
> >>>>>
> >>>>> [Feifei] When direct rearm enabled, there are two path for thread
> to
> >>>>> choose. If there are enough Tx freed buffers, Rx can put buffers
> >>>>> from Tx.
> >>>>> Otherwise, Rx will put buffers from mempool as usual. Thus, users
> do
> >>>>> not need to pay much attention managing ports.
> >>>>
> >>>> What I am talking about: right now different port or different
> queues
> >>>> of the same port can be treated as independent entities:
> >>>> in general user is free to start/stop (and even reconfigure in
> some
> >>>> cases) one entity without need to stop other entity.
> >>>> I.E user can stop and re-configure TX queue while keep receiving
> >>>> packets from RX queue.
> >>>> With direct re-arm enabled, I think it wouldn't be possible any
> more:
> >>>> before stopping/reconfiguring TX queue user would have make sure
> that
> >>>> corresponding RX queue wouldn't be used by datapath.
> >>> I am trying to understand the problem better. For the TX queue to
> be stopped,
> >> the user must have blocked the data plane from accessing the TX
> queue.
> >>
> >> Surely it is user responsibility tnot to call tx_burst() for
> stopped/released queue.
> >> The problem is that while TX for that queue is stopped, RX for
> related queue still
> >> can continue.
> >> So rx_burst() will try to read/modify TX queue data, that might be
> already freed,
> >> or simultaneously modified by control path.
> > Understood, agree on the issue
> >
> >>
> >> Again, it all can be mitigated by carefully re-designing and
> modifying control and
> >> data-path inside user app - by doing extra checks and
> synchronizations, etc.
> >> But from practical point - I presume most of users simply would
> avoid using this
> >> feature due all potential problems it might cause.
> > That is subjective, it all depends on the performance improvements
> users see in their application.
> > IMO, the performance improvement seen with this patch is worth few
> changes.
>
> Yes, it is subjective till some extent, though my feeling
> that it might end-up being sort of synthetic improvement used only
> by some show-case benchmarks.
I believe that one specific important use case has already been mentioned, so I don't think this is a benchmark-only feature.
> From my perspective, it would be much more plausible,
> if we can introduce some sort of generic improvement, that doesn't
> impose all these extra constraints and implications.
> Like one, discussed below in that thread with ZC mempool approach.
>
Considering this feature from a high-level perspective, I agree with Konstantin's concerns, so I'll also support his views.
If this patch is supposed to be a generic feature, please add support for it in all NIC PMDs, not just one. (Regardless of whether the feature is defined as 1:1 mapping or N:M mapping.) It is purely software, so it should be available for all PMDs, not just your favorite hardware! Consider the "fast mbuf free" feature, which is pure software; why is that feature not implemented in all PMDs?
A secondary point I'm making here is that this specific feature will lead to an enormous amount of copy-paste code, instead of a generic library function easily available for all PMDs.
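To make that second point concrete, a shared helper could in principle hide the buffer recycling behind one function, with each PMD only exposing its two software-ring views. A very rough sketch, purely illustrative and not an existing DPDK API:

#include <rte_mbuf.h>

/* Purely illustrative, not an existing DPDK API: move up to 'n'
 * just-transmitted mbuf pointers from a TX software ring directly into an
 * RX software ring, so the RX side can re-arm descriptors with them instead
 * of going through the mempool. The caller is assumed to have already
 * validated the DD bits, refcnt == 1, same mempool, etc. */
static inline uint16_t
eth_direct_rearm_copy(struct rte_mbuf **rx_sw_ring,
		struct rte_mbuf **tx_sw_ring, uint16_t n)
{
	uint16_t i;

	for (i = 0; i < n; i++) {
		rx_sw_ring[i] = tx_sw_ring[i];
		tx_sw_ring[i] = NULL;
	}
	return i;
}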
>
> >>
> >>> Like Feifei says, the RX side has the normal packet allocation path
> still available.
> >>> Also this sounds like a corner case to me, we can handle this
> through checks in
> >> the queue_stop API.
> >>
> >> Depends.
> >> if it would be allowed to link queues only from the same port, then
> yes, extra
> >> checks for queue-stop might be enough.
> >> As right now DPDK doesn't allow user to change number of queues
> without
> >> dev_stop() first.
> >> Though if it would be allowed to link queues from different ports,
> then situation
> >> will be much worse.
> >> Right now ports are totally independent entities (except some
> special cases like
> >> link-bonding, etc.).
> >> As one port can keep doing RX/TX, second one can be stopped, re-
> confgured,
> >> even detached, and newly attached device might re-use same port
> number.
> > I see this as a similar restriction to the one discussed above.
>
> Yes, they are similar in principal, though I think that the case
> with queues from different port would make things much more complex.
>
> > Do you see any issues if we enforce this with checks?
>
> Hard to tell straightway, a lot will depend how smart
> such implementation would be.
> Usually DPDK tends not to avoid heavy
> synchronizations within its data-path functions.
Certainly! Implementing more and more such features in the PMDs will lead to longer and longer data-plane code paths in the PMDs. It is the "salami method": each small piece makes no performance difference on its own, but they all add up, and eventually their sum does impact the performance of the general use case negatively.
>
> >>
> >>
> >>>>
> >>>>>
> >>>>>> - very limited usage scenario - it will have a positive effect
> only
> >>>>>> when we have a fixed forwarding mapping: all (or nearly
> all) packets
> >>>>>> from the RX queue are forwarded into the same TX queue.
> >>>>>
> >>>>> [Feifei] Although the usage scenario is limited, this usage
> scenario
> >>>>> has a wide range of applications, such as NIC with one port.
> >>>>
> >>>> yes, there are NICs with one port, but no guarantee there wouldn't
> be
> >>>> several such NICs within the system.
> >>> What I see in my interactions is, a single NIC/DPU is under
> utilized for a 2
> >> socket system. Some are adding more sockets to the system to better
> utilize the
> >> DPU. The NIC bandwidth continues to grow significantly. I do not
> think there will
> >> be a multi-DPU per server scenario.
> >>
> >>
> >> Interesting... from my experience it is visa-versa:
> >> in many cases 200Gb/s is not that much these days to saturate modern
> 2 socket
> >> x86 server.
> >> Though I suppose a lot depends on particular HW and actual workload.
> >>
> >>>
> >>>>
> >>>>> Furtrhermore, I think this is a tradeoff between performance and
> >>>>> flexibility.
> >>>>> Our goal is to achieve best performance, this means we need to
> give
> >>>>> up some flexibility decisively. For example of 'FAST_FREE Mode',
> it
> >>>>> deletes most of the buffer check (refcnt > 1, external buffer,
> chain
> >>>>> buffer), chooses a shorest path, and then achieve significant
> >>>>> performance
> >>>> improvement.
> >>>>>> Wonder did you had a chance to consider mempool-cache ZC API,
> >>>>>> similar to one we have for the ring?
> >>>>>> It would allow us on TX free path to avoid copying mbufs to
> >>>>>> temporary array on the stack.
> >>>>>> Instead we can put them straight from TX SW ring to the mempool
> cache.
> >>>>>> That should save extra store/load for mbuf and might help to
> >>>>>> achieve some performance gain without by-passing mempool.
> >>>>>> It probably wouldn't be as fast as what you proposing, but might
> be
> >>>>>> fast enough to consider as alternative.
> >>>>>> Again, it would be a generic one, so we can avoid all these
> >>>>>> implications and limitations.
> >>>>>
> >>>>> [Feifei] I think this is a good try. However, the most important
> >>>>> thing is that if we can bypass the mempool decisively to pursue
> the
> >>>>> significant performance gains.
> >>>>
> >>>> I understand the intention, and I personally think this is wrong
> and
> >>>> dangerous attitude.
> >>>> We have mempool abstraction in place for very good reason.
> >>>> So we need to try to improve mempool performance (and API if
> >>>> necessary) at first place, not to avoid it and break our own rules
> and
> >> recommendations.
> >>> The abstraction can be thought of at a higher level. i.e. the
> driver manages the
> >> buffer allocation/free and is hidden from the application. The
> application does
> >> not need to be aware of how these changes are implemented.
> >>>
> >>>>
> >>>>
> >>>>> For ZC, there maybe a problem for it in i40e. The reason for that
> >>>>> put Tx buffers into temporary is that i40e_tx_entry includes
> buffer
> >>>>> pointer and index.
> >>>>> Thus we cannot put Tx SW_ring entry into mempool directly, we
> need
> >>>>> to firstlt extract mbuf pointer. Finally, though we use ZC, we
> still
> >>>>> can't avoid using a temporary stack to extract Tx buffer
> pointers.
> >>>>
> >>>> When talking about ZC API for mempool cache I meant something
> like:
> >>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> >>>> uint32_t *nb_elem, uint32_t flags); void
> >>>> mempool_cache_put_zc_finish(struct
> >>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return
> >>>> user a pointer inside mp-cache where to put free elems and max
> number
> >>>> of slots that can be safely filled.
> >>>> _finish_ will update mc->len.
> >>>> As an example:
> >>>>
> >>>> /* expect to free N mbufs */
> >>>> uint32_t n = N;
> >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> >>>>
> >>>> /* free up to n elems */
> >>>> for (i = 0; i != n; i++) {
> >>>>
> >>>> /* get next free mbuf from somewhere */
> >>>> mb = extract_and_prefree_mbuf(...);
> >>>>
> >>>> /* no more free mbufs for now */
> >>>> if (mb == NULL)
> >>>> break;
> >>>>
> >>>> p[i] = mb;
> >>>> }
> >>>>
> >>>> /* finalize ZC put, with _i_ freed elems */
> >>>> mempool_cache_put_zc_finish(mc, i);
> >>>>
> >>>> That way, I think we can overcome the issue with i40e_tx_entry you
> >>>> mentioned above. Plus it might be useful in other similar places.
> >>>>
> >>>> Another alternative is obviously to split i40e_tx_entry into two
> >>>> structs (one for mbuf, second for its metadata) and have a
> separate
> >>>> array for each of them.
> >>>> Though with that approach we need to make sure no perf drops will
> be
> >>>> introduced, plus probably more code changes will be required.
> >>> Commit '5171b4ee6b6" already does this (in a different way), but
> just for
> >> AVX512. Unfortunately, it does not record any performance
> improvements. We
> >> could port this to Arm NEON and look at the performance.
> >
>
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-05-31 17:14 ` Honnappa Nagarahalli
2022-06-03 10:32 ` Andrew Rybchenko
@ 2022-06-06 11:27 ` Konstantin Ananyev
2022-06-29 21:25 ` Honnappa Nagarahalli
1 sibling, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2022-06-06 11:27 UTC (permalink / raw)
To: Honnappa Nagarahalli, dev
Cc: honnappanagarahalli, Feifei Wang, Ruifeng Wang, nd
31/05/2022 18:14, Honnappa Nagarahalli wrote:
> <snip>
>>
>> 25/05/2022 01:24, Honnappa Nagarahalli wrote:
>>> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>>>
>>> 20/04/2022 09:16, Feifei Wang wrote:
>>>>> Enable direct rearm mode. The mapping is decided in the data plane
>>>>> based on the first packet received.
>>>>>
>>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>> ---
>>>>> examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
>>>>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/examples/l3fwd/l3fwd_lpm.c b/examples/l3fwd/l3fwd_lpm.c
>>>>> index bec22c44cd..38ffdf4636 100644
>>>>> --- a/examples/l3fwd/l3fwd_lpm.c
>>>>> +++ b/examples/l3fwd/l3fwd_lpm.c
>>>>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>> unsigned lcore_id;
>>>>> uint64_t prev_tsc, diff_tsc, cur_tsc;
>>>>> int i, nb_rx;
>>>>> - uint16_t portid;
>>>>> + uint16_t portid, tx_portid;
>>>>> uint8_t queueid;
>>>>> struct lcore_conf *qconf;
>>>>> const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) /
>>>>> @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>> const uint16_t n_rx_q = qconf->n_rx_queue;
>>>>> const uint16_t n_tx_p = qconf->n_tx_port;
>>>>> + int direct_rearm_map[n_rx_q];
>>>>> +
>>>>> if (n_rx_q == 0) {
>>>>> RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n",
>>>>> lcore_id);
>>>>> return 0;
>>>>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
>>>>> portid = qconf->rx_queue_list[i].port_id;
>>>>> queueid = qconf->rx_queue_list[i].queue_id;
>>>>> + direct_rearm_map[i] = 0;
>>>>> RTE_LOG(INFO, L3FWD,
>>>>> " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
>>>>> lcore_id, portid, queueid); @@ -209,6 +212,17 @@
>>>>> lpm_main_loop(__rte_unused void *dummy)
>>>>> if (nb_rx == 0)
>>>>> continue;
>>>>> + /* Determine the direct rearm mapping based on the
>>>>> +first
>>>>> + * packet received on the rx queue
>>>>> + */
>>>>> + if (direct_rearm_map[i] == 0) {
>>>>> + tx_portid = lpm_get_dst_port(qconf, pkts_burst[0],
>>>>> + portid);
>>>>> + rte_eth_direct_rxrearm_map(portid, queueid,
>>>>> + tx_portid, queueid);
>>>>> + direct_rearm_map[i] = 1;
>>>>> + }
>>>>> +
>>>
>>>> That just doesn't look right to me: why to make decision based on the
>>>> first packet?
>>> The TX queue depends on the incoming packet. So, this method covers
>>> more scenarios than doing it in the control plane where the outgoing
>>> queue is not known.
>>>
>>>
>>>> What would happen if second and all other packets have to be routed
>>>> to different ports?
>>> This is an example application and it should be fine to make this
>>> assumption.
>>> More over, it does not cause any problems if packets change in between.
>>> When
>>> the packets change back, the feature works again.
>>>
>>>> In fact, this direct-rearm mode seems suitable only for hard-coded
>>>> one to one mapped forwarding (examples/l2fwd, testpmd).
>>>> For l3fwd it can be used safely only when we have one port in use.
>>> Can you elaborate more on the safety issue when more than one port is
>> used?
>>>
>>>> Also I think it should be selected at init-time and it shouldn't be
>>>> on by default.
>>>> To summarize, my opinion:
>>>> special cmd-line parameter to enable it.
>>> Can you please elaborate why a command line parameter is required?
>>> Other similar features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are
>>> enabled without a command line parameter. IMO, this is how it should
>>> ber. Essentially we are trying to measure how different PMDs perform,
>>> the ones that have implemented performance improvement features
>> would
>>> show better performance (i.e. the PMDs implementing the features
>>> should not be penalized by asking for additional user input).
>>
>> From my perspective, main purpose of l3fwd application is to demonstrate
>> DPDK ability to do packet routing based on input packet contents.
>> Making guesses about packet contents is a change in expected behavior.
>> For some cases it might improve performance, for many others - will most
>> likely cause performance drop.
>> I think that performance drop as default behavior (running the same
>> parameters as before) should not be allowed.
>> Plus you did not provided ability to switch off that behavior, if undesired.
> There is no drop in L3fwd performance due to this patch.
Hmm..
Are you saying that even when your guess is wrong, and you are constantly
hitting the slow path (check the tx_queue first - failure, then allocate
from the mempool), you didn't observe any performance drop?
There is more work to do, and if the workload is cpu-bound,
my guess is that it should be noticeable.
Also, from previous experience, quite often even after
tiny changes in the rx/tx code-path some slowdown was reported.
Usually that happened on some low-end Arm CPUs (Marvell, NXP).
>>
>> About comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE default
>> enablement - I don't think it is correct.
>> Within l3fwd app we can safely guarantee that all
>> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE pre-requirements are met:
>> in each TX queue all mbufs will belong to the same mempool and their refcnt
>> will always equal to one.
>> Here you are making guesses about contents of input packets, without any
>> guarantee that you guess will always be valid.
> This is not a guess. The code understands the incoming traffic and configures accordingly. So, it should be correct.
No, it is not.
Your code makes a guess about all incoming traffic
based on just one (the first) packet.
> Since it is a sample application, we do not expect the traffic to be
> complex. If it is complex, the performance will be the same as before or
> better.
I am not talking about anything 'complex' here.
Doing dynamic routing based on packet contents is
basic l3fwd functionality.
>
>>
>> BTW, what's wrong with using l2fwd to demonstrate that feature?
>> Seems like a natural choice to me.
> The performance of L3fwd application in DPDK has become a industry standard, hence we need to showcase the performance in L3fwd application.
>
>>
>>>> allowable only when we run l3fwd over one port.
>>>
>>>
>>>>> #if defined RTE_ARCH_X86 || defined __ARM_NEON \
>>>>> || defined RTE_ARCH_PPC_64
>>>>> l3fwd_lpm_send_packets(nb_rx, pkts_burst,
>
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-05-24 1:25 ` Konstantin Ananyev
2022-05-24 12:40 ` Morten Brørup
2022-05-24 20:14 ` Honnappa Nagarahalli
@ 2022-06-13 5:55 ` Feifei Wang
2 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2022-06-13 5:55 UTC (permalink / raw)
To: Konstantin Ananyev; +Cc: nd, dev, Ruifeng Wang, Honnappa Nagarahalli, nd
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Tuesday, May 24, 2022 9:26 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: nd <nd@arm.com>; dev@dpdk.org; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
>
>
> 16/05/2022 07:10, Feifei Wang wrote:
> >
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers resulting in 32*8=256B of
> >>> stores to lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of load from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache. i.e. the receive side will free the buffers from
> >>> transmit side directly into it's software ring. This will avoid the
> >>> 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
> >>>
> >>> However, this solution poses several constraints:
> >>>
> >>> 1)The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have
> >>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>> However, if this is generalized to 'N' ports, the configuration can
> >>> be long. More over the PMD would have to scan a list of transmit
> >>> queues to pull the buffers from.
> >
> >> Just to re-iterate some generic concerns about this proposal:
> >> - We effectively link RX and TX queues - when this feature is enabled,
> >> user can't stop TX queue without stopping linked RX queue first.
> >> Right now user is free to start/stop any queues at his will.
> >> If that feature will allow to link queues from different ports,
> >> then even ports will become dependent and user will have to pay extra
> >> care when managing such ports.
> >
> > [Feifei] When direct rearm enabled, there are two path for thread to
> > choose. If there are enough Tx freed buffers, Rx can put buffers from
> > Tx.
> > Otherwise, Rx will put buffers from mempool as usual. Thus, users do
> > not need to pay much attention managing ports.
>
> What I am talking about: right now different port or different queues of the
> same port can be treated as independent entities:
> in general user is free to start/stop (and even reconfigure in some
> cases) one entity without need to stop other entity.
> I.E user can stop and re-configure TX queue while keep receiving packets
> from RX queue.
> With direct re-arm enabled, I think it wouldn't be possible any more:
> before stopping/reconfiguring TX queue user would have make sure that
> corresponding RX queue wouldn't be used by datapath.
>
> >
> >> - very limited usage scenario - it will have a positive effect only
> >> when we have a fixed forwarding mapping: all (or nearly all) packets
> >> from the RX queue are forwarded into the same TX queue.
> >
> > [Feifei] Although the usage scenario is limited, this usage scenario
> > has a wide range of applications, such as NIC with one port.
>
> yes, there are NICs with one port, but no guarantee there wouldn't be several
> such NICs within the system.
>
> > Furtrhermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve best performance, this means we need to give up
> > some flexibility decisively. For example of 'FAST_FREE Mode', it
> > deletes most of the buffer check (refcnt > 1, external buffer, chain
> > buffer), chooses a shorest path, and then achieve significant performance
> improvement.
> >> Wonder did you had a chance to consider mempool-cache ZC API, similar
> >> to one we have for the ring?
> >> It would allow us on TX free path to avoid copying mbufs to temporary
> >> array on the stack.
> >> Instead we can put them straight from TX SW ring to the mempool cache.
> >> That should save extra store/load for mbuf and might help to achieve
> >> some performance gain without by-passing mempool.
> >> It probably wouldn't be as fast as what you proposing, but might be
> >> fast enough to consider as alternative.
> >> Again, it would be a generic one, so we can avoid all these
> >> implications and limitations.
> >
> > [Feifei] I think this is a good try. However, the most important thing
> > is that if we can bypass the mempool decisively to pursue the
> > significant performance gains.
>
> I understand the intention, and I personally think this is wrong and dangerous
> attitude.
> We have mempool abstraction in place for very good reason.
> So we need to try to improve mempool performance (and API if necessary) at
> first place, not to avoid it and break our own rules and recommendations.
>
>
> > For ZC, there maybe a problem for it in i40e. The reason for that put
> > Tx buffers into temporary is that i40e_tx_entry includes buffer
> > pointer and index.
> > Thus we cannot put Tx SW_ring entry into mempool directly, we need to
> > firstlt extract mbuf pointer. Finally, though we use ZC, we still
> > can't avoid using a temporary stack to extract Tx buffer pointers.
>
> When talking about ZC API for mempool cache I meant something like:
> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> uint32_t *nb_elem, uint32_t flags); void
> mempool_cache_put_zc_finish(struct rte_mempool_cache *mc, uint32_t
> nb_elem); i.e. _start_ will return user a pointer inside mp-cache where to put
> free elems and max number of slots that can be safely filled.
> _finish_ will update mc->len.
> As an example:
>
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
>
> /* free up to n elems */
> for (i = 0; i != n; i++) {
>
> /* get next free mbuf from somewhere */
> mb = extract_and_prefree_mbuf(...);
>
> /* no more free mbufs for now */
> if (mb == NULL)
> break;
>
> p[i] = mb;
> }
>
> /* finalize ZC put, with _i_ freed elems */ mempool_cache_put_zc_finish(mc,
> i);
>
> That way, I think we can overcome the issue with i40e_tx_entry you
> mentioned above. Plus it might be useful in other similar places.
>
> Another alternative is obviously to split i40e_tx_entry into two structs (one
> for mbuf, second for its metadata) and have a separate array for each of them.
> Though with that approach we need to make sure no perf drops will be
> introduced, plus probably more code changes will be required.
[Feifei] I just uploaded an RFC patch to the community:
http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wang2@arm.com/
It does not use the ZC API; following the i40e avx512 path, it puts buffers into the mempool cache directly, outside the mempool API.
You can take a look when you are free. Thanks very much.
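For readers following the thread, the "put mempool cache out of API" idea is roughly the pattern below - a simplified sketch of the approach used on the i40e avx512 TX free path, with flush handling reduced to a minimum; it assumes 'n' is a small burst (e.g. tx_rs_thresh) so the cache object array cannot overflow:

#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Simplified sketch: free a burst of TX mbufs straight into the per-lcore
 * mempool cache, bypassing rte_mempool_put_bulk(). */
static inline void
tx_free_to_cache(struct rte_mempool *mp, struct rte_mbuf **mbufs, uint32_t n)
{
	struct rte_mempool_cache *cache =
		rte_mempool_default_cache(mp, rte_lcore_id());
	uint32_t i;

	if (cache == NULL) {
		/* no per-lcore cache: fall back to the regular put path */
		rte_mempool_generic_put(mp, (void **)mbufs, n, NULL);
		return;
	}

	/* append mbuf pointers directly to the cache object array */
	for (i = 0; i < n; i++)
		cache->objs[cache->len + i] = mbufs[i];
	cache->len += n;

	/* flush the excess back to the mempool backend when over threshold */
	if (cache->len >= cache->flushthresh) {
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
				cache->len - cache->size);
		cache->len = cache->size;
	}
}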
>
>
>
>
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 5/5] examples/l3fwd: enable direct rearm mode
2022-06-06 11:27 ` Konstantin Ananyev
@ 2022-06-29 21:25 ` Honnappa Nagarahalli
0 siblings, 0 replies; 145+ messages in thread
From: Honnappa Nagarahalli @ 2022-06-29 21:25 UTC (permalink / raw)
To: Konstantin Ananyev, dev
Cc: honnappanagarahalli, Feifei Wang, Ruifeng Wang, nd, nd
(apologies for being late in replying; catching up after vacation)
<snip>
> >>
> >> 25/05/2022 01:24, Honnappa Nagarahalli wrote:
> >>> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> >>>
> >>> 20/04/2022 09:16, Feifei Wang wrote:
> >>>>> Enable direct rearm mode. The mapping is decided in the data plane
> >>>>> based on the first packet received.
> >>>>>
> >>>>> Suggested-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> >>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>>>> ---
> >>>>> examples/l3fwd/l3fwd_lpm.c | 16 +++++++++++++++-
> >>>>> 1 file changed, 15 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/examples/l3fwd/l3fwd_lpm.c
> >>>>> b/examples/l3fwd/l3fwd_lpm.c index bec22c44cd..38ffdf4636 100644
> >>>>> --- a/examples/l3fwd/l3fwd_lpm.c
> >>>>> +++ b/examples/l3fwd/l3fwd_lpm.c
> >>>>> @@ -147,7 +147,7 @@ lpm_main_loop(__rte_unused void *dummy)
> >>>>> unsigned lcore_id;
> >>>>> uint64_t prev_tsc, diff_tsc, cur_tsc;
> >>>>> int i, nb_rx;
> >>>>> - uint16_t portid;
> >>>>> + uint16_t portid, tx_portid;
> >>>>> uint8_t queueid;
> >>>>> struct lcore_conf *qconf;
> >>>>> const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S -
> >>>>> 1) / @@ -158,6 +158,8 @@ lpm_main_loop(__rte_unused void
> *dummy)
> >>>>> const uint16_t n_rx_q = qconf->n_rx_queue;
> >>>>> const uint16_t n_tx_p = qconf->n_tx_port;
> >>>>> + int direct_rearm_map[n_rx_q];
> >>>>> +
> >>>>> if (n_rx_q == 0) {
> >>>>> RTE_LOG(INFO, L3FWD, "lcore %u has nothing to do\n",
> >>>>> lcore_id);
> >>>>> return 0;
> >>>>> @@ -169,6 +171,7 @@ lpm_main_loop(__rte_unused void *dummy)
> >>>>> portid = qconf->rx_queue_list[i].port_id;
> >>>>> queueid = qconf->rx_queue_list[i].queue_id;
> >>>>> + direct_rearm_map[i] = 0;
> >>>>> RTE_LOG(INFO, L3FWD,
> >>>>> " -- lcoreid=%u portid=%u rxqueueid=%hhu\n",
> >>>>> lcore_id, portid, queueid); @@ -209,6 +212,17 @@
> >>>>> lpm_main_loop(__rte_unused void *dummy)
> >>>>> if (nb_rx == 0)
> >>>>> continue;
> >>>>> + /* Determine the direct rearm mapping based on the
> >>>>> +first
> >>>>> + * packet received on the rx queue
> >>>>> + */
> >>>>> + if (direct_rearm_map[i] == 0) {
> >>>>> + tx_portid = lpm_get_dst_port(qconf,
> >>>>> +pkts_burst[0],
> >>>>> + portid);
> >>>>> + rte_eth_direct_rxrearm_map(portid, queueid,
> >>>>> + tx_portid, queueid);
> >>>>> + direct_rearm_map[i] = 1;
> >>>>> + }
> >>>>> +
> >>>
> >>>> That just doesn't look right to me: why to make decision based on
> >>>> the first packet?
> >>> The TX queue depends on the incoming packet. So, this method covers
> >>> more scenarios than doing it in the control plane where the outgoing
> >>> queue is not known.
> >>>
> >>>
> >>>> What would happen if second and all other packets have to be routed
> >>>> to different ports?
> >>> This is an example application and it should be fine to make this
> >>> assumption.
> >>> More over, it does not cause any problems if packets change in between.
> >>> When
> >>> the packets change back, the feature works again.
> >>>
> >>>> In fact, this direct-rearm mode seems suitable only for hard-coded
> >>>> one to one mapped forwarding (examples/l2fwd, testpmd).
> >>>> For l3fwd it can be used safely only when we have one port in use.
> >>> Can you elaborate more on the safety issue when more than one port
> >>> is
> >> used?
> >>>
> >>>> Also I think it should be selected at init-time and it shouldn't be
> >>>> on by default.
> >>>> To summarize, my opinion:
> >>>> special cmd-line parameter to enable it.
> >>> Can you please elaborate why a command line parameter is required?
> >>> Other similar features like RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE are
> >>> enabled without a command line parameter. IMO, this is how it should
> >>> ber. Essentially we are trying to measure how different PMDs
> >>> perform, the ones that have implemented performance improvement
> >>> features
> >> would
> >>> show better performance (i.e. the PMDs implementing the features
> >>> should not be penalized by asking for additional user input).
> >>
> >> From my perspective, main purpose of l3fwd application is to
> >> demonstrate DPDK ability to do packet routing based on input packet
> contents.
> >> Making guesses about packet contents is a change in expected behavior.
> >> For some cases it might improve performance, for many others - will
> >> most likely cause performance drop.
> >> I think that performance drop as default behavior (running the same
> >> parameters as before) should not be allowed.
> >> Plus you did not provided ability to switch off that behavior, if undesired.
> > There is no drop in L3fwd performance due to this patch.
>
> Hmm..
> Are you saying even when your guess is wrong, and you constantly hitting slow-
> path (check tx_queue first - failure, then allocate from mempool) you didn't
> observe any performance drop?
> There is more work to do, and if workload is cpu-bound, my guess - it should be
> noticeable.
Well, you do not have to take our word for it. The platforms used for testing are mentioned in the cover letter. We would appreciate it if you could test on your platform and provide the results.
> Also, from previous experience, quite often even after tiny changes in rx/tx
> code-path some slowdown was reported.
> Usually that happened on some low-end ARM cpus (Marvell, NXP).
>
>
> >>
> >> About comparison with RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE default
> >> enablement - I don't think it is correct.
> >> Within l3fwd app we can safely guarantee that all
> >> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE pre-requirements are met:
> >> in each TX queue all mbufs will belong to the same mempool and their
> >> refcnt will always equal to one.
> >> Here you are making guesses about contents of input packets, without
> >> any guarantee that you guess will always be valid.
> > This is not a guess. The code understands the incoming traffic and configures
> accordingly. So, it should be correct.
>
>
> No, it is not.
> Your code makes guess about all incoming traffic based on just one (first)
> packet.
>
> Since it is a sample application, we do not expect the traffic to be complex. If it is
> complex, the performance will be the same as before or better.
>
> I am not talking about anything 'complex' here.
> Doing dynamic routing based on packet contents is a a basic l3fwd functionality.
>
> >
> >>
> >> BTW, what's wrong with using l2fwd to demonstrate that feature?
> >> Seems like a natural choice to me.
> > The performance of L3fwd application in DPDK has become a industry
> standard, hence we need to showcase the performance in L3fwd application.
> >
> >>
> >>>> allowable only when we run l3fwd over one port.
> >>>
> >>>
> >>>>> #if defined RTE_ARCH_X86 || defined __ARM_NEON \
> >>>>> || defined RTE_ARCH_PPC_64
> >>>>> l3fwd_lpm_send_packets(nb_rx, pkts_burst,
> >
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-06-04 8:07 ` Morten Brørup
@ 2022-06-29 21:58 ` Honnappa Nagarahalli
2022-06-30 15:21 ` Morten Brørup
0 siblings, 1 reply; 145+ messages in thread
From: Honnappa Nagarahalli @ 2022-06-29 21:58 UTC (permalink / raw)
To: Morten Brørup, Konstantin Ananyev, Feifei Wang
Cc: nd, dev, Ruifeng Wang, honnappanagarahalli, nd
<snip>
> > >>>
> > >>>>
> > >>>>
> > >>>> 16/05/2022 07:10, Feifei Wang wrote:
> > >>>>>
> > >>>>>>> Currently, the transmit side frees the buffers into the lcore
> > >>>>>>> cache and the receive side allocates buffers from the lcore
> > cache.
> > >>>>>>> The transmit side typically frees 32 buffers resulting in
> > >>>>>>> 32*8=256B of stores to lcore cache. The receive side allocates
> > 32
> > >>>>>>> buffers and stores them in the receive side software ring,
> > >>>>>>> resulting in 32*8=256B of stores and 256B of load from the
> > lcore cache.
> > >>>>>>>
> > >>>>>>> This patch proposes a mechanism to avoid freeing to/allocating
> > >>>>>>> from the lcore cache. i.e. the receive side will free the
> > buffers
> > >>>>>>> from transmit side directly into it's software ring. This will
> > >>>>>>> avoid the 256B of loads and stores introduced by the lcore
> > cache.
> > >>>>>>> It also frees up the cache lines used by the lcore cache.
> > >>>>>>>
> > >>>>>>> However, this solution poses several constraints:
> > >>>>>>>
> > >>>>>>> 1)The receive queue needs to know which transmit queue it
> > should
> > >>>>>>> take the buffers from. The application logic decides which
> > >>>>>>> transmit port to use to send out the packets. In many use
> > >>>>>>> cases the NIC might have a single port ([1], [2], [3]), in
> > >>>>>>> which case
> > a
> > >>>>>>> given transmit queue is always mapped to a single receive
> > >>>>>>> queue
> > >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> > >>>>>>>
> > >>>>>>> If the NIC has 2 ports (there are several references), then we
> > >>>>>>> will have
> > >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> > configure.
> > >>>>>>> However, if this is generalized to 'N' ports, the
> > >>>>>>> configuration can be long. More over the PMD would have to
> > >>>>>>> scan a list of transmit queues to pull the buffers from.
> > >>>>>
> > >>>>>> Just to re-iterate some generic concerns about this proposal:
> > >>>>>> - We effectively link RX and TX queues - when this feature
> > is enabled,
> > >>>>>> user can't stop TX queue without stopping linked RX queue
> > first.
> > >>>>>> Right now user is free to start/stop any queues at his
> > will.
> > >>>>>> If that feature will allow to link queues from different
> > ports,
> > >>>>>> then even ports will become dependent and user will have
> > to pay extra
> > >>>>>> care when managing such ports.
> > >>>>>
> > >>>>> [Feifei] When direct rearm enabled, there are two path for
> > >>>>> thread
> > to
> > >>>>> choose. If there are enough Tx freed buffers, Rx can put buffers
> > >>>>> from Tx.
> > >>>>> Otherwise, Rx will put buffers from mempool as usual. Thus,
> > >>>>> users
> > do
> > >>>>> not need to pay much attention managing ports.
> > >>>>
> > >>>> What I am talking about: right now different port or different
> > queues
> > >>>> of the same port can be treated as independent entities:
> > >>>> in general user is free to start/stop (and even reconfigure in
> > some
> > >>>> cases) one entity without need to stop other entity.
> > >>>> I.E user can stop and re-configure TX queue while keep receiving
> > >>>> packets from RX queue.
> > >>>> With direct re-arm enabled, I think it wouldn't be possible any
> > more:
> > >>>> before stopping/reconfiguring TX queue user would have make sure
> > that
> > >>>> corresponding RX queue wouldn't be used by datapath.
> > >>> I am trying to understand the problem better. For the TX queue to
> > be stopped,
> > >> the user must have blocked the data plane from accessing the TX
> > queue.
> > >>
> > >> Surely it is user responsibility tnot to call tx_burst() for
> > stopped/released queue.
> > >> The problem is that while TX for that queue is stopped, RX for
> > related queue still
> > >> can continue.
> > >> So rx_burst() will try to read/modify TX queue data, that might be
> > already freed,
> > >> or simultaneously modified by control path.
> > > Understood, agree on the issue
> > >
> > >>
> > >> Again, it all can be mitigated by carefully re-designing and
> > modifying control and
> > >> data-path inside user app - by doing extra checks and
> > synchronizations, etc.
> > >> But from practical point - I presume most of users simply would
> > avoid using this
> > >> feature due all potential problems it might cause.
> > > That is subjective, it all depends on the performance improvements
> > users see in their application.
> > > IMO, the performance improvement seen with this patch is worth few
> > changes.
> >
> > Yes, it is subjective till some extent, though my feeling that it
> > might end-up being sort of synthetic improvement used only by some
> > show-case benchmarks.
>
> I believe that one specific important use case has already been mentioned, so I
> don't think this is a benchmark only feature.
+1
>
> > From my perspective, it would be much more plausible, if we can
> > introduce some sort of generic improvement, that doesn't impose all
> > these extra constraints and implications.
> > Like one, discussed below in that thread with ZC mempool approach.
> >
>
> Considering this feature from a high level perspective, I agree with Konstantin's
> concerns, so I'll also support his views.
We did hack together the ZC mempool approach [1]; the level of improvement is pretty small compared with this patch.
[1] http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wang2@arm.com/
>
> If this patch is supposed to be a generic feature, please add support for it in all
> NIC PMDs, not just one. (Regardless if the feature is defined as 1:1 mapping or
> N:M mapping.) It is purely software, so it should be available for all PMDs, not
> just your favorite hardware! Consider the "fast mbuf free" feature, which is
> pure software; why is that feature not implemented in all PMDs?
Agreed, it is good to have it supported in all the drivers. We do not have a favorite hardware; we just picked a PMD we are more familiar with. We do plan to implement it in other prominent PMDs.
>
> A secondary point I'm making here is that this specific feature will lead to an
> enormous amount of copy-paste code, instead of a generic library function
> easily available for all PMDs.
Are you talking about the i40e driver code specifically? If yes, we agree we should avoid copy-paste and will look to reduce that.
>
> >
> > >>
> > >>> Like Feifei says, the RX side has the normal packet allocation
> > >>> path
> > still available.
> > >>> Also this sounds like a corner case to me, we can handle this
> > through checks in
> > >> the queue_stop API.
> > >>
> > >> Depends.
> > >> if it would be allowed to link queues only from the same port, then
> > yes, extra
> > >> checks for queue-stop might be enough.
> > >> As right now DPDK doesn't allow user to change number of queues
> > without
> > >> dev_stop() first.
> > >> Though if it would be allowed to link queues from different ports,
> > then situation
> > >> will be much worse.
> > >> Right now ports are totally independent entities (except some
> > special cases like
> > >> link-bonding, etc.).
> > >> As one port can keep doing RX/TX, second one can be stopped, re-
> > confgured,
> > >> even detached, and newly attached device might re-use same port
> > number.
> > > I see this as a similar restriction to the one discussed above.
> >
> > Yes, they are similar in principal, though I think that the case with
> > queues from different port would make things much more complex.
> >
> > > Do you see any issues if we enforce this with checks?
> >
> > Hard to tell straightway, a lot will depend how smart such
> > implementation would be.
> > Usually DPDK tends not to avoid heavy
> > synchronizations within its data-path functions.
>
> Certainly! Implementing more and more of such features in the PMDs will lead
> to longer and longer data plane code paths in the PMDs. It is the "salami
> method", where each small piece makes no performance difference, but they all
> add up, and eventually the sum of them does impact the performance of the
> general use case negatively.
It would be good to have a test running in UNH that shows the performance trend.
>
> >
> > >>
> > >>
> > >>>>
> > >>>>>
> > >>>>>> - very limited usage scenario - it will have a positive effect
> > only
> > >>>>>> when we have a fixed forwarding mapping: all (or nearly
> > all) packets
> > >>>>>> from the RX queue are forwarded into the same TX queue.
> > >>>>>
> > >>>>> [Feifei] Although the usage scenario is limited, this usage
> > scenario
> > >>>>> has a wide range of applications, such as NIC with one port.
> > >>>>
> > >>>> yes, there are NICs with one port, but no guarantee there
> > >>>> wouldn't
> > be
> > >>>> several such NICs within the system.
> > >>> What I see in my interactions is, a single NIC/DPU is under
> > utilized for a 2
> > >> socket system. Some are adding more sockets to the system to better
> > utilize the
> > >> DPU. The NIC bandwidth continues to grow significantly. I do not
> > think there will
> > >> be a multi-DPU per server scenario.
> > >>
> > >>
> > >> Interesting... from my experience it is visa-versa:
> > >> in many cases 200Gb/s is not that much these days to saturate
> > >> modern
> > 2 socket
> > >> x86 server.
> > >> Though I suppose a lot depends on particular HW and actual workload.
> > >>
> > >>>
> > >>>>
> > >>>>> Furtrhermore, I think this is a tradeoff between performance and
> > >>>>> flexibility.
> > >>>>> Our goal is to achieve best performance, this means we need to
> > give
> > >>>>> up some flexibility decisively. For example of 'FAST_FREE Mode',
> > it
> > >>>>> deletes most of the buffer check (refcnt > 1, external buffer,
> > chain
> > >>>>> buffer), chooses a shorest path, and then achieve significant
> > >>>>> performance
> > >>>> improvement.
> > >>>>>> Wonder did you had a chance to consider mempool-cache ZC API,
> > >>>>>> similar to one we have for the ring?
> > >>>>>> It would allow us on TX free path to avoid copying mbufs to
> > >>>>>> temporary array on the stack.
> > >>>>>> Instead we can put them straight from TX SW ring to the mempool
> > cache.
> > >>>>>> That should save extra store/load for mbuf and might help to
> > >>>>>> achieve some performance gain without by-passing mempool.
> > >>>>>> It probably wouldn't be as fast as what you proposing, but
> > >>>>>> might
> > be
> > >>>>>> fast enough to consider as alternative.
> > >>>>>> Again, it would be a generic one, so we can avoid all these
> > >>>>>> implications and limitations.
> > >>>>>
> > >>>>> [Feifei] I think this is a good try. However, the most important
> > >>>>> thing is that if we can bypass the mempool decisively to pursue
> > the
> > >>>>> significant performance gains.
> > >>>>
> > >>>> I understand the intention, and I personally think this is wrong
> > and
> > >>>> dangerous attitude.
> > >>>> We have mempool abstraction in place for very good reason.
> > >>>> So we need to try to improve mempool performance (and API if
> > >>>> necessary) at first place, not to avoid it and break our own
> > >>>> rules
> > and
> > >> recommendations.
> > >>> The abstraction can be thought of at a higher level. i.e. the
> > driver manages the
> > >> buffer allocation/free and is hidden from the application. The
> > application does
> > >> not need to be aware of how these changes are implemented.
> > >>>
> > >>>>
> > >>>>
> > >>>>> For ZC, there maybe a problem for it in i40e. The reason for
> > >>>>> that put Tx buffers into temporary is that i40e_tx_entry
> > >>>>> includes
> > buffer
> > >>>>> pointer and index.
> > >>>>> Thus we cannot put Tx SW_ring entry into mempool directly, we
> > need
> > >>>>> to firstlt extract mbuf pointer. Finally, though we use ZC, we
> > still
> > >>>>> can't avoid using a temporary stack to extract Tx buffer
> > pointers.
> > >>>>
> > >>>> When talking about ZC API for mempool cache I meant something
> > like:
> > >>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> > >>>> uint32_t *nb_elem, uint32_t flags); void
> > >>>> mempool_cache_put_zc_finish(struct
> > >>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will
> > >>>> return user a pointer inside mp-cache where to put free elems and
> > >>>> max
> > number
> > >>>> of slots that can be safely filled.
> > >>>> _finish_ will update mc->len.
> > >>>> As an example:
> > >>>>
> > >>>> /* expect to free N mbufs */
> > >>>> uint32_t n = N;
> > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > >>>>
> > >>>> /* free up to n elems */
> > >>>> for (i = 0; i != n; i++) {
> > >>>>
> > >>>> /* get next free mbuf from somewhere */
> > >>>> mb = extract_and_prefree_mbuf(...);
> > >>>>
> > >>>> /* no more free mbufs for now */
> > >>>> if (mb == NULL)
> > >>>> break;
> > >>>>
> > >>>> p[i] = mb;
> > >>>> }
> > >>>>
> > >>>> /* finalize ZC put, with _i_ freed elems */
> > >>>> mempool_cache_put_zc_finish(mc, i);
> > >>>>
> > >>>> That way, I think we can overcome the issue with i40e_tx_entry
> > >>>> you mentioned above. Plus it might be useful in other similar places.
> > >>>>
> > >>>> Another alternative is obviously to split i40e_tx_entry into two
> > >>>> structs (one for mbuf, second for its metadata) and have a
> > separate
> > >>>> array for each of them.
> > >>>> Though with that approach we need to make sure no perf drops will
> > be
> > >>>> introduced, plus probably more code changes will be required.
> > >>> Commit '5171b4ee6b6" already does this (in a different way), but
> > just for
> > >> AVX512. Unfortunately, it does not record any performance
> > improvements. We
> > >> could port this to Arm NEON and look at the performance.
> > >
> >
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-06-29 21:58 ` Honnappa Nagarahalli
@ 2022-06-30 15:21 ` Morten Brørup
2022-07-01 19:30 ` Honnappa Nagarahalli
0 siblings, 1 reply; 145+ messages in thread
From: Morten Brørup @ 2022-06-30 15:21 UTC (permalink / raw)
To: Honnappa Nagarahalli, Konstantin Ananyev, Feifei Wang
Cc: nd, dev, Ruifeng Wang, honnappanagarahalli, nd
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 29 June 2022 23.58
>
> <snip>
>
> > > >>>>
> > > >>>> 16/05/2022 07:10, Feifei Wang wrote:
> > > >>>>>
> > > >>>>>>> Currently, the transmit side frees the buffers into the
> lcore
> > > >>>>>>> cache and the receive side allocates buffers from the lcore
> > > cache.
> > > >>>>>>> The transmit side typically frees 32 buffers resulting in
> > > >>>>>>> 32*8=256B of stores to lcore cache. The receive side
> allocates
> > > 32
> > > >>>>>>> buffers and stores them in the receive side software ring,
> > > >>>>>>> resulting in 32*8=256B of stores and 256B of load from the
> > > lcore cache.
> > > >>>>>>>
> > > >>>>>>> This patch proposes a mechanism to avoid freeing
> to/allocating
> > > >>>>>>> from the lcore cache. i.e. the receive side will free the
> > > buffers
> > > >>>>>>> from transmit side directly into it's software ring. This
> will
> > > >>>>>>> avoid the 256B of loads and stores introduced by the lcore
> > > cache.
> > > >>>>>>> It also frees up the cache lines used by the lcore cache.
> > > >>>>>>>
> > > >>>>>>> However, this solution poses several constraints:
> > > >>>>>>>
> > > >>>>>>> 1)The receive queue needs to know which transmit queue it
> > > should
> > > >>>>>>> take the buffers from. The application logic decides which
> > > >>>>>>> transmit port to use to send out the packets. In many use
> > > >>>>>>> cases the NIC might have a single port ([1], [2], [3]), in
> > > >>>>>>> which case
> > > a
> > > >>>>>>> given transmit queue is always mapped to a single receive
> > > >>>>>>> queue
> > > >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> > > >>>>>>>
> > > >>>>>>> If the NIC has 2 ports (there are several references), then
> we
> > > >>>>>>> will have
> > > >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> > > configure.
> > > >>>>>>> However, if this is generalized to 'N' ports, the
> > > >>>>>>> configuration can be long. More over the PMD would have to
> > > >>>>>>> scan a list of transmit queues to pull the buffers from.
> > > >>>>>
> > > >>>>>> Just to re-iterate some generic concerns about this
> proposal:
> > > >>>>>> - We effectively link RX and TX queues - when this
> feature
> > > is enabled,
> > > >>>>>> user can't stop TX queue without stopping linked RX
> queue
> > > first.
> > > >>>>>> Right now user is free to start/stop any queues at his
> > > will.
> > > >>>>>> If that feature will allow to link queues from
> different
> > > ports,
> > > >>>>>> then even ports will become dependent and user will
> have
> > > to pay extra
> > > >>>>>> care when managing such ports.
> > > >>>>>
> > > >>>>> [Feifei] When direct rearm enabled, there are two path for
> > > >>>>> thread
> > > to
> > > >>>>> choose. If there are enough Tx freed buffers, Rx can put
> buffers
> > > >>>>> from Tx.
> > > >>>>> Otherwise, Rx will put buffers from mempool as usual. Thus,
> > > >>>>> users
> > > do
> > > >>>>> not need to pay much attention managing ports.
> > > >>>>
> > > >>>> What I am talking about: right now different port or different
> > > queues
> > > >>>> of the same port can be treated as independent entities:
> > > >>>> in general user is free to start/stop (and even reconfigure in
> > > some
> > > >>>> cases) one entity without need to stop other entity.
> > > >>>> I.E user can stop and re-configure TX queue while keep
> receiving
> > > >>>> packets from RX queue.
> > > >>>> With direct re-arm enabled, I think it wouldn't be possible
> any
> > > more:
> > > >>>> before stopping/reconfiguring TX queue user would have make
> sure
> > > that
> > > >>>> corresponding RX queue wouldn't be used by datapath.
> > > >>> I am trying to understand the problem better. For the TX queue
> to
> > > be stopped,
> > > >> the user must have blocked the data plane from accessing the TX
> > > queue.
> > > >>
> > > >> Surely it is user responsibility not to call tx_burst() for
> > > stopped/released queue.
> > > >> The problem is that while TX for that queue is stopped, RX for
> > > related queue still
> > > >> can continue.
> > > >> So rx_burst() will try to read/modify TX queue data, that might
> be
> > > already freed,
> > > >> or simultaneously modified by control path.
> > > > Understood, agree on the issue
> > > >
> > > >>
> > > >> Again, it all can be mitigated by carefully re-designing and
> > > modifying control and
> > > >> data-path inside user app - by doing extra checks and
> > > synchronizations, etc.
> > > >> But from practical point - I presume most of users simply would
> > > avoid using this
> > > >> feature due all potential problems it might cause.
> > > > That is subjective, it all depends on the performance
> improvements
> > > users see in their application.
> > > > IMO, the performance improvement seen with this patch is worth
> few
> > > changes.
> > >
> > > Yes, it is subjective till some extent, though my feeling that it
> > > might end-up being sort of synthetic improvement used only by some
> > > show-case benchmarks.
> >
> > I believe that one specific important use case has already been
> mentioned, so I
> > don't think this is a benchmark only feature.
> +1
>
> >
> > > From my perspective, it would be much more plausible, if we can
> > > introduce some sort of generic improvement, that doesn't impose all
> > > these extra constraints and implications.
> > > Like one, discussed below in that thread with ZC mempool approach.
> > >
> >
> > Considering this feature from a high level perspective, I agree with
> Konstantin's
> > concerns, so I'll also support his views.
> We did hack the ZC mempool approach [1], level of improvement is pretty
> small compared with this patch.
>
> [1] http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-
> 1-feifei.wang2@arm.com/
>
> >
> > If this patch is supposed to be a generic feature, please add support
> for it in all
> > NIC PMDs, not just one. (Regardless if the feature is defined as 1:1
> mapping or
> > N:M mapping.) It is purely software, so it should be available for
> all PMDs, not
> > just your favorite hardware! Consider the "fast mbuf free" feature,
> which is
> > pure software; why is that feature not implemented in all PMDs?
> Agree, it is good to have it supported in all the drivers. We do not
> have a favorite hardware, just picked a PMD which we are more familiar
> with. We do plan to implement in other prominent PMDs.
>
> >
> > A secondary point I'm making here is that this specific feature will
> lead to an
> > enormous amount of copy-paste code, instead of a generic library
> function
> > easily available for all PMDs.
> Are you talking about the i40e driver code in specific? If yes, agree
> we should avoid copy-paste and we will look to reduce that.
Yes, I am talking about the code that needs to be copied into all prominent PMDs. Perhaps you can move the majority of it into a common directory, if not in a generic library, so the modification per PMD becomes smaller. (I see the same copy-paste issue with the "fast mbuf free" feature, if to be supported by other than the i40e PMD.)
Please note that I do not expect you to implement this feature in other PMDs than the ones you need. I was trying to make the point that implementing a software feature in a PMD requires copy-pasting it to other PMDs, which can require a big effort, while implementing it in a library and calling the library from the PMDs requires a smaller effort per PMD. I intentionally phrased it somewhat provokingly, and was lucky not to offend anyone. :-)
>
>
> >
> > >
> > > >>
> > > >>> Like Feifei says, the RX side has the normal packet allocation
> > > >>> path
> > > still available.
> > > >>> Also this sounds like a corner case to me, we can handle this
> > > through checks in
> > > >> the queue_stop API.
> > > >>
> > > >> Depends.
> > > >> if it would be allowed to link queues only from the same port,
> then
> > > yes, extra
> > > >> checks for queue-stop might be enough.
> > > >> As right now DPDK doesn't allow user to change number of queues
> > > without
> > > >> dev_stop() first.
> > > >> Though if it would be allowed to link queues from different
> ports,
> > > then situation
> > > >> will be much worse.
> > > >> Right now ports are totally independent entities (except some
> > > special cases like
> > > >> link-bonding, etc.).
> > > >> As one port can keep doing RX/TX, second one can be stopped, re-
> > > configured,
> > > >> even detached, and newly attached device might re-use same port
> > > number.
> > > > I see this as a similar restriction to the one discussed above.
> > >
> > > Yes, they are similar in principle, though I think that the case
> with
> > > queues from different port would make things much more complex.
> > >
> > > > Do you see any issues if we enforce this with checks?
> > >
> > > Hard to tell straightway, a lot will depend how smart such
> > > implementation would be.
> > > Usually DPDK tends not to avoid heavy
> > > synchronizations within its data-path functions.
> >
> > Certainly! Implementing more and more of such features in the PMDs
> will lead
> > to longer and longer data plane code paths in the PMDs. It is the
> "salami
> > method", where each small piece makes no performance difference, but
> they all
> > add up, and eventually the sum of them does impact the performance of
> the
> > general use case negatively.
> It would be good to have a test running in UNH that shows the
> performance trend.
+1
>
> >
> > >
> > > >>
> > > >>
> > > >>>>
> > > >>>>>
> > > >>>>>> - very limited usage scenario - it will have a positive
> effect
> > > only
> > > >>>>>> when we have a fixed forwarding mapping: all (or nearly
> > > all) packets
> > > >>>>>> from the RX queue are forwarded into the same TX queue.
> > > >>>>>
> > > >>>>> [Feifei] Although the usage scenario is limited, this usage
> > > scenario
> > > >>>>> has a wide range of applications, such as NIC with one port.
> > > >>>>
> > > >>>> yes, there are NICs with one port, but no guarantee there
> > > >>>> wouldn't
> > > be
> > > >>>> several such NICs within the system.
> > > >>> What I see in my interactions is, a single NIC/DPU is under
> > > utilized for a 2
> > > >> socket system. Some are adding more sockets to the system to
> better
> > > utilize the
> > > >> DPU. The NIC bandwidth continues to grow significantly. I do not
> > > think there will
> > > >> be a multi-DPU per server scenario.
> > > >>
> > > >>
> > > >> Interesting... from my experience it is vice versa:
> > > >> in many cases 200Gb/s is not that much these days to saturate
> > > >> modern
> > > 2 socket
> > > >> x86 server.
> > > >> Though I suppose a lot depends on particular HW and actual
> workload.
> > > >>
> > > >>>
> > > >>>>
> > > >>>>> Furthermore, I think this is a tradeoff between performance
> and
> > > >>>>> flexibility.
> > > >>>>> Our goal is to achieve best performance, this means we need
> to
> > > give
> > > >>>>> up some flexibility decisively. For example of 'FAST_FREE
> Mode',
> > > it
> > > >>>>> deletes most of the buffer check (refcnt > 1, external
> buffer,
> > > chain
> > > >>>>> buffer), chooses the shortest path, and then achieves significant
> > > >>>>> performance
> > > >>>> improvement.
> > > >>>>>> Wonder did you had a chance to consider mempool-cache ZC
> API,
> > > >>>>>> similar to one we have for the ring?
> > > >>>>>> It would allow us on TX free path to avoid copying mbufs to
> > > >>>>>> temporary array on the stack.
> > > >>>>>> Instead we can put them straight from TX SW ring to the
> mempool
> > > cache.
> > > >>>>>> That should save extra store/load for mbuf and might help to
> > > >>>>>> achieve some performance gain without by-passing mempool.
> > > >>>>>> It probably wouldn't be as fast as what you proposing, but
> > > >>>>>> might
> > > be
> > > >>>>>> fast enough to consider as alternative.
> > > >>>>>> Again, it would be a generic one, so we can avoid all these
> > > >>>>>> implications and limitations.
> > > >>>>>
> > > >>>>> [Feifei] I think this is a good try. However, the most
> important
> > > >>>>> thing is that if we can bypass the mempool decisively to
> pursue
> > > the
> > > >>>>> significant performance gains.
> > > >>>>
> > > >>>> I understand the intention, and I personally think this is
> wrong
> > > and
> > > >>>> dangerous attitude.
> > > >>>> We have mempool abstraction in place for very good reason.
> > > >>>> So we need to try to improve mempool performance (and API if
> > > >>>> necessary) at first place, not to avoid it and break our own
> > > >>>> rules
> > > and
> > > >> recommendations.
> > > >>> The abstraction can be thought of at a higher level. i.e. the
> > > driver manages the
> > > >> buffer allocation/free and is hidden from the application. The
> > > application does
> > > >> not need to be aware of how these changes are implemented.
> > > >>>
> > > >>>>
> > > >>>>
> > > >>>>> For ZC, there maybe a problem for it in i40e. The reason for
> > > >>>>> that put Tx buffers into temporary is that i40e_tx_entry
> > > >>>>> includes
> > > buffer
> > > >>>>> pointer and index.
> > > >>>>> Thus we cannot put Tx SW_ring entry into mempool directly, we
> > > need
> > > >>>>> to first extract the mbuf pointer. Finally, though we use ZC, we
> we
> > > still
> > > >>>>> can't avoid using a temporary stack to extract Tx buffer
> > > pointers.
> > > >>>>
> > > >>>> When talking about ZC API for mempool cache I meant something
> > > like:
> > > >>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache
> *mc,
> > > >>>> uint32_t *nb_elem, uint32_t flags); void
> > > >>>> mempool_cache_put_zc_finish(struct
> > > >>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will
> > > >>>> return user a pointer inside mp-cache where to put free elems
> and
> > > >>>> max
> > > number
> > > >>>> of slots that can be safely filled.
> > > >>>> _finish_ will update mc->len.
> > > >>>> As an example:
> > > >>>>
> > > >>>> /* expect to free N mbufs */
> > > >>>> uint32_t n = N;
> > > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > > >>>>
> > > >>>> /* free up to n elems */
> > > >>>> for (i = 0; i != n; i++) {
> > > >>>>
> > > >>>> /* get next free mbuf from somewhere */
> > > >>>> mb = extract_and_prefree_mbuf(...);
> > > >>>>
> > > >>>> /* no more free mbufs for now */
> > > >>>> if (mb == NULL)
> > > >>>> break;
> > > >>>>
> > > >>>> p[i] = mb;
> > > >>>> }
> > > >>>>
> > > >>>> /* finalize ZC put, with _i_ freed elems */
> > > >>>> mempool_cache_put_zc_finish(mc, i);
> > > >>>>
> > > >>>> That way, I think we can overcome the issue with i40e_tx_entry
> > > >>>> you mentioned above. Plus it might be useful in other similar
> places.
> > > >>>>
> > > >>>> Another alternative is obviously to split i40e_tx_entry into
> two
> > > >>>> structs (one for mbuf, second for its metadata) and have a
> > > separate
> > > >>>> array for each of them.
> > > >>>> Though with that approach we need to make sure no perf drops
> will
> > > be
> > > >>>> introduced, plus probably more code changes will be required.
> > > >>> Commit '5171b4ee6b6" already does this (in a different way),
> but
> > > just for
> > > >> AVX512. Unfortunately, it does not record any performance
> > > improvements. We
> > > >> could port this to Arm NEON and look at the performance.
> > > >
> > >
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-06-30 15:21 ` Morten Brørup
@ 2022-07-01 19:30 ` Honnappa Nagarahalli
2022-07-01 20:28 ` Morten Brørup
0 siblings, 1 reply; 145+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-01 19:30 UTC (permalink / raw)
To: Morten Brørup, Konstantin Ananyev, Feifei Wang
Cc: nd, dev, Ruifeng Wang, honnappanagarahalli, nd
<snip>
> > > > >>>>
> > > > >>>> 16/05/2022 07:10, Feifei Wang пишет:
> > > > >>>>>
> > > > >>>>>>> Currently, the transmit side frees the buffers into the
> > lcore
> > > > >>>>>>> cache and the receive side allocates buffers from the
> > > > >>>>>>> lcore
> > > > cache.
> > > > >>>>>>> The transmit side typically frees 32 buffers resulting in
> > > > >>>>>>> 32*8=256B of stores to lcore cache. The receive side
> > allocates
> > > > 32
> > > > >>>>>>> buffers and stores them in the receive side software ring,
> > > > >>>>>>> resulting in 32*8=256B of stores and 256B of load from the
> > > > lcore cache.
> > > > >>>>>>>
> > > > >>>>>>> This patch proposes a mechanism to avoid freeing
> > to/allocating
> > > > >>>>>>> from the lcore cache. i.e. the receive side will free the
> > > > buffers
> > > > >>>>>>> from transmit side directly into it's software ring. This
> > will
> > > > >>>>>>> avoid the 256B of loads and stores introduced by the lcore
> > > > cache.
> > > > >>>>>>> It also frees up the cache lines used by the lcore cache.
> > > > >>>>>>>
> > > > >>>>>>> However, this solution poses several constraints:
> > > > >>>>>>>
> > > > >>>>>>> 1)The receive queue needs to know which transmit queue it
> > > > should
> > > > >>>>>>> take the buffers from. The application logic decides which
> > > > >>>>>>> transmit port to use to send out the packets. In many use
> > > > >>>>>>> cases the NIC might have a single port ([1], [2], [3]), in
> > > > >>>>>>> which case
> > > > a
> > > > >>>>>>> given transmit queue is always mapped to a single receive
> > > > >>>>>>> queue
> > > > >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> > > > >>>>>>>
> > > > >>>>>>> If the NIC has 2 ports (there are several references),
> > > > >>>>>>> then
> > we
> > > > >>>>>>> will have
> > > > >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> > > > configure.
> > > > >>>>>>> However, if this is generalized to 'N' ports, the
> > > > >>>>>>> configuration can be long. More over the PMD would have to
> > > > >>>>>>> scan a list of transmit queues to pull the buffers from.
> > > > >>>>>
> > > > >>>>>> Just to re-iterate some generic concerns about this
> > proposal:
> > > > >>>>>> - We effectively link RX and TX queues - when this
> > feature
> > > > is enabled,
> > > > >>>>>> user can't stop TX queue without stopping linked RX
> > queue
> > > > first.
> > > > >>>>>> Right now user is free to start/stop any queues at
> > > > >>>>>> his
> > > > will.
> > > > >>>>>> If that feature will allow to link queues from
> > different
> > > > ports,
> > > > >>>>>> then even ports will become dependent and user will
> > have
> > > > to pay extra
> > > > >>>>>> care when managing such ports.
> > > > >>>>>
> > > > >>>>> [Feifei] When direct rearm enabled, there are two path for
> > > > >>>>> thread
> > > > to
> > > > >>>>> choose. If there are enough Tx freed buffers, Rx can put
> > buffers
> > > > >>>>> from Tx.
> > > > >>>>> Otherwise, Rx will put buffers from mempool as usual. Thus,
> > > > >>>>> users
> > > > do
> > > > >>>>> not need to pay much attention managing ports.
> > > > >>>>
> > > > >>>> What I am talking about: right now different port or
> > > > >>>> different
> > > > queues
> > > > >>>> of the same port can be treated as independent entities:
> > > > >>>> in general user is free to start/stop (and even reconfigure
> > > > >>>> in
> > > > some
> > > > >>>> cases) one entity without need to stop other entity.
> > > > >>>> I.E user can stop and re-configure TX queue while keep
> > receiving
> > > > >>>> packets from RX queue.
> > > > >>>> With direct re-arm enabled, I think it wouldn't be possible
> > any
> > > > more:
> > > > >>>> before stopping/reconfiguring TX queue user would have make
> > sure
> > > > that
> > > > >>>> corresponding RX queue wouldn't be used by datapath.
> > > > >>> I am trying to understand the problem better. For the TX queue
> > to
> > > > be stopped,
> > > > >> the user must have blocked the data plane from accessing the TX
> > > > queue.
> > > > >>
> > > > >> Surely it is user responsibility not to call tx_burst() for
> > > > stopped/released queue.
> > > > >> The problem is that while TX for that queue is stopped, RX for
> > > > related queue still
> > > > >> can continue.
> > > > >> So rx_burst() will try to read/modify TX queue data, that might
> > be
> > > > already freed,
> > > > >> or simultaneously modified by control path.
> > > > > Understood, agree on the issue
> > > > >
> > > > >>
> > > > >> Again, it all can be mitigated by carefully re-designing and
> > > > modifying control and
> > > > >> data-path inside user app - by doing extra checks and
> > > > synchronizations, etc.
> > > > >> But from practical point - I presume most of users simply would
> > > > avoid using this
> > > > >> feature due all potential problems it might cause.
> > > > > That is subjective, it all depends on the performance
> > improvements
> > > > users see in their application.
> > > > > IMO, the performance improvement seen with this patch is worth
> > few
> > > > changes.
> > > >
> > > > Yes, it is subjective till some extent, though my feeling that it
> > > > might end-up being sort of synthetic improvement used only by some
> > > > show-case benchmarks.
> > >
> > > I believe that one specific important use case has already been
> > mentioned, so I
> > > don't think this is a benchmark only feature.
> > +1
> >
> > >
> > > > From my perspective, it would be much more plausible, if we can
> > > > introduce some sort of generic improvement, that doesn't impose
> > > > all these extra constraints and implications.
> > > > Like one, discussed below in that thread with ZC mempool approach.
> > > >
> > >
> > > Considering this feature from a high level perspective, I agree with
> > Konstantin's
> > > concerns, so I'll also support his views.
> > We did hack the ZC mempool approach [1], level of improvement is
> > pretty small compared with this patch.
> >
> > [1] http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-
> > 1-feifei.wang2@arm.com/
> >
> > >
> > > If this patch is supposed to be a generic feature, please add
> > > support
> > for it in all
> > > NIC PMDs, not just one. (Regardless if the feature is defined as 1:1
> > mapping or
> > > N:M mapping.) It is purely software, so it should be available for
> > all PMDs, not
> > > just your favorite hardware! Consider the "fast mbuf free" feature,
> > which is
> > > pure software; why is that feature not implemented in all PMDs?
> > Agree, it is good to have it supported in all the drivers. We do not
> > have a favorite hardware, just picked a PMD which we are more familiar
> > with. We do plan to implement in other prominent PMDs.
> >
> > >
> > > A secondary point I'm making here is that this specific feature will
> > lead to an
> > > enormous amount of copy-paste code, instead of a generic library
> > function
> > > easily available for all PMDs.
> > Are you talking about the i40e driver code in specific? If yes, agree
> > we should avoid copy-paste and we will look to reduce that.
>
> Yes, I am talking about the code that needs to be copied into all prominent
> PMDs. Perhaps you can move the majority of it into a common directory, if not
> in a generic library, so the modification per PMD becomes smaller. (I see the
> same copy-paste issue with the "fast mbuf free" feature, if to be supported by
> other than the i40e PMD.)
The current abstraction does not allow for common code at this (lower) level across all the PMDs. If we look at "fast free", it is accessing the device private structure for the list of buffers to free. If it needs to be common code, this needs to be lifted up along with other dependent configuration thresholds etc.
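For illustration, the fast free path looks roughly like the sketch below (heavily simplified, with names following the i40e driver; treat it as an illustration, not the exact driver code):

/* Simplified sketch of an i40e-style fast-free loop. sw_ring, tx_next_dd
 * and i40e_tx_entry are all PMD-private, which is why this loop cannot be
 * shared across drivers as-is. */
static inline void
tx_fast_free_sketch(struct i40e_tx_queue *txq, uint16_t n)
{
    struct i40e_tx_entry *txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
    void *free[64]; /* assumes n <= 64 */
    uint16_t i;

    for (i = 0; i < n; i++) {
        free[i] = txep[i].mbuf;
        txep[i].mbuf = NULL;
    }
    /* Fast-free assumptions: refcnt == 1, no external buffer, and all
     * mbufs come from one mempool, so they can be returned in bulk. */
    rte_mempool_put_bulk(((struct rte_mbuf *)free[0])->pool, free, n);
}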
>
> Please note that I do not expect you to implement this feature in other PMDs
> than you need. I was trying to make the point that implementing a software
> feature in a PMD requires copy-pasting to other PMDs, which can require a big
> effort; while implementing it in a library and calling the library from the PMDs
> require a smaller effort per PMD. I intentionally phrased it somewhat
> provokingly, and was lucky not to offend anyone. :-)
>
> >
> >
> > >
> > > >
> > > > >>
> > > > >>> Like Feifei says, the RX side has the normal packet allocation
> > > > >>> path
> > > > still available.
> > > > >>> Also this sounds like a corner case to me, we can handle this
> > > > through checks in
> > > > >> the queue_stop API.
> > > > >>
> > > > >> Depends.
> > > > >> if it would be allowed to link queues only from the same port,
> > then
> > > > yes, extra
> > > > >> checks for queue-stop might be enough.
> > > > >> As right now DPDK doesn't allow user to change number of queues
> > > > without
> > > > >> dev_stop() first.
> > > > >> Though if it would be allowed to link queues from different
> > ports,
> > > > then situation
> > > > >> will be much worse.
> > > > >> Right now ports are totally independent entities (except some
> > > > special cases like
> > > > >> link-bonding, etc.).
> > > > >> As one port can keep doing RX/TX, second one can be stopped,
> > > > >> re-
> > > > configured,
> > > > >> even detached, and newly attached device might re-use same port
> > > > number.
> > > > > I see this as a similar restriction to the one discussed above.
> > > >
> > > > Yes, they are similar in principle, though I think that the case
> > with
> > > > queues from different port would make things much more complex.
> > > >
> > > > > Do you see any issues if we enforce this with checks?
> > > >
> > > > Hard to tell straightway, a lot will depend how smart such
> > > > implementation would be.
> > > > Usually DPDK tends not to avoid heavy synchronizations within its
> > > > data-path functions.
> > >
> > > Certainly! Implementing more and more of such features in the PMDs
> > will lead
> > > to longer and longer data plane code paths in the PMDs. It is the
> > "salami
> > > method", where each small piece makes no performance difference, but
> > they all
> > > add up, and eventually the sum of them does impact the performance
> > > of
> > the
> > > general use case negatively.
> > It would be good to have a test running in UNH that shows the
> > performance trend.
> +1
>
> >
> > >
> > > >
> > > > >>
> > > > >>
> > > > >>>>
> > > > >>>>>
> > > > >>>>>> - very limited usage scenario - it will have a positive
> > effect
> > > > only
> > > > >>>>>> when we have a fixed forwarding mapping: all (or
> > > > >>>>>> nearly
> > > > all) packets
> > > > >>>>>> from the RX queue are forwarded into the same TX queue.
> > > > >>>>>
> > > > >>>>> [Feifei] Although the usage scenario is limited, this usage
> > > > scenario
> > > > >>>>> has a wide range of applications, such as NIC with one port.
> > > > >>>>
> > > > >>>> yes, there are NICs with one port, but no guarantee there
> > > > >>>> wouldn't
> > > > be
> > > > >>>> several such NICs within the system.
> > > > >>> What I see in my interactions is, a single NIC/DPU is under
> > > > utilized for a 2
> > > > >> socket system. Some are adding more sockets to the system to
> > better
> > > > utilize the
> > > > >> DPU. The NIC bandwidth continues to grow significantly. I do
> > > > >> not
> > > > think there will
> > > > >> be a multi-DPU per server scenario.
> > > > >>
> > > > >>
> > > > >> Interesting... from my experience it is vice versa:
> > > > >> in many cases 200Gb/s is not that much these days to saturate
> > > > >> modern
> > > > 2 socket
> > > > >> x86 server.
> > > > >> Though I suppose a lot depends on particular HW and actual
> > workload.
> > > > >>
> > > > >>>
> > > > >>>>
> > > > >>>>> Furthermore, I think this is a tradeoff between performance
> > and
> > > > >>>>> flexibility.
> > > > >>>>> Our goal is to achieve best performance, this means we need
> > to
> > > > give
> > > > >>>>> up some flexibility decisively. For example of 'FAST_FREE
> > Mode',
> > > > it
> > > > >>>>> deletes most of the buffer check (refcnt > 1, external
> > buffer,
> > > > chain
> > > > >>>>> buffer), chooses the shortest path, and then achieves
> > > > >>>>> significant performance
> > > > >>>> improvement.
> > > > >>>>>> Wonder did you had a chance to consider mempool-cache ZC
> > API,
> > > > >>>>>> similar to one we have for the ring?
> > > > >>>>>> It would allow us on TX free path to avoid copying mbufs to
> > > > >>>>>> temporary array on the stack.
> > > > >>>>>> Instead we can put them straight from TX SW ring to the
> > mempool
> > > > cache.
> > > > >>>>>> That should save extra store/load for mbuf and might help
> > > > >>>>>> to achieve some performance gain without by-passing mempool.
> > > > >>>>>> It probably wouldn't be as fast as what you proposing, but
> > > > >>>>>> might
> > > > be
> > > > >>>>>> fast enough to consider as alternative.
> > > > >>>>>> Again, it would be a generic one, so we can avoid all these
> > > > >>>>>> implications and limitations.
> > > > >>>>>
> > > > >>>>> [Feifei] I think this is a good try. However, the most
> > important
> > > > >>>>> thing is that if we can bypass the mempool decisively to
> > pursue
> > > > the
> > > > >>>>> significant performance gains.
> > > > >>>>
> > > > >>>> I understand the intention, and I personally think this is
> > wrong
> > > > and
> > > > >>>> dangerous attitude.
> > > > >>>> We have mempool abstraction in place for very good reason.
> > > > >>>> So we need to try to improve mempool performance (and API if
> > > > >>>> necessary) at first place, not to avoid it and break our own
> > > > >>>> rules
> > > > and
> > > > >> recommendations.
> > > > >>> The abstraction can be thought of at a higher level. i.e. the
> > > > driver manages the
> > > > >> buffer allocation/free and is hidden from the application. The
> > > > application does
> > > > >> not need to be aware of how these changes are implemented.
> > > > >>>
> > > > >>>>
> > > > >>>>
> > > > >>>>> For ZC, there maybe a problem for it in i40e. The reason for
> > > > >>>>> that put Tx buffers into temporary is that i40e_tx_entry
> > > > >>>>> includes
> > > > buffer
> > > > >>>>> pointer and index.
> > > > >>>>> Thus we cannot put Tx SW_ring entry into mempool directly,
> > > > >>>>> we
> > > > need
> > > > >>>>> to first extract the mbuf pointer. Finally, though we use ZC,
> > we
> > > > still
> > > > >>>>> can't avoid using a temporary stack to extract Tx buffer
> > > > pointers.
> > > > >>>>
> > > > >>>> When talking about ZC API for mempool cache I meant something
> > > > like:
> > > > >>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache
> > *mc,
> > > > >>>> uint32_t *nb_elem, uint32_t flags); void
> > > > >>>> mempool_cache_put_zc_finish(struct
> > > > >>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will
> > > > >>>> return user a pointer inside mp-cache where to put free elems
> > and
> > > > >>>> max
> > > > number
> > > > >>>> of slots that can be safely filled.
> > > > >>>> _finish_ will update mc->len.
> > > > >>>> As an example:
> > > > >>>>
> > > > >>>> /* expect to free N mbufs */
> > > > >>>> uint32_t n = N;
> > > > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > > > >>>>
> > > > >>>> /* free up to n elems */
> > > > >>>> for (i = 0; i != n; i++) {
> > > > >>>>
> > > > >>>> /* get next free mbuf from somewhere */
> > > > >>>> mb = extract_and_prefree_mbuf(...);
> > > > >>>>
> > > > >>>> /* no more free mbufs for now */
> > > > >>>> if (mb == NULL)
> > > > >>>> break;
> > > > >>>>
> > > > >>>> p[i] = mb;
> > > > >>>> }
> > > > >>>>
> > > > >>>> /* finalize ZC put, with _i_ freed elems */
> > > > >>>> mempool_cache_put_zc_finish(mc, i);
> > > > >>>>
> > > > >>>> That way, I think we can overcome the issue with
> > > > >>>> i40e_tx_entry you mentioned above. Plus it might be useful in
> > > > >>>> other similar
> > places.
> > > > >>>>
> > > > >>>> Another alternative is obviously to split i40e_tx_entry into
> > two
> > > > >>>> structs (one for mbuf, second for its metadata) and have a
> > > > separate
> > > > >>>> array for each of them.
> > > > >>>> Though with that approach we need to make sure no perf drops
> > will
> > > > be
> > > > >>>> introduced, plus probably more code changes will be required.
> > > > >>> Commit '5171b4ee6b6" already does this (in a different way),
> > but
> > > > just for
> > > > >> AVX512. Unfortunately, it does not record any performance
> > > > improvements. We
> > > > >> could port this to Arm NEON and look at the performance.
> > > > >
> > > >
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v1 0/5] Direct re-arming of buffers on receive side
2022-07-01 19:30 ` Honnappa Nagarahalli
@ 2022-07-01 20:28 ` Morten Brørup
0 siblings, 0 replies; 145+ messages in thread
From: Morten Brørup @ 2022-07-01 20:28 UTC (permalink / raw)
To: Honnappa Nagarahalli, Konstantin Ananyev, Feifei Wang
Cc: nd, dev, Ruifeng Wang, honnappanagarahalli, nd
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 1 July 2022 21.31
>
> <snip>
>
> > > > > >>>>
> > > > > >>>> 16/05/2022 07:10, Feifei Wang пишет:
> > > > > >>>>>
> > > > > >>>>>>> Currently, the transmit side frees the buffers into the
> > > lcore
> > > > > >>>>>>> cache and the receive side allocates buffers from the
> > > > > >>>>>>> lcore
> > > > > cache.
> > > > > >>>>>>> The transmit side typically frees 32 buffers resulting
> in
> > > > > >>>>>>> 32*8=256B of stores to lcore cache. The receive side
> > > allocates
> > > > > 32
> > > > > >>>>>>> buffers and stores them in the receive side software
> ring,
> > > > > >>>>>>> resulting in 32*8=256B of stores and 256B of load from
> the
> > > > > lcore cache.
> > > > > >>>>>>>
> > > > > >>>>>>> This patch proposes a mechanism to avoid freeing
> > > to/allocating
> > > > > >>>>>>> from the lcore cache. i.e. the receive side will free
> the
> > > > > buffers
> > > > > >>>>>>> from transmit side directly into it's software ring.
> This
> > > will
> > > > > >>>>>>> avoid the 256B of loads and stores introduced by the
> lcore
> > > > > cache.
> > > > > >>>>>>> It also frees up the cache lines used by the lcore
> cache.
> > > > > >>>>>>>
> > > > > >>>>>>> However, this solution poses several constraints:
> > > > > >>>>>>>
> > > > > >>>>>>> 1)The receive queue needs to know which transmit queue
> it
> > > > > should
> > > > > >>>>>>> take the buffers from. The application logic decides
> which
> > > > > >>>>>>> transmit port to use to send out the packets. In many
> use
> > > > > >>>>>>> cases the NIC might have a single port ([1], [2], [3]),
> in
> > > > > >>>>>>> which case
> > > > > a
> > > > > >>>>>>> given transmit queue is always mapped to a single
> receive
> > > > > >>>>>>> queue
> > > > > >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> > > > > >>>>>>>
> > > > > >>>>>>> If the NIC has 2 ports (there are several references),
> > > > > >>>>>>> then
> > > we
> > > > > >>>>>>> will have
> > > > > >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> > > > > configure.
> > > > > >>>>>>> However, if this is generalized to 'N' ports, the
> > > > > >>>>>>> configuration can be long. More over the PMD would have
> to
> > > > > >>>>>>> scan a list of transmit queues to pull the buffers
> from.
> > > > > >>>>>
> > > > > >>>>>> Just to re-iterate some generic concerns about this
> > > proposal:
> > > > > >>>>>> - We effectively link RX and TX queues - when this
> > > feature
> > > > > is enabled,
> > > > > >>>>>> user can't stop TX queue without stopping linked
> RX
> > > queue
> > > > > first.
> > > > > >>>>>> Right now user is free to start/stop any queues at
> > > > > >>>>>> his
> > > > > will.
> > > > > >>>>>> If that feature will allow to link queues from
> > > different
> > > > > ports,
> > > > > >>>>>> then even ports will become dependent and user
> will
> > > have
> > > > > to pay extra
> > > > > >>>>>> care when managing such ports.
> > > > > >>>>>
> > > > > >>>>> [Feifei] When direct rearm enabled, there are two path
> for
> > > > > >>>>> thread
> > > > > to
> > > > > >>>>> choose. If there are enough Tx freed buffers, Rx can put
> > > buffers
> > > > > >>>>> from Tx.
> > > > > >>>>> Otherwise, Rx will put buffers from mempool as usual.
> Thus,
> > > > > >>>>> users
> > > > > do
> > > > > >>>>> not need to pay much attention managing ports.
> > > > > >>>>
> > > > > >>>> What I am talking about: right now different port or
> > > > > >>>> different
> > > > > queues
> > > > > >>>> of the same port can be treated as independent entities:
> > > > > >>>> in general user is free to start/stop (and even
> reconfigure
> > > > > >>>> in
> > > > > some
> > > > > >>>> cases) one entity without need to stop other entity.
> > > > > >>>> I.E user can stop and re-configure TX queue while keep
> > > receiving
> > > > > >>>> packets from RX queue.
> > > > > >>>> With direct re-arm enabled, I think it wouldn't be
> possible
> > > any
> > > > > more:
> > > > > >>>> before stopping/reconfiguring TX queue user would have
> make
> > > sure
> > > > > that
> > > > > >>>> corresponding RX queue wouldn't be used by datapath.
> > > > > >>> I am trying to understand the problem better. For the TX
> queue
> > > to
> > > > > be stopped,
> > > > > >> the user must have blocked the data plane from accessing the
> TX
> > > > > queue.
> > > > > >>
> > > > > >> Surely it is user responsibility not to call tx_burst() for
> > > > > stopped/released queue.
> > > > > >> The problem is that while TX for that queue is stopped, RX
> for
> > > > > related queue still
> > > > > >> can continue.
> > > > > >> So rx_burst() will try to read/modify TX queue data, that
> might
> > > be
> > > > > already freed,
> > > > > >> or simultaneously modified by control path.
> > > > > > Understood, agree on the issue
> > > > > >
> > > > > >>
> > > > > >> Again, it all can be mitigated by carefully re-designing and
> > > > > modifying control and
> > > > > >> data-path inside user app - by doing extra checks and
> > > > > synchronizations, etc.
> > > > > >> But from practical point - I presume most of users simply
> would
> > > > > avoid using this
> > > > > >> feature due all potential problems it might cause.
> > > > > > That is subjective, it all depends on the performance
> > > improvements
> > > > > users see in their application.
> > > > > > IMO, the performance improvement seen with this patch is
> worth
> > > few
> > > > > changes.
> > > > >
> > > > > Yes, it is subjective till some extent, though my feeling that
> it
> > > > > might end-up being sort of synthetic improvement used only by
> some
> > > > > show-case benchmarks.
> > > >
> > > > I believe that one specific important use case has already been
> > > mentioned, so I
> > > > don't think this is a benchmark only feature.
> > > +1
> > >
> > > >
> > > > > From my perspective, it would be much more plausible, if we
> can
> > > > > introduce some sort of generic improvement, that doesn't impose
> > > > > all these extra constraints and implications.
> > > > > Like one, discussed below in that thread with ZC mempool
> approach.
> > > > >
> > > >
> > > > Considering this feature from a high level perspective, I agree
> with
> > > Konstantin's
> > > > concerns, so I'll also support his views.
> > > We did hack the ZC mempool approach [1], level of improvement is
> > > pretty small compared with this patch.
> > >
> > > [1]
> http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-
> > > 1-feifei.wang2@arm.com/
> > >
> > > >
> > > > If this patch is supposed to be a generic feature, please add
> > > > support
> > > for it in all
> > > > NIC PMDs, not just one. (Regardless if the feature is defined as
> 1:1
> > > mapping or
> > > > N:M mapping.) It is purely software, so it should be available
> for
> > > all PMDs, not
> > > > just your favorite hardware! Consider the "fast mbuf free"
> feature,
> > > which is
> > > > pure software; why is that feature not implemented in all PMDs?
> > > Agree, it is good to have it supported in all the drivers. We do
> not
> > > have a favorite hardware, just picked a PMD which we are more
> familiar
> > > with. We do plan to implement in other prominent PMDs.
> > >
> > > >
> > > > A secondary point I'm making here is that this specific feature
> will
> > > lead to an
> > > > enormous amount of copy-paste code, instead of a generic library
> > > function
> > > > easily available for all PMDs.
> > > Are you talking about the i40e driver code in specific? If yes,
> agree
> > > we should avoid copy-paste and we will look to reduce that.
> >
> > Yes, I am talking about the code that needs to be copied into all
> prominent
> > PMDs. Perhaps you can move the majority of it into a common
> directory, if not
> > in a generic library, so the modification per PMD becomes smaller. (I
> see the
> > same copy-paste issue with the "fast mbuf free" feature, if to be
> supported by
> > other than the i40e PMD.)
> The current abstraction does not allow for common code at this (lower)
> level across all the PMDs. If we look at "fast free", it is accessing
> the device private structure for the list of buffers to free. If it
> needs to be common code, this needs to be lifted up along with other
> dependent configuration thresholds etc.
Exactly. The "direct re-arm" feature has some of the same design properties as "fast free": They are both purely software features, and they are both embedded deeply in the PMD code.
I don't know if it is possible, but perhaps we could redesign the private buffer structures of the PMDs, so the first part of it is common across PMDs, and the following part is truly private. The common part would obviously hold the mbuf pointer. That way, a common library could manipulate the public part.
Or perhaps some other means could be provided for a common library to manipulate the common parts of a private structure, e.g. a common fast_free function could take a few parameters to work on private buffer structures: void * buffer_array, size_t element_size, size_t mbuf_offset_in_element, unsigned int num_elements.
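As a rough sketch of that idea (everything below is invented here for illustration and assumes fast-free semantics, i.e. all mbufs come from one mempool and have refcnt == 1):

#include <stddef.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical common helper: walk an opaque array of PMD-private
 * sw-ring elements and return the mbuf stored at a fixed offset in
 * each element to its mempool. Not part of any DPDK release. */
static inline void
common_fast_free(void *buffer_array, size_t element_size,
        size_t mbuf_offset_in_element, unsigned int num_elements)
{
    char *entry = buffer_array;
    unsigned int i;

    for (i = 0; i < num_elements; i++) {
        struct rte_mbuf **slot =
            (struct rte_mbuf **)(void *)(entry + mbuf_offset_in_element);

        if (*slot != NULL) {
            rte_mempool_put((*slot)->pool, *slot);
            *slot = NULL;
        }
        entry += element_size;
    }
}

A PMD would then pass its sw-ring pointer, the sizeof() of its private entry type and the offsetof() of the mbuf pointer inside that type, so the entry layout itself stays private.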
>
> >
> > Please note that I do not expect you to implement this feature in
> other PMDs
> > than you need. I was trying to make the point that implementing a
> software
> > feature in a PMD requires copy-pasting to other PMDs, which can
> require a big
> > effort; while implementing it in a library and calling the library
> from the PMDs
> > require a smaller effort per PMD. I intentionally phrased it somewhat
> > provokingly, and was lucky not to offend anyone. :-)
Unless we find a better solution, copy-paste code across PMDs seems to be the only way to implement such features. And I agree that they should not be blocked due to "code complexity" or "copy-paste" arguments, if they are impossible to implement in a more "correct" way.
The DPDK community should accept such contributions to the common code, and be grateful that they don't just go into private forks of DPDK!
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v3 0/3] Direct re-arming of buffers on receive side
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
` (6 preceding siblings ...)
[not found] ` <20220516061012.618787-1-feifei.wang2@arm.com>
@ 2023-01-04 7:30 ` Feifei Wang
2023-01-04 7:30 ` [PATCH v3 1/3] ethdev: enable direct rearm with separate API Feifei Wang
` (4 more replies)
2023-08-02 7:38 ` [PATCH v8 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
` (5 subsequent siblings)
13 siblings, 5 replies; 145+ messages in thread
From: Feifei Wang @ 2023-01-04 7:30 UTC (permalink / raw)
Cc: dev, konstantin.v.ananyev, nd, Feifei Wang
Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.
This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache. i.e. the receive side will free the buffers from
transmit side directly into its software ring. This will avoid the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache.
However, this solution poses several constraints:
1)The receive queue needs to know which transmit queue it should take
the buffers from. The application logic decides which transmit port to
use to send out the packets. In many use cases the NIC might have a
single port ([1], [2], [3]), in which case a given transmit queue is
always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
is easy to configure.
If the NIC has 2 ports (there are several references), then we will have
1:2 (RX queue: TX queue) mapping which is still easy to configure.
However, if this is generalized to 'N' ports, the configuration can be
long. Moreover, the PMD would have to scan a list of transmit queues to
pull the buffers from.
2)The other factor that needs to be considered is 'run-to-completion' vs
'pipeline' models. In the run-to-completion model, the receive side and
the transmit side are running on the same lcore serially. In the pipeline
model. The receive side and transmit side might be running on different
lcores in parallel. This requires locking. This is not supported at this
point.
3)Tx and Rx buffers must be from the same mempool. And we also must
ensure Tx buffer free number is equal to Rx buffer free number.
Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
is because tx_next_dd is the variable used to compute the Tx sw-ring
free location; its value is always one round ahead of the position
where the next free starts.
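For the i40e vector path, this constraint maps to the Tx queue
configuration; a minimal sketch, assuming a burst size of 32 (the value
used by the i40e vector rearm path) and default thresholds otherwise:

#include <rte_ethdev.h>

/* Sketch only: make the Tx free burst match the Rx rearm burst. */
static int
setup_txq_for_direct_rearm(uint16_t port_id, uint16_t queue_id,
        uint16_t nb_txd, unsigned int socket_id)
{
    struct rte_eth_dev_info dev_info;
    struct rte_eth_txconf txconf;
    int ret;

    ret = rte_eth_dev_info_get(port_id, &dev_info);
    if (ret != 0)
        return ret;

    txconf = dev_info.default_txconf;
    /* Assumption: 32 Tx buffers freed per burst, equal to the number
     * of Rx buffers rearmed per burst. */
    txconf.tx_rs_thresh = 32;
    txconf.tx_free_thresh = 32;

    return rte_eth_tx_queue_setup(port_id, queue_id, nb_txd,
            socket_id, &txconf);
}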
Current status in this patch:
1)Two APIs are added for users to enable direct-rearm mode:
In the control plane, users can call 'rte_eth_rx_queue_rearm_data_get'
to get the Rx sw_ring pointer and its rxq_info
(this avoids the Tx side loading Rx data directly);
In the data plane, users can call 'rte_eth_dev_direct_rearm' to rearm Rx
buffers and free Tx buffers at the same time. Specifically, this API is
built from two separate APIs for Rx and Tx:
For Tx, 'rte_eth_tx_fill_sw_ring' fills a given sw_ring with Tx freed buffers.
For Rx, 'rte_eth_rx_flush_descriptor' flushes its descriptors based
on the rearmed buffers.
Thus, the Rx and Tx operations are separated, and the user can even
re-arm an RX queue not from the same driver's TX queue but from a
different source (see the usage sketch after this status list).
-----------------------------------------------------------------------
control plane:
rte_eth_rx_queue_rearm_data_get(*rxq_rearm_data);
data plane:
loop {
    rte_eth_dev_direct_rearm(*rxq_rearm_data) {
        rte_eth_tx_fill_sw_ring {
            for (i = 0; i < 32; i++) {
                sw_ring.mbuf[i] = tx.mbuf[i];
            }
        }
        rte_eth_rx_flush_descriptor {
            for (i = 0; i < 32; i++) {
                flush descs[i];
            }
        }
    }
    rte_eth_rx_burst;
    rte_eth_tx_burst;
}
-----------------------------------------------------------------------
2)The i40e driver is changed to do the direct re-arm of the receive
side.
3)The ixgbe driver is changed to do the direct re-arm of the receive
side.
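A minimal usage sketch in C of the APIs above (the 1:1 queue mapping,
the burst size of 32 and the drop-on-full handling are assumptions for
illustration; error handling is trimmed):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
direct_rearm_loop(uint16_t rx_port, uint16_t rx_queue,
        uint16_t tx_port, uint16_t tx_queue)
{
    struct rte_eth_rxq_rearm_data rearm_data;
    struct rte_mbuf *pkts[32];
    uint16_t nb_rx, nb_tx;

    /* Control plane: fetch the Rx queue rearm data once. */
    if (rte_eth_rx_queue_rearm_data_get(rx_port, rx_queue, &rearm_data) != 0)
        return;

    /* Data plane */
    for (;;) {
        /* Move Tx freed buffers straight into the Rx sw-ring and
         * flush the corresponding Rx descriptors. */
        rte_eth_dev_direct_rearm(rx_port, rx_queue, tx_port, tx_queue,
                &rearm_data);

        nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, 32);
        if (nb_rx == 0)
            continue;

        nb_tx = rte_eth_tx_burst(tx_port, tx_queue, pkts, nb_rx);
        /* Drop anything the Tx queue could not accept. */
        while (nb_tx < nb_rx)
            rte_pktmbuf_free(pkts[nb_tx++]);
    }
}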
Testing status:
(1) dpdk l3fwd test with multiple drivers:
    port 0: 82599 NIC, port 1: XL710 NIC
-------------------------------------------------------------
                  Without fast free    With fast free
Thunderx2:            +9.44%               +7.14%
-------------------------------------------------------------
(2) dpdk l3fwd test with same driver:
    port 0 && 1: XL710 NIC
-------------------------------------------------------------
*Direct rearm with exposing rx_sw_ring:
                  Without fast free    With fast free
Ampere altra:         +14.98%              +15.77%
n1sdp:                +6.47%               +0.52%
-------------------------------------------------------------
(3) VPP test with same driver:
    port 0 && 1: XL710 NIC
-------------------------------------------------------------
*Direct rearm with exposing rx_sw_ring:
Ampere altra:         +4.59%
n1sdp:                +5.4%
-------------------------------------------------------------
Reference:
[1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
[2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
[3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g
V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
V3:
1. Separate Rx and Tx operations with two APIs in direct-rearm (Konstantin)
2. Delete the L3fwd change for direct rearm (Jerin)
3. Enable direct rearm in the ixgbe driver on Arm
Feifei Wang (3):
ethdev: enable direct rearm with separate API
net/i40e: enable direct rearm with separate API
net/ixgbe: enable direct rearm with separate API
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
drivers/net/i40e/i40e_rxtx.c | 19 +++
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/i40e_rxtx_vec_common.h | 54 +++++++
drivers/net/i40e/i40e_rxtx_vec_neon.c | 42 ++++++
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
drivers/net/ixgbe/ixgbe_rxtx.c | 19 +++
drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
drivers/net/ixgbe/ixgbe_rxtx_vec_common.h | 48 ++++++
drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 52 +++++++
lib/ethdev/ethdev_driver.h | 10 ++
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 52 +++++++
lib/ethdev/rte_ethdev.h | 174 ++++++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 11 ++
lib/ethdev/version.map | 6 +
18 files changed, 504 insertions(+)
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-01-04 7:30 ` [PATCH v3 0/3] " Feifei Wang
@ 2023-01-04 7:30 ` Feifei Wang
2023-01-04 8:21 ` Morten Brørup
2023-02-02 14:33 ` Konstantin Ananyev
2023-01-04 7:30 ` [PATCH v3 2/3] net/i40e: " Feifei Wang
` (3 subsequent siblings)
4 siblings, 2 replies; 145+ messages in thread
From: Feifei Wang @ 2023-01-04 7:30 UTC (permalink / raw)
To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
Cc: dev, konstantin.v.ananyev, nd, Feifei Wang, Honnappa Nagarahalli,
Ruifeng Wang
Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' APIs for direct rearm
mode, separating the Rx and Tx operations. This also allows different
sources in direct rearm mode; for example, the Rx driver can be ixgbe
while the Tx driver is i40e.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
lib/ethdev/ethdev_driver.h | 10 ++
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 52 +++++++++++
lib/ethdev/rte_ethdev.h | 174 +++++++++++++++++++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 11 +++
lib/ethdev/version.map | 6 ++
6 files changed, 255 insertions(+)
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 6a550cfc83..bc539ec862 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -59,6 +59,10 @@ struct rte_eth_dev {
eth_rx_descriptor_status_t rx_descriptor_status;
/** Check the status of a Tx descriptor */
eth_tx_descriptor_status_t tx_descriptor_status;
+ /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
+ eth_tx_fill_sw_ring_t tx_fill_sw_ring;
+ /** Flush Rx descriptor in direct rearm mode */
+ eth_rx_flush_descriptor_t rx_flush_descriptor;
/**
* Device data that is shared between primary and secondary processes
@@ -504,6 +508,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
+/**< @internal Get rearm data for a receive queue of an Ethernet device. */
+typedef void (*eth_rxq_rearm_data_get_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+
typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
uint16_t queue_id, struct rte_eth_burst_mode *mode);
@@ -1215,6 +1223,8 @@ struct eth_dev_ops {
eth_rxq_info_get_t rxq_info_get;
/** Retrieve Tx queue information */
eth_txq_info_get_t txq_info_get;
+ /** Get Rx queue rearm data */
+ eth_rxq_rearm_data_get_t rxq_rearm_data_get;
eth_burst_mode_get_t rx_burst_mode_get; /**< Get Rx burst mode */
eth_burst_mode_get_t tx_burst_mode_get; /**< Get Tx burst mode */
eth_fw_version_get_t fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 48090c879a..c5dd5e30f6 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -276,6 +276,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
fpo->rx_queue_count = dev->rx_queue_count;
fpo->rx_descriptor_status = dev->rx_descriptor_status;
fpo->tx_descriptor_status = dev->tx_descriptor_status;
+ fpo->tx_fill_sw_ring = dev->tx_fill_sw_ring;
+ fpo->rx_flush_descriptor = dev->rx_flush_descriptor;
fpo->rxq.data = dev->data->rx_queues;
fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 5d5e18db1e..2af5cb42fe 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -3282,6 +3282,21 @@ rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id, uint16_t rx_queue_id,
stat_idx, STAT_QMAP_RX));
}
+int
+rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
+ uint16_t tx_port_id, uint16_t tx_queue_id,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+ int nb_rearm = 0;
+
+ nb_rearm = rte_eth_tx_fill_sw_ring(tx_port_id, tx_queue_id, rxq_rearm_data);
+
+ if (nb_rearm > 0)
+ return rte_eth_rx_flush_descriptor(rx_port_id, rx_queue_id, nb_rearm);
+
+ return 0;
+}
+
int
rte_eth_dev_fw_version_get(uint16_t port_id, char *fw_version, size_t fw_size)
{
@@ -5323,6 +5338,43 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
return 0;
}
+int
+rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+ struct rte_eth_dev *dev;
+
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+ dev = &rte_eth_devices[port_id];
+
+ if (queue_id >= dev->data->nb_rx_queues) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+ return -EINVAL;
+ }
+
+ if (rxq_rearm_data == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Cannot get ethdev port %u Rx queue %u rearm data to NULL\n",
+ port_id, queue_id);
+ return -EINVAL;
+ }
+
+ if (dev->data->rx_queues == NULL ||
+ dev->data->rx_queues[queue_id] == NULL) {
+ RTE_ETHDEV_LOG(ERR,
+ "Rx queue %"PRIu16" of device with port_id=%"
+ PRIu16" has not been setup\n",
+ queue_id, port_id);
+ return -EINVAL;
+ }
+
+ if (*dev->dev_ops->rxq_rearm_data_get == NULL)
+ return -ENOTSUP;
+
+ dev->dev_ops->rxq_rearm_data_get(dev, queue_id, rxq_rearm_data);
+
+ return 0;
+}
+
int
rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index c129ca1eaf..381c3d535f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
} __rte_cache_min_aligned;
+/**
+ * @internal
+ * Structure used to hold pointers to internal ethdev Rx rearm data.
+ * The main purpose is to load Rx queue rearm data in direct rearm mode.
+ */
+struct rte_eth_rxq_rearm_data {
+ void *rx_sw_ring;
+ uint16_t *rearm_start;
+ uint16_t *rearm_nb;
+} __rte_cache_min_aligned;
+
/* Generic Burst mode flag definition, values can be ORed. */
/**
@@ -3184,6 +3195,34 @@ int rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id,
uint16_t rx_queue_id,
uint8_t stat_idx);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Directly put Tx freed buffers into Rx sw-ring and flush desc.
+ *
+ * @param rx_port_id
+ * Port identifying the receive side.
+ * @param rx_queue_id
+ * The index of the receive queue identifying the receive side.
+ * The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param tx_port_id
+ * Port identifying the transmit side.
+ * @param tx_queue_id
+ * The index of the transmit queue identifying the transmit side.
+ * The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param rxq_rearm_data
+ * A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled.
+ * @return
+ * - 0: Success
+ */
+__rte_experimental
+int rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
+ uint16_t tx_port_id, uint16_t tx_queue_id,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+
/**
* Retrieve the Ethernet address of an Ethernet device.
*
@@ -4782,6 +4821,27 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+/**
+ * Get rearm data for a given port's Rx queue.
+ *
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The Rx queue on the Ethernet device for which rearm data
+ * will be retrieved.
+ * @param rxq_rearm_data
+ * A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled.
+ *
+ * @return
+ * - 0: Success
+ * - -ENODEV: If *port_id* is invalid.
+ * - -ENOTSUP: routine is not supported by the device PMD.
+ * - -EINVAL: The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+
/**
* Retrieve information about the Rx packet burst mode.
*
@@ -6103,6 +6163,120 @@ static inline int rte_eth_tx_descriptor_status(uint16_t port_id,
return (*p->tx_descriptor_status)(qd, offset);
}
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Fill Rx sw-ring with Tx buffers in direct rearm mode.
+ *
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The index of the transmit queue.
+ * The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param rxq_rearm_data
+ * A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled with
+ * the rearm data of a receive queue.
+ * @return
+ * - The number of buffers filled into the Rx sw-ring.
+ * - (-EINVAL) bad port or queue.
+ * - (-ENODEV) bad port.
+ * - (-ENOTSUP) if the device does not support this function.
+ *
+ */
+__rte_experimental
+static inline int rte_eth_tx_fill_sw_ring(uint16_t port_id,
+ uint16_t queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+ struct rte_eth_fp_ops *p;
+ void *qd;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ if (port_id >= RTE_MAX_ETHPORTS ||
+ queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR,
+ "Invalid port_id=%u or queue_id=%u\n",
+ port_id, queue_id);
+ return -EINVAL;
+ }
+#endif
+
+ p = &rte_eth_fp_ops[port_id];
+ qd = p->txq.data[queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+ queue_id, port_id);
+ return -ENODEV;
+ }
+#endif
+
+ if (p->tx_fill_sw_ring == NULL)
+ return -ENOTSUP;
+
+ return p->tx_fill_sw_ring(qd, rxq_rearm_data);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Flush Rx descriptor in direct rearm mode.
+ *
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The index of the receive queue.
+ * The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param nb_rearm
+ * The number of Rx sw-ring buffers that need to be flushed.
+ * @return
+ * - (0) if successful.
+ * - (-EINVAL) bad port or queue.
+ * - (-ENODEV) bad port.
+ * - (-ENOTSUP) if the device does not support this function.
+ */
+__rte_experimental
+static inline int rte_eth_rx_flush_descriptor(uint16_t port_id,
+ uint16_t queue_id, uint16_t nb_rearm)
+{
+ struct rte_eth_fp_ops *p;
+ void *qd;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ if (port_id >= RTE_MAX_ETHPORTS ||
+ queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR,
+ "Invalid port_id=%u or queue_id=%u\n",
+ port_id, queue_id);
+ return -EINVAL;
+ }
+#endif
+
+ p = &rte_eth_fp_ops[port_id];
+ qd = p->rxq.data[queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+ queue_id, port_id);
+ return -ENODEV;
+ }
+#endif
+
+ if (p->rx_flush_descriptor == NULL)
+ return -ENOTSUP;
+
+ return p->rx_flush_descriptor(qd, nb_rearm);
+}
+
/**
* @internal
* Helper routine for rte_eth_tx_burst().
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index dcf8adab92..5ecb57f6f0 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
/** @internal Check the status of a Tx descriptor */
typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
+/** @internal Fill Rx sw-ring with Tx buffers in direct rearm mode */
+typedef int (*eth_tx_fill_sw_ring_t)(void *txq,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+
+/** @internal Flush Rx descriptor in direct rearm mode */
+typedef int (*eth_rx_flush_descriptor_t)(void *rxq, uint16_t nb_rearm);
+
/**
* @internal
* Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -90,6 +97,8 @@ struct rte_eth_fp_ops {
eth_rx_queue_count_t rx_queue_count;
/** Check the status of a Rx descriptor. */
eth_rx_descriptor_status_t rx_descriptor_status;
+ /** Flush Rx descriptor in direct rearm mode */
+ eth_rx_flush_descriptor_t rx_flush_descriptor;
/** Rx queues data. */
struct rte_ethdev_qdata rxq;
uintptr_t reserved1[3];
@@ -106,6 +115,8 @@ struct rte_eth_fp_ops {
eth_tx_prep_t tx_pkt_prepare;
/** Check the status of a Tx descriptor. */
eth_tx_descriptor_status_t tx_descriptor_status;
+ /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
+ eth_tx_fill_sw_ring_t tx_fill_sw_ring;
/** Tx queues data. */
struct rte_ethdev_qdata txq;
uintptr_t reserved2[3];
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 17201fbe0f..f39f02a69b 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -298,6 +298,12 @@ EXPERIMENTAL {
rte_flow_get_q_aged_flows;
rte_mtr_meter_policy_get;
rte_mtr_meter_profile_get;
+
+ # added in 23.03
+ rte_eth_dev_direct_rearm;
+ rte_eth_rx_flush_descriptor;
+ rte_eth_rx_queue_rearm_data_get;
+ rte_eth_tx_fill_sw_ring;
};
INTERNAL {
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
2023-01-04 7:30 ` [PATCH v3 0/3] " Feifei Wang
2023-01-04 7:30 ` [PATCH v3 1/3] ethdev: enable direct rearm with separate API Feifei Wang
@ 2023-01-04 7:30 ` Feifei Wang
2023-02-02 14:37 ` Konstantin Ananyev
2023-01-04 7:30 ` [PATCH v3 3/3] net/ixgbe: " Feifei Wang
` (2 subsequent siblings)
4 siblings, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2023-01-04 7:30 UTC (permalink / raw)
To: Yuying Zhang, Beilei Xing, Ruifeng Wang
Cc: dev, konstantin.v.ananyev, nd, Feifei Wang, Honnappa Nagarahalli
Add internal API to separate direct rearm operations between
Rx and Tx.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
drivers/net/i40e/i40e_rxtx.c | 19 +++++++++
drivers/net/i40e/i40e_rxtx.h | 4 ++
drivers/net/i40e/i40e_rxtx_vec_common.h | 54 +++++++++++++++++++++++++
drivers/net/i40e/i40e_rxtx_vec_neon.c | 42 +++++++++++++++++++
6 files changed, 122 insertions(+)
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 7726a89d99..29c1ce2470 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -497,6 +497,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
.flow_ops_get = i40e_dev_flow_ops_get,
.rxq_info_get = i40e_rxq_info_get,
.txq_info_get = i40e_txq_info_get,
+ .rxq_rearm_data_get = i40e_rxq_rearm_data_get,
.rx_burst_mode_get = i40e_rx_burst_mode_get,
.tx_burst_mode_get = i40e_tx_burst_mode_get,
.timesync_enable = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index fe943a45ff..6a6a2a6d3c 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1352,6 +1352,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_rxq_info *qinfo);
void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void i40e_rxq_rearm_data_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data);
int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_burst_mode *mode);
int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 788ffb51c2..d8d801acaf 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3197,6 +3197,19 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.offloads = txq->offloads;
}
+void
+i40e_rxq_rearm_data_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+ struct i40e_rx_queue *rxq;
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ rxq_rearm_data->rx_sw_ring = rxq->sw_ring;
+ rxq_rearm_data->rearm_start = &rxq->rxrearm_start;
+ rxq_rearm_data->rearm_nb = &rxq->rxrearm_nb;
+}
+
#ifdef RTE_ARCH_X86
static inline bool
get_avx_supported(bool request_avx512)
@@ -3321,6 +3334,9 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
PMD_INIT_LOG(DEBUG, "Using Vector Rx (port %d).",
dev->data->port_id);
dev->rx_pkt_burst = i40e_recv_pkts_vec;
+#ifdef RTE_ARCH_ARM64
+ dev->rx_flush_descriptor = i40e_rx_flush_descriptor_vec;
+#endif
}
#endif /* RTE_ARCH_X86 */
} else if (!dev->data->scattered_rx && ad->rx_bulk_alloc_allowed) {
@@ -3484,6 +3500,9 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
dev->data->port_id);
dev->tx_pkt_burst = i40e_xmit_pkts_vec;
+#ifdef RTE_ARCH_ARM64
+ dev->tx_fill_sw_ring = i40e_tx_fill_sw_ring;
+#endif
#endif /* RTE_ARCH_X86 */
} else {
PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index 5e6eecc501..8a29bd89df 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -233,6 +233,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
+int i40e_tx_fill_sw_ring(void *tx_queue,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+int i40e_rx_flush_descriptor_vec(void *rx_queue, uint16_t nb_rearm);
+
uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);
uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index fe1a6ec75e..eb96301a43 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -146,6 +146,60 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
return txq->tx_rs_thresh;
}
+int
+i40e_tx_fill_sw_ring(void *tx_queue,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+ struct i40e_tx_queue *txq = tx_queue;
+ struct i40e_tx_entry *txep;
+ void **rxep;
+ struct rte_mbuf *m;
+ int i, n;
+ int nb_rearm = 0;
+
+ if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
+ txq->nb_tx_free > txq->tx_free_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+
+ /* first buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1)
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = rxq_rearm_data->rx_sw_ring;
+ rxep += *rxq_rearm_data->rearm_start;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* directly put mbufs from Tx to Rx */
+ for (i = 0; i < n; i++, rxep++, txep++)
+ *rxep = txep[0].mbuf;
+ } else {
+ for (i = 0; i < n; i++, rxep++) {
+ m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+ if (m != NULL) {
+ *rxep = m;
+ nb_rearm++;
+ }
+ }
+ n = nb_rearm;
+ }
+
+ /* update counters for Tx */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return n;
+}
+
static __rte_always_inline void
tx_backlog_entry(struct i40e_tx_entry *txep,
struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..1509d3223b 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -739,6 +739,48 @@ i40e_xmit_fixed_burst_vec(void *__rte_restrict tx_queue,
return nb_pkts;
}
+int
+i40e_rx_flush_descriptor_vec(void *rx_queue, uint16_t nb_rearm)
+{
+ struct i40e_rx_queue *rxq = rx_queue;
+ struct i40e_rx_entry *rxep;
+ volatile union i40e_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64x2_t dma_addr;
+ uint64_t paddr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_rearm; i++) {
+ /* Initialize rxdp descs */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = vdupq_n_u64(paddr);
+ /* flush desc with pa dma_addr */
+ vst1q_u64((uint64_t *)&rxdp++->read, dma_addr);
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_rearm;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+ if (!rxq->rxrearm_start)
+ rx_id = rxq->nb_rx_desc - 1;
+ else
+ rx_id = rxq->rxrearm_start - 1;
+ }
+ rxq->rxrearm_nb -= nb_rearm;
+
+ rte_io_wmb();
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+
+ return 0;
+}
+
void __rte_cold
i40e_rx_queue_release_mbufs_vec(struct i40e_rx_queue *rxq)
{
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v3 3/3] net/ixgbe: enable direct rearm with separate API
2023-01-04 7:30 ` [PATCH v3 0/3] " Feifei Wang
2023-01-04 7:30 ` [PATCH v3 1/3] ethdev: enable direct rearm with separate API Feifei Wang
2023-01-04 7:30 ` [PATCH v3 2/3] net/i40e: " Feifei Wang
@ 2023-01-04 7:30 ` Feifei Wang
2023-01-31 6:13 ` RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side Feifei Wang
2023-03-22 12:56 ` Morten Brørup
4 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-01-04 7:30 UTC (permalink / raw)
To: Qiming Yang, Wenjun Wu, Ruifeng Wang
Cc: dev, konstantin.v.ananyev, nd, Feifei Wang, Honnappa Nagarahalli
Add internal API to separate direct rearm operations between
Rx and Tx.
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 ++
drivers/net/ixgbe/ixgbe_rxtx.c | 19 +++++++++
drivers/net/ixgbe/ixgbe_rxtx.h | 4 ++
drivers/net/ixgbe/ixgbe_rxtx_vec_common.h | 48 +++++++++++++++++++++
drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 52 +++++++++++++++++++++++
6 files changed, 127 insertions(+)
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index ae9f65b334..e5383d7dbc 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -542,6 +542,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
.set_mc_addr_list = ixgbe_dev_set_mc_addr_list,
.rxq_info_get = ixgbe_rxq_info_get,
.txq_info_get = ixgbe_txq_info_get,
+ .rxq_rearm_data_get = ixgbe_rxq_rearm_data_get,
.timesync_enable = ixgbe_timesync_enable,
.timesync_disable = ixgbe_timesync_disable,
.timesync_read_rx_timestamp = ixgbe_timesync_read_rx_timestamp,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 48290af512..2a8ae5af7a 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -625,6 +625,9 @@ void ixgbe_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
void ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void ixgbe_rxq_rearm_data_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+
int ixgbevf_dev_rx_init(struct rte_eth_dev *dev);
void ixgbevf_dev_tx_init(struct rte_eth_dev *dev);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index c9d6ca9efe..2d7fe710e4 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2559,6 +2559,9 @@ ixgbe_set_tx_function(struct rte_eth_dev *dev, struct ixgbe_tx_queue *txq)
ixgbe_txq_vec_setup(txq) == 0)) {
PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
+#ifdef RTE_ARCH_ARM64
+ dev->tx_fill_sw_ring = ixgbe_tx_fill_sw_ring;
+#endif
} else
dev->tx_pkt_burst = ixgbe_xmit_pkts_simple;
} else {
@@ -4853,6 +4856,9 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
dev->data->port_id);
dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
+#ifdef RTE_ARCH_ARM64
+ dev->rx_flush_descriptor = ixgbe_rx_flush_descriptor_vec;
+#endif
} else if (adapter->rx_bulk_alloc_allowed) {
PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
"satisfied. Rx Burst Bulk Alloc function "
@@ -5623,6 +5629,19 @@ ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.tx_deferred_start = txq->tx_deferred_start;
}
+void
+ixgbe_rxq_rearm_data_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+ struct ixgbe_rx_queue *rxq;
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ rxq_rearm_data->rx_sw_ring = rxq->sw_ring;
+ rxq_rearm_data->rearm_start = &rxq->rxrearm_start;
+ rxq_rearm_data->rearm_nb = &rxq->rxrearm_nb;
+}
+
/*
* [VF] Initializes Receive Unit.
*/
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 668a5b9814..7c90426f49 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -295,6 +295,10 @@ int ixgbe_dev_tx_done_cleanup(void *tx_queue, uint32_t free_cnt);
extern const uint32_t ptype_table[IXGBE_PACKET_TYPE_MAX];
extern const uint32_t ptype_table_tn[IXGBE_PACKET_TYPE_TN_MAX];
+int ixgbe_tx_fill_sw_ring(void *tx_queue,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data);
+int ixgbe_rx_flush_descriptor_vec(void *rx_queue, uint16_t nb_rearm);
+
uint16_t ixgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);
int ixgbe_txq_vec_setup(struct ixgbe_tx_queue *txq);
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h b/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h
index a4d9ec9b08..36799dd7f5 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_common.h
@@ -129,6 +129,54 @@ ixgbe_tx_free_bufs(struct ixgbe_tx_queue *txq)
return txq->tx_rs_thresh;
}
+int
+ixgbe_tx_fill_sw_ring(void *tx_queue,
+ struct rte_eth_rxq_rearm_data *rxq_rearm_data)
+{
+ struct ixgbe_tx_queue *txq = tx_queue;
+ struct ixgbe_tx_entry_v *txep;
+ void **rxep;
+ uint32_t status;
+ struct rte_mbuf *m;
+ int i, n;
+ int nb_rearm = 0;
+
+ if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
+ txq->nb_tx_free > txq->tx_free_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ status = txq->tx_ring[txq->tx_next_dd].wb.status;
+ if (!(status & IXGBE_ADVTXD_STAT_DD))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+
+ /* first buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1)
+ */
+ txep = &txq->sw_ring_v[txq->tx_next_dd - (n - 1)];
+ rxep = rxq_rearm_data->rx_sw_ring;
+ rxep += *rxq_rearm_data->rearm_start;
+
+ for (i = 0; i < n; i++, rxep++) {
+ m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+ if (m != NULL) {
+ *rxep = m;
+ nb_rearm++;
+ }
+ }
+ n = nb_rearm;
+
+ /* update counters for Tx */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return n;
+}
+
static __rte_always_inline void
tx_backlog_entry(struct ixgbe_tx_entry_v *txep,
struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
index 90b254ea26..af6a66c2d5 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c
@@ -633,6 +633,58 @@ ixgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
return nb_pkts;
}
+int
+ixgbe_rx_flush_descriptor_vec(void *rx_queue, uint16_t nb_rearm)
+{
+ struct ixgbe_rx_queue *rxq = rx_queue;
+ struct ixgbe_rx_entry *rxep;
+ volatile union ixgbe_adv_rx_desc *rxdp;
+ struct rte_mbuf *mb;
+ uint16_t rx_id;
+ uint64x2_t dma_addr;
+ uint64x2_t zero = vdupq_n_u64(0);
+ uint64_t paddr;
+ uint8x8_t p;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ p = vld1_u8((uint8_t *)&rxq->mbuf_initializer);
+
+ for (i = 0; i < nb_rearm; i++) {
+ mb = rxep[i].mbuf;
+ /*
+ * Flush mbuf with pkt template.
+ * Data to be rearmed is 6 bytes long.
+ */
+ vst1_u8((uint8_t *)&mb->rearm_data, p);
+ /* Initialize rxdp descs */
+ paddr = mb->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = vsetq_lane_u64(paddr, zero, 0);
+ /* flush desc with pa dma_addr */
+ vst1q_u64((uint64_t *)&rxdp++->read, dma_addr);
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_rearm;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
+ if (!rxq->rxrearm_start)
+ rx_id = rxq->nb_rx_desc - 1;
+ else
+ rx_id = rxq->rxrearm_start - 1;
+ }
+ rxq->rxrearm_nb -= nb_rearm;
+
+ /* Update the tail pointer on the NIC */
+ IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
+
+ return 0;
+}
+
static void __rte_cold
ixgbe_tx_queue_release_mbufs_vec(struct ixgbe_tx_queue *txq)
{
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-01-04 7:30 ` [PATCH v3 1/3] ethdev: enable direct rearm with separate API Feifei Wang
@ 2023-01-04 8:21 ` Morten Brørup
2023-01-04 8:51 ` RE: " Feifei Wang
2023-03-06 12:49 ` Ferruh Yigit
2023-02-02 14:33 ` Konstantin Ananyev
1 sibling, 2 replies; 145+ messages in thread
From: Morten Brørup @ 2023-01-04 8:21 UTC (permalink / raw)
To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang
> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Wednesday, 4 January 2023 08.31
>
> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> mode for separate Rx and Tx Operation. And this can support different
> multiple sources in direct rearm mode. For examples, Rx driver is
> ixgbe,
> and Tx driver is i40e.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
This feature looks very promising for performance. I am pleased to see progress on it.
Please confirm that the fast path functions are still thread safe, i.e. one EAL thread may be calling rte_eth_rx_burst() while another EAL thread is calling rte_eth_tx_burst().
A few more comments below.
> lib/ethdev/ethdev_driver.h | 10 ++
> lib/ethdev/ethdev_private.c | 2 +
> lib/ethdev/rte_ethdev.c | 52 +++++++++++
> lib/ethdev/rte_ethdev.h | 174 +++++++++++++++++++++++++++++++++++
> lib/ethdev/rte_ethdev_core.h | 11 +++
> lib/ethdev/version.map | 6 ++
> 6 files changed, 255 insertions(+)
>
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 6a550cfc83..bc539ec862 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -59,6 +59,10 @@ struct rte_eth_dev {
> eth_rx_descriptor_status_t rx_descriptor_status;
> /** Check the status of a Tx descriptor */
> eth_tx_descriptor_status_t tx_descriptor_status;
> + /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> + eth_tx_fill_sw_ring_t tx_fill_sw_ring;
What is "Rx sw-ring"? Please confirm that this is not an Intel PMD specific term and/or implementation detail, e.g. by providing a conceptual implementation for a non-Intel PMD, e.g. mlx5.
Please note: I do not request the ability to rearm between drivers from different vendors, I only request that the public ethdev API uses generic terms and concepts, so any NIC vendor can implement the direct-rearm functions in their PMDs.
> + /** Flush Rx descriptor in direct rearm mode */
> + eth_rx_flush_descriptor_t rx_flush_descriptor;
descriptor -> descriptors. There are more than one. Both in comment and function name.
[...]
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index c129ca1eaf..381c3d535f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
> uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
> } __rte_cache_min_aligned;
>
> +/**
> + * @internal
> + * Structure used to hold pointers to internal ethdev Rx rearm data.
> + * The main purpose is to load Rx queue rearm data in direct rearm
> mode.
> + */
> +struct rte_eth_rxq_rearm_data {
> + void *rx_sw_ring;
> + uint16_t *rearm_start;
> + uint16_t *rearm_nb;
> +} __rte_cache_min_aligned;
Please add descriptions to the fields in this structure.
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-01-04 8:21 ` Morten Brørup
@ 2023-01-04 8:51 ` Feifei Wang
2023-01-04 10:11 ` Morten Brørup
2023-03-06 12:49 ` Ferruh Yigit
1 sibling, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2023-01-04 8:51 UTC (permalink / raw)
To: Morten Brørup, thomas, Ferruh Yigit, Andrew Rybchenko
Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd
Hi, Morten
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, January 4, 2023 4:22 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
>
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Wednesday, 4 January 2023 08.31
> >
> > Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> > mode for separate Rx and Tx Operation. And this can support different
> > multiple sources in direct rearm mode. For examples, Rx driver is
> > ixgbe, and Tx driver is i40e.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
>
> This feature looks very promising for performance. I am pleased to see
> progress on it.
>
Thanks very much for your review.
> Please confirm that the fast path functions are still thread safe, i.e. one EAL
> thread may be calling rte_eth_rx_burst() while another EAL thread is calling
> rte_eth_tx_burst().
>
Regarding thread safety: as we say in the cover letter, direct rearm currently supports Rx and Tx in the same thread. If we consider multiple threads, as in the 'pipeline model', a 'lock' would need to be added in the data path, which can decrease the performance.
Thus, the first step is to enable direct rearm in a single thread; after that we will consider enabling direct rearm across multiple threads and improving the performance.
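To make the single-thread model concrete, here is a minimal usage sketch (not part of the patches; the port/queue ids, the burst size and the processing step are placeholders) showing where the direct-rearm calls would sit in a run-to-completion loop:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void
lcore_loop(uint16_t rx_port, uint16_t rx_queue,
        uint16_t tx_port, uint16_t tx_queue)
{
        struct rte_eth_rxq_rearm_data rearm_data;
        struct rte_mbuf *pkts[BURST_SIZE];
        uint16_t nb_rx, nb_tx;

        /* Setup time: cache the Rx queue's rearm pointers once. */
        if (rte_eth_rx_queue_rearm_data_get(rx_port, rx_queue, &rearm_data) != 0)
                return;

        for (;;) {
                /* Refill the Rx sw-ring directly from the Tx free path. */
                rte_eth_dev_direct_rearm(rx_port, rx_queue, tx_port, tx_queue,
                                &rearm_data);

                nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, BURST_SIZE);
                if (nb_rx == 0)
                        continue;

                /* ... application processing of pkts[] ... */

                nb_tx = rte_eth_tx_burst(tx_port, tx_queue, pkts, nb_rx);
                while (nb_tx < nb_rx)
                        rte_pktmbuf_free(pkts[nb_tx++]);
        }
}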
> A few more comments below.
>
> > lib/ethdev/ethdev_driver.h | 10 ++
> > lib/ethdev/ethdev_private.c | 2 +
> > lib/ethdev/rte_ethdev.c | 52 +++++++++++
> > lib/ethdev/rte_ethdev.h | 174
> +++++++++++++++++++++++++++++++++++
> > lib/ethdev/rte_ethdev_core.h | 11 +++
> > lib/ethdev/version.map | 6 ++
> > 6 files changed, 255 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 6a550cfc83..bc539ec862 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -59,6 +59,10 @@ struct rte_eth_dev {
> > eth_rx_descriptor_status_t rx_descriptor_status;
> > /** Check the status of a Tx descriptor */
> > eth_tx_descriptor_status_t tx_descriptor_status;
> > + /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > + eth_tx_fill_sw_ring_t tx_fill_sw_ring;
>
> What is "Rx sw-ring"? Please confirm that this is not an Intel PMD specific
> term and/or implementation detail, e.g. by providing a conceptual
> implementation for a non-Intel PMD, e.g. mlx5.
Rx sw_ring is used to store mbufs in the Intel PMDs. It is the same as 'rxq->elts'
in mlx5. Agreed that we need to provide a conceptual implementation for
all PMDs.
>
> Please note: I do not request the ability to rearm between drivers from
> different vendors, I only request that the public ethdev API uses generic
> terms and concepts, so any NIC vendor can implement the direct-rearm
> functions in their PMDs.
Agreed; this is also our consideration.
We plan to implement this function in different PMDs in the future.
>
> > + /** Flush Rx descriptor in direct rearm mode */
> > + eth_rx_flush_descriptor_t rx_flush_descriptor;
>
> descriptor -> descriptors. There are more than one. Both in comment and
> function name.
Agree.
>
> [...]
>
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > c129ca1eaf..381c3d535f 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
> > uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
> > } __rte_cache_min_aligned;
> >
> > +/**
> > + * @internal
> > + * Structure used to hold pointers to internal ethdev Rx rearm data.
> > + * The main purpose is to load Rx queue rearm data in direct rearm
> > mode.
> > + */
> > +struct rte_eth_rxq_rearm_data {
> > + void *rx_sw_ring;
> > + uint16_t *rearm_start;
> > + uint16_t *rearm_nb;
> > +} __rte_cache_min_aligned;
>
> Please add descriptions to the fields in this structure.
Agree.
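For example, the field descriptions could look roughly like this (only a sketch of the wording; the semantics are taken from the i40e/ixgbe implementations in patches 2/3 and 3/3):

/**
 * @internal
 * Structure used to hold pointers to internal ethdev Rx rearm data.
 * The main purpose is to load Rx queue rearm data in direct rearm mode.
 */
struct rte_eth_rxq_rearm_data {
        /** Rx queue sw-ring: the array of buffer pointers to be refilled */
        void *rx_sw_ring;
        /** Pointer to the sw-ring index where the next refill starts */
        uint16_t *rearm_start;
        /** Pointer to the number of sw-ring entries waiting to be refilled */
        uint16_t *rearm_nb;
} __rte_cache_min_aligned;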
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-01-04 8:51 ` RE: " Feifei Wang
@ 2023-01-04 10:11 ` Morten Brørup
2023-02-24 8:55 ` RE: " Feifei Wang
0 siblings, 1 reply; 145+ messages in thread
From: Morten Brørup @ 2023-01-04 10:11 UTC (permalink / raw)
To: Feifei Wang, thomas, Ferruh Yigit, Andrew Rybchenko
Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd
> From: Feifei Wang [mailto:Feifei.Wang2@arm.com]
> Sent: Wednesday, 4 January 2023 09.51
>
> Hi, Morten
>
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Wednesday, January 4, 2023 4:22 PM
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Wednesday, 4 January 2023 08.31
> > >
> > > Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> rearm
> > > mode for separate Rx and Tx Operation. And this can support
> different
> > > multiple sources in direct rearm mode. For examples, Rx driver is
> > > ixgbe, and Tx driver is i40e.
> > >
> > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > ---
> >
> > This feature looks very promising for performance. I am pleased to
> see
> > progress on it.
> >
> Thanks very much for your reviewing.
>
> > Please confirm that the fast path functions are still thread safe,
> i.e. one EAL
> > thread may be calling rte_eth_rx_burst() while another EAL thread is
> calling
> > rte_eth_tx_burst().
> >
> For the multiple threads safe, like we say in cover letter, current
> direct-rearm support
> Rx and Tx in the same thread. If we consider multiple threads like
> 'pipeline model', there
> need to add 'lock' in the data path which can decrease the performance.
> Thus, the first step we do is try to enable direct-rearm in the single
> thread, and then we will consider
> to enable direct rearm in multiple threads and improve the performance.
OK, doing it in steps is a good idea for a feature like this - makes it easier to understand and review.
When proceeding to add support for the "pipeline model", perhaps the lockless principles from the rte_ring can be used in this feature too.
From a high level perspective, I'm somewhat worried that releasing a "work-in-progress" version of this feature in some DPDK version will cause API/ABI breakage discussions when progressing to the next steps of the implementation to make the feature more complete. Not only support for thread safety across simultaneous RX and TX, but also support for multiple mbuf pools per RX queue [1]. Marking the functions experimental should alleviate such discussions, but there is a risk of pushback to not break the API/ABI anyway.
[1]: https://elixir.bootlin.com/dpdk/v22.11.1/source/lib/ethdev/rte_ethdev.h#L1105
[...]
> > > --- a/lib/ethdev/ethdev_driver.h
> > > +++ b/lib/ethdev/ethdev_driver.h
> > > @@ -59,6 +59,10 @@ struct rte_eth_dev {
> > > eth_rx_descriptor_status_t rx_descriptor_status;
> > > /** Check the status of a Tx descriptor */
> > > eth_tx_descriptor_status_t tx_descriptor_status;
> > > + /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > > + eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> >
> > What is "Rx sw-ring"? Please confirm that this is not an Intel PMD
> specific
> > term and/or implementation detail, e.g. by providing a conceptual
> > implementation for a non-Intel PMD, e.g. mlx5.
> Rx sw_ring is used to store mbufs in intel PMD. This is the same as
> 'rxq->elts'
> in mlx5.
Sounds good.
Then all we need is consensus on a generic name for this, unless "Rx sw-ring" already is the generic name. (I'm not a PMD developer, so I might be completely off track here.) Naming is often debatable, so I'll stop talking about it now - I only wanted to highlight that we should avoid vendor-specific terms in public APIs intended to be implemented by multiple vendors. On the other hand... if no other vendors raise their voices before merging into the DPDK main repository, they forfeit their right to complain about it. ;-)
> Agree with that we need to providing a conceptual
> implementation for all PMDs.
My main point is that we should ensure that the feature is not too tightly coupled with the way Intel PMDs implement mbuf handling. Providing a conceptual implementation for a non-Intel PMD is one way of checking this.
The actual implementation in other PMDs could be left up to the various NIC vendors.
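Just to illustrate what I mean by a conceptual implementation: a non-Intel PMD that keeps its Rx buffer pointers in a queue-private array could expose the same three pointers along these lines (hypothetical PMD and field names, not a real driver):

/* Hypothetical queue layout of some non-Intel PMD. */
struct foo_rx_queue {
        struct rte_mbuf **bufs;   /* per-queue array of Rx buffer pointers */
        uint16_t refill_head;     /* position where the next refill starts */
        uint16_t nb_to_refill;    /* entries currently waiting for buffers */
        /* ... */
};

static void
foo_rxq_rearm_data_get(struct rte_eth_dev *dev, uint16_t queue_id,
                struct rte_eth_rxq_rearm_data *rxq_rearm_data)
{
        struct foo_rx_queue *rxq = dev->data->rx_queues[queue_id];

        rxq_rearm_data->rx_sw_ring = rxq->bufs;
        rxq_rearm_data->rearm_start = &rxq->refill_head;
        rxq_rearm_data->rearm_nb = &rxq->nb_to_refill;
}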
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
2023-01-04 7:30 ` [PATCH v3 0/3] " Feifei Wang
` (2 preceding siblings ...)
2023-01-04 7:30 ` [PATCH v3 3/3] net/ixgbe: " Feifei Wang
@ 2023-01-31 6:13 ` Feifei Wang
2023-02-01 1:10 ` Konstantin Ananyev
2023-03-22 12:56 ` Morten Brørup
4 siblings, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2023-01-31 6:13 UTC (permalink / raw)
To: konstantin.v.ananyev; +Cc: dev, nd, nd
+ping konstantin,
Would you please give some comments for this patch series?
Thanks very much.
Best Regards
Feifei
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
2023-01-31 6:13 ` RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side Feifei Wang
@ 2023-02-01 1:10 ` Konstantin Ananyev
2023-02-01 2:24 ` RE: " Feifei Wang
0 siblings, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2023-02-01 1:10 UTC (permalink / raw)
To: Feifei Wang; +Cc: dev, nd
Hi Feifei,
> +ping konstantin,
>
> Would you please give some comments for this patch series?
> Thanks very much.
Sure, will have a look in next few days.
Apologies for the delay.
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
2023-02-01 1:10 ` Konstantin Ananyev
@ 2023-02-01 2:24 ` Feifei Wang
0 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-02-01 2:24 UTC (permalink / raw)
To: Konstantin Ananyev; +Cc: dev, nd, nd
That's all right. Thanks very much for your attention~
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Wednesday, February 1, 2023 9:11 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>
> Subject: Re: RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
>
> Hi Feifei,
>
> > +ping konstantin,
> >
> > Would you please give some comments for this patch series?
> > Thanks very much.
>
> Sure, will have a look in next few days.
> Apologies for the delay.
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-01-04 7:30 ` [PATCH v3 1/3] ethdev: enable direct rearm with separate API Feifei Wang
2023-01-04 8:21 ` Morten Brørup
@ 2023-02-02 14:33 ` Konstantin Ananyev
2023-02-24 9:45 ` RE: " Feifei Wang
1 sibling, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2023-02-02 14:33 UTC (permalink / raw)
To: Feifei Wang, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang
Hi Feifei,
> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> mode for separate Rx and Tx Operation. And this can support different
> multiple sources in direct rearm mode. For examples, Rx driver is ixgbe,
> and Tx driver is i40e.
Thanks for your effort and thanks for taking comments provided into
consideration.
That approach looks much better than the previous ones.
Few nits below.
Konstantin
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> lib/ethdev/ethdev_driver.h | 10 ++
> lib/ethdev/ethdev_private.c | 2 +
> lib/ethdev/rte_ethdev.c | 52 +++++++++++
> lib/ethdev/rte_ethdev.h | 174 +++++++++++++++++++++++++++++++++++
> lib/ethdev/rte_ethdev_core.h | 11 +++
> lib/ethdev/version.map | 6 ++
> 6 files changed, 255 insertions(+)
>
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 6a550cfc83..bc539ec862 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -59,6 +59,10 @@ struct rte_eth_dev {
> eth_rx_descriptor_status_t rx_descriptor_status;
> /** Check the status of a Tx descriptor */
> eth_tx_descriptor_status_t tx_descriptor_status;
> + /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> + eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> + /** Flush Rx descriptor in direct rearm mode */
> + eth_rx_flush_descriptor_t rx_flush_descriptor;
>
> /**
> * Device data that is shared between primary and secondary processes
> @@ -504,6 +508,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
> typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
> uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
>
> +/**< @internal Get rearm data for a receive queue of an Ethernet device. */
> +typedef void (*eth_rxq_rearm_data_get_t)(struct rte_eth_dev *dev,
> + uint16_t tx_queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> +
> typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
> uint16_t queue_id, struct rte_eth_burst_mode *mode);
>
> @@ -1215,6 +1223,8 @@ struct eth_dev_ops {
> eth_rxq_info_get_t rxq_info_get;
> /** Retrieve Tx queue information */
> eth_txq_info_get_t txq_info_get;
> + /** Get Rx queue rearm data */
> + eth_rxq_rearm_data_get_t rxq_rearm_data_get;
> eth_burst_mode_get_t rx_burst_mode_get; /**< Get Rx burst mode */
> eth_burst_mode_get_t tx_burst_mode_get; /**< Get Tx burst mode */
> eth_fw_version_get_t fw_version_get; /**< Get firmware version */
> diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> index 48090c879a..c5dd5e30f6 100644
> --- a/lib/ethdev/ethdev_private.c
> +++ b/lib/ethdev/ethdev_private.c
> @@ -276,6 +276,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
> fpo->rx_queue_count = dev->rx_queue_count;
> fpo->rx_descriptor_status = dev->rx_descriptor_status;
> fpo->tx_descriptor_status = dev->tx_descriptor_status;
> + fpo->tx_fill_sw_ring = dev->tx_fill_sw_ring;
> + fpo->rx_flush_descriptor = dev->rx_flush_descriptor;
>
> fpo->rxq.data = dev->data->rx_queues;
> fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 5d5e18db1e..2af5cb42fe 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -3282,6 +3282,21 @@ rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id, uint16_t rx_queue_id,
> stat_idx, STAT_QMAP_RX));
> }
>
> +int
> +rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_rx_queue_id,
> + struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> +{
> + int nb_rearm = 0;
> +
> + nb_rearm = rte_eth_tx_fill_sw_ring(tx_port_id, tx_rx_queue_id, rxq_rearm_data);
> +
> + if (nb_rearm > 0)
> + return rte_eth_rx_flush_descriptor(rx_port_id, rx_queue_id, nb_rearm);
> +
> + return 0;
> +}
> +
> int
> rte_eth_dev_fw_version_get(uint16_t port_id, char *fw_version, size_t fw_size)
> {
> @@ -5323,6 +5338,43 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
> return 0;
> }
>
> +int
> +rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
> + struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> +{
> + struct rte_eth_dev *dev;
> +
> + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> + dev = &rte_eth_devices[port_id];
> +
> + if (queue_id >= dev->data->nb_rx_queues) {
> + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
> + return -EINVAL;
> + }
> +
> + if (rxq_rearm_data == NULL) {
> + RTE_ETHDEV_LOG(ERR, "Cannot get ethdev port %u Rx queue %u rearm data to NULL\n",
> + port_id, queue_id);
> + return -EINVAL;
> + }
> +
> + if (dev->data->rx_queues == NULL ||
> + dev->data->rx_queues[queue_id] == NULL) {
> + RTE_ETHDEV_LOG(ERR,
> + "Rx queue %"PRIu16" of device with port_id=%"
> + PRIu16" has not been setup\n",
> + queue_id, port_id);
> + return -EINVAL;
> + }
> +
> + if (*dev->dev_ops->rxq_rearm_data_get == NULL)
> + return -ENOTSUP;
> +
> + dev->dev_ops->rxq_rearm_data_get(dev, queue_id, rxq_rearm_data);
> +
> + return 0;
> +}
> +
> int
> rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
> struct rte_eth_burst_mode *mode)
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index c129ca1eaf..381c3d535f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
> uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
> } __rte_cache_min_aligned;
>
> +/**
> + * @internal
> + * Structure used to hold pointers to internal ethdev Rx rearm data.
> + * The main purpose is to load Rx queue rearm data in direct rearm mode.
> + */
I think this structure owes a lot more explanation: what each field is
supposed to do, what the constraints are, etc.
In general, more documentation will be needed to explain this feature.
> +struct rte_eth_rxq_rearm_data {
> + void *rx_sw_ring;
That's misleading; we are always supposed to store mbuf pointers here,
so why not be direct:
struct rte_mbuf **rx_sw_ring;
> + uint16_t *rearm_start;
> + uint16_t *rearm_nb;
I know that for Intel NICs uint16_t is sufficient, but I wonder whether
it always would be for other vendors.
Another thing to consider is the case when the ring position wraps.
Again, I know that it is not required for Intel NICs, but would
it be sufficient for an API that is supposed to be general?
> +} __rte_cache_min_aligned;
> +
> /* Generic Burst mode flag definition, values can be ORed. */
>
> /**
> @@ -3184,6 +3195,34 @@ int rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id,
> uint16_t rx_queue_id,
> uint8_t stat_idx);
>
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Directly put Tx freed buffers into Rx sw-ring and flush desc.
> + *
> + * @param rx_port_id
> + * Port identifying the receive side.
> + * @param rx_queue_id
> + * The index of the receive queue identifying the receive side.
> + * The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + * @param tx_port_id
> + * Port identifying the transmit side.
> + * @param tx_queue_id
> + * The index of the transmit queue identifying the transmit side.
> + * The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + * @param rxq_rearm_data
> + * A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> + * @return
> + * - 0: Success
> + */
> +__rte_experimental
> +int rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
> + uint16_t tx_port_id, uint16_t tx_queue_id,
> + struct rte_eth_rxq_rearm_data *rxq_rearm_data);
I think we need one more parameter for that function, 'uint16_t offset'
or so, so that _rearm_ will start to populate rx_sw_ring from the
*rearm_start + offset position. That way we can support populating
from different sources.
Or should 'offset' be part of struct rte_eth_rxq_rearm_data?
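Either variant would look roughly like this (only to illustrate the two options, not a proposal for the final prototype):

/* Option 1: pass the offset explicitly. */
static inline int rte_eth_tx_fill_sw_ring(uint16_t port_id, uint16_t queue_id,
                struct rte_eth_rxq_rearm_data *rxq_rearm_data, uint16_t offset);

/* Option 2: carry the offset inside the rearm data. */
struct rte_eth_rxq_rearm_data {
        struct rte_mbuf **rx_sw_ring;
        uint16_t *rearm_start;
        uint16_t *rearm_nb;
        uint16_t offset;        /* filling starts at *rearm_start + offset */
};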
> +
> /**
> * Retrieve the Ethernet address of an Ethernet device.
> *
> @@ -4782,6 +4821,27 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
> int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
> struct rte_eth_txq_info *qinfo);
>
> +/**
> + * Get rearm data about given ports's Rx queue.
> + *
> + * @param port_id
> + * The port identifier of the Ethernet device.
> + * @param queue_id
> + * The Rx queue on the Ethernet device for which rearm data
> + * will be got.
> + * @param rxq_rearm_data
> + * A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> + *
> + * @return
> + * - 0: Success
> + * - -ENODEV: If *port_id* is invalid.
> + * - -ENOTSUP: routine is not supported by the device PMD.
> + * - -EINVAL: The queue_id is out of range.
> + */
> +__rte_experimental
> +int rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
> + struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> +
> /**
> * Retrieve information about the Rx packet burst mode.
> *
> @@ -6103,6 +6163,120 @@ static inline int rte_eth_tx_descriptor_status(uint16_t port_id,
> return (*p->tx_descriptor_status)(qd, offset);
> }
>
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Fill Rx sw-ring with Tx buffers in direct rearm mode.
> + *
> + * @param port_id
> + * The port identifier of the Ethernet device.
> + * @param queue_id
> + * The index of the transmit queue.
> + * The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + * @param rxq_rearm_data
> + * A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled with
> + * the rearm data of a receive queue.
> + * @return
> + * - The number buffers correct to be filled in the Rx sw-ring.
> + * - (-EINVAL) bad port or queue.
> + * - (-ENODEV) bad port.
> + * - (-ENOTSUP) if the device does not support this function.
> + *
> + */
> +__rte_experimental
> +static inline int rte_eth_tx_fill_sw_ring(uint16_t port_id,
> + uint16_t queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> +{
> + struct rte_eth_fp_ops *p;
> + void *qd;
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> + if (port_id >= RTE_MAX_ETHPORTS ||
> + queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> + RTE_ETHDEV_LOG(ERR,
> + "Invalid port_id=%u or queue_id=%u\n",
> + port_id, queue_id);
> + return -EINVAL;
> + }
> +#endif
> +
> + p = &rte_eth_fp_ops[port_id];
> + qd = p->txq.data[queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_TX
> + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> +
> + if (qd == NULL) {
> + RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
> + queue_id, port_id);
> + return -ENODEV;
> + }
> +#endif
> +
> + if (p->tx_fill_sw_ring == NULL)
> + return -ENOTSUP;
> +
> + return p->tx_fill_sw_ring(qd, rxq_rearm_data);
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Flush Rx descriptor in direct rearm mode.
> + *
> + * @param port_id
> + * The port identifier of the Ethernet device.
> + * @param queue_id
> + * The index of the receive queue.
> + * The value must be in the range [0, nb_rx_queue - 1] previously supplied
> + * to rte_eth_dev_configure().
> + *@param nb_rearm
> + * The number of Rx sw-ring buffers need to be flushed.
> + * @return
> + * - (0) if successful.
> + * - (-EINVAL) bad port or queue.
> + * - (-ENODEV) bad port.
> + * - (-ENOTSUP) if the device does not support this function.
> + */
> +__rte_experimental
> +static inline int rte_eth_rx_flush_descriptor(uint16_t port_id,
> + uint16_t queue_id, uint16_t nb_rearm)
> +{
> + struct rte_eth_fp_ops *p;
> + void *qd;
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> + if (port_id >= RTE_MAX_ETHPORTS ||
> + queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> + RTE_ETHDEV_LOG(ERR,
> + "Invalid port_id=%u or queue_id=%u\n",
> + port_id, queue_id);
> + return -EINVAL;
> + }
> +#endif
> +
> + p = &rte_eth_fp_ops[port_id];
> + qd = p->rxq.data[queue_id];
> +
> +#ifdef RTE_ETHDEV_DEBUG_RX
> + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> +
> + if (qd == NULL) {
> + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
> + queue_id, port_id);
> + return -ENODEV;
> + }
> +#endif
> +
> + if (p->rx_flush_descriptor == NULL)
> + return -ENOTSUP;
> +
> + return p->rx_flush_descriptor(qd, nb_rearm);
> +}
> +
> /**
> * @internal
> * Helper routine for rte_eth_tx_burst().
> diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
> index dcf8adab92..5ecb57f6f0 100644
> --- a/lib/ethdev/rte_ethdev_core.h
> +++ b/lib/ethdev/rte_ethdev_core.h
> @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
> /** @internal Check the status of a Tx descriptor */
> typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
>
> +/** @internal Fill Rx sw-ring with Tx buffers in direct rearm mode */
> +typedef int (*eth_tx_fill_sw_ring_t)(void *txq,
> + struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> +
> +/** @internal Flush Rx descriptor in direct rearm mode */
> +typedef int (*eth_rx_flush_descriptor_t)(void *rxq, uint16_t nb_rearm);
> +
> /**
> * @internal
> * Structure used to hold opaque pointers to internal ethdev Rx/Tx
> @@ -90,6 +97,8 @@ struct rte_eth_fp_ops {
> eth_rx_queue_count_t rx_queue_count;
> /** Check the status of a Rx descriptor. */
> eth_rx_descriptor_status_t rx_descriptor_status;
> + /** Flush Rx descriptor in direct rearm mode */
> + eth_rx_flush_descriptor_t rx_flush_descriptor;
> /** Rx queues data. */
> struct rte_ethdev_qdata rxq;
> uintptr_t reserved1[3];
> @@ -106,6 +115,8 @@ struct rte_eth_fp_ops {
> eth_tx_prep_t tx_pkt_prepare;
> /** Check the status of a Tx descriptor. */
> eth_tx_descriptor_status_t tx_descriptor_status;
> + /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> + eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> /** Tx queues data. */
> struct rte_ethdev_qdata txq;
> uintptr_t reserved2[3];
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 17201fbe0f..f39f02a69b 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -298,6 +298,12 @@ EXPERIMENTAL {
> rte_flow_get_q_aged_flows;
> rte_mtr_meter_policy_get;
> rte_mtr_meter_profile_get;
> +
> + # added in 23.03
> + rte_eth_dev_direct_rearm;
> + rte_eth_rx_flush_descriptor;
> + rte_eth_rx_queue_rearm_data_get;
> + rte_eth_tx_fill_sw_ring;
> };
>
> INTERNAL {
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
2023-01-04 7:30 ` [PATCH v3 2/3] net/i40e: " Feifei Wang
@ 2023-02-02 14:37 ` Konstantin Ananyev
2023-02-24 9:50 ` RE: " Feifei Wang
0 siblings, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2023-02-02 14:37 UTC (permalink / raw)
To: Feifei Wang, Yuying Zhang, Beilei Xing, Ruifeng Wang
Cc: dev, nd, Honnappa Nagarahalli
04/01/2023 07:30, Feifei Wang wrote:
> Add internal API to separate direct rearm operations between
> Rx and Tx.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> drivers/net/i40e/i40e_ethdev.c | 1 +
> drivers/net/i40e/i40e_ethdev.h | 2 +
> drivers/net/i40e/i40e_rxtx.c | 19 +++++++++
> drivers/net/i40e/i40e_rxtx.h | 4 ++
> drivers/net/i40e/i40e_rxtx_vec_common.h | 54 +++++++++++++++++++++++++
> drivers/net/i40e/i40e_rxtx_vec_neon.c | 42 +++++++++++++++++++
> 6 files changed, 122 insertions(+)
>
> diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
> index 7726a89d99..29c1ce2470 100644
> --- a/drivers/net/i40e/i40e_ethdev.c
> +++ b/drivers/net/i40e/i40e_ethdev.c
> @@ -497,6 +497,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
> .flow_ops_get = i40e_dev_flow_ops_get,
> .rxq_info_get = i40e_rxq_info_get,
> .txq_info_get = i40e_txq_info_get,
> + .rxq_rearm_data_get = i40e_rxq_rearm_data_get,
> .rx_burst_mode_get = i40e_rx_burst_mode_get,
> .tx_burst_mode_get = i40e_tx_burst_mode_get,
> .timesync_enable = i40e_timesync_enable,
> diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
> index fe943a45ff..6a6a2a6d3c 100644
> --- a/drivers/net/i40e/i40e_ethdev.h
> +++ b/drivers/net/i40e/i40e_ethdev.h
> @@ -1352,6 +1352,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> struct rte_eth_rxq_info *qinfo);
> void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> struct rte_eth_txq_info *qinfo);
> +void i40e_rxq_rearm_data_get(struct rte_eth_dev *dev, uint16_t queue_id,
> + struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> struct rte_eth_burst_mode *mode);
> int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index 788ffb51c2..d8d801acaf 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -3197,6 +3197,19 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> qinfo->conf.offloads = txq->offloads;
> }
>
> +void
> +i40e_rxq_rearm_data_get(struct rte_eth_dev *dev, uint16_t queue_id,
> + struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> +{
> + struct i40e_rx_queue *rxq;
> +
> + rxq = dev->data->rx_queues[queue_id];
> +
> + rxq_rearm_data->rx_sw_ring = rxq->sw_ring;
> + rxq_rearm_data->rearm_start = &rxq->rxrearm_start;
> + rxq_rearm_data->rearm_nb = &rxq->rxrearm_nb;
> +}
> +
> #ifdef RTE_ARCH_X86
> static inline bool
> get_avx_supported(bool request_avx512)
> @@ -3321,6 +3334,9 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
> PMD_INIT_LOG(DEBUG, "Using Vector Rx (port %d).",
> dev->data->port_id);
> dev->rx_pkt_burst = i40e_recv_pkts_vec;
> +#ifdef RTE_ARCH_ARM64
> + dev->rx_flush_descriptor = i40e_rx_flush_descriptor_vec;
> +#endif
> }
> #endif /* RTE_ARCH_X86 */
> } else if (!dev->data->scattered_rx && ad->rx_bulk_alloc_allowed) {
> @@ -3484,6 +3500,9 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
> PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
> dev->data->port_id);
> dev->tx_pkt_burst = i40e_xmit_pkts_vec;
> +#ifdef RTE_ARCH_ARM64
> + dev->tx_fill_sw_ring = i40e_tx_fill_sw_ring;
> +#endif
As far as I can see, tx_fill_sw_ring() is not ARM specific, so is there
any reason to guard it with #ifdef ARM?
Actually, the same question for rx_flush_descriptor() - can we have a
generic version too?
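For reference, a generic (plain C) flush looks like it would be almost a line-for-line copy of the NEON loop in this patch - an untested sketch, only to show that nothing in the logic is architecture specific:

int
i40e_rx_flush_descriptor(void *rx_queue, uint16_t nb_rearm)
{
        struct i40e_rx_queue *rxq = rx_queue;
        struct i40e_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
        volatile union i40e_rx_desc *rxdp = rxq->rx_ring + rxq->rxrearm_start;
        uint16_t rx_id;
        uint16_t i;

        for (i = 0; i < nb_rearm; i++) {
                uint64_t paddr = rxep[i].mbuf->buf_iova + RTE_PKTMBUF_HEADROOM;

                /* Initialize rxdp descs with the new buffer address. */
                rxdp[i].read.hdr_addr = 0;
                rxdp[i].read.pkt_addr = rte_cpu_to_le_64(paddr);
        }

        /* Update the descriptor initializer index, handling wrap-around. */
        rxq->rxrearm_start += nb_rearm;
        rx_id = rxq->rxrearm_start - 1;
        if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
                rxq->rxrearm_start -= rxq->nb_rx_desc;
                rx_id = rxq->rxrearm_start == 0 ?
                        rxq->nb_rx_desc - 1 : rxq->rxrearm_start - 1;
        }
        rxq->rxrearm_nb -= nb_rearm;

        rte_io_wmb();
        /* Update the tail pointer on the NIC. */
        I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);

        return 0;
}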
> #endif /* RTE_ARCH_X86 */
> } else {
> PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
> diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
> index 5e6eecc501..8a29bd89df 100644
> --- a/drivers/net/i40e/i40e_rxtx.h
> +++ b/drivers/net/i40e/i40e_rxtx.h
> @@ -233,6 +233,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
> int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
> int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
>
> +int i40e_tx_fill_sw_ring(void *tx_queue,
> + struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> +int i40e_rx_flush_descriptor_vec(void *rx_queue, uint16_t nb_rearm);
> +
> uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> uint16_t nb_pkts);
> uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
> index fe1a6ec75e..eb96301a43 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_common.h
> +++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
> @@ -146,6 +146,60 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
> return txq->tx_rs_thresh;
> }
>
> +int
> +i40e_tx_fill_sw_ring(void *tx_queue,
> + struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> +{
> + struct i40e_tx_queue *txq = tx_queue;
> + struct i40e_tx_entry *txep;
> + void **rxep;
> + struct rte_mbuf *m;
> + int i, n;
> + int nb_rearm = 0;
> +
> + if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
> + txq->nb_tx_free > txq->tx_free_thresh)
> + return 0;
> +
> + /* check DD bits on threshold descriptor */
> + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> + rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> + return 0;
> +
> + n = txq->tx_rs_thresh;
> +
> + /* first buffer to free from S/W ring is at index
> + * tx_next_dd - (tx_rs_thresh-1)
> + */
> + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> + rxep = rxq_rearm_data->rx_sw_ring;
> + rxep += *rxq_rearm_data->rearm_start;
> +
> + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> + /* directly put mbufs from Tx to Rx */
> + for (i = 0; i < n; i++, rxep++, txep++)
> + *rxep = txep[0].mbuf;
> + } else {
> + for (i = 0; i < n; i++, rxep++) {
> + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> + if (m != NULL) {
> + *rxep = m;
> + nb_rearm++;
> + }
> + }
> + n = nb_rearm;
> + }
> +
> + /* update counters for Tx */
> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> + if (txq->tx_next_dd >= txq->nb_tx_desc)
> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> +
> + return n;
> +}
> +
> static __rte_always_inline void
> tx_backlog_entry(struct i40e_tx_entry *txep,
> struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> index 12e6f1cbcb..1509d3223b 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> +++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> @@ -739,6 +739,48 @@ i40e_xmit_fixed_burst_vec(void *__rte_restrict tx_queue,
> return nb_pkts;
> }
>
> +int
> +i40e_rx_flush_descriptor_vec(void *rx_queue, uint16_t nb_rearm)
> +{
> + struct i40e_rx_queue *rxq = rx_queue;
> + struct i40e_rx_entry *rxep;
> + volatile union i40e_rx_desc *rxdp;
> + uint16_t rx_id;
> + uint64x2_t dma_addr;
> + uint64_t paddr;
> + uint16_t i;
> +
> + rxdp = rxq->rx_ring + rxq->rxrearm_start;
> + rxep = &rxq->sw_ring[rxq->rxrearm_start];
> +
> + for (i = 0; i < nb_rearm; i++) {
> + /* Initialize rxdp descs */
> + paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
> + dma_addr = vdupq_n_u64(paddr);
> + /* flush desc with pa dma_addr */
> + vst1q_u64((uint64_t *)&rxdp++->read, dma_addr);
> + }
> +
> + /* Update the descriptor initializer index */
> + rxq->rxrearm_start += nb_rearm;
> + rx_id = rxq->rxrearm_start - 1;
> +
> + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> + rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> + if (!rxq->rxrearm_start)
> + rx_id = rxq->nb_rx_desc - 1;
> + else
> + rx_id = rxq->rxrearm_start - 1;
> + }
> + rxq->rxrearm_nb -= nb_rearm;
> +
> + rte_io_wmb();
> + /* Update the tail pointer on the NIC */
> + I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
> +
> + return 0;
> +}
> +
> void __rte_cold
> i40e_rx_queue_release_mbufs_vec(struct i40e_rx_queue *rxq)
> {
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-01-04 10:11 ` Morten Brørup
@ 2023-02-24 8:55 ` Feifei Wang
0 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-02-24 8:55 UTC (permalink / raw)
To: Morten Brørup, thomas, Ferruh Yigit, Andrew Rybchenko
Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli,
Ruifeng Wang, nd, nd
Sorry for my delayed reply.
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, January 4, 2023 6:11 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
>
> > From: Feifei Wang [mailto:Feifei.Wang2@arm.com]
> > Sent: Wednesday, 4 January 2023 09.51
> >
> > Hi, Morten
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: Wednesday, January 4, 2023 4:22 PM
> > >
> > > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > > Sent: Wednesday, 4 January 2023 08.31
> > > >
> > > > Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> > rearm
> > > > mode for separate Rx and Tx Operation. And this can support
> > different
> > > > multiple sources in direct rearm mode. For examples, Rx driver is
> > > > ixgbe, and Tx driver is i40e.
> > > >
> > > > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > ---
> > >
> > > This feature looks very promising for performance. I am pleased to
> > see
> > > progress on it.
> > >
> > Thanks very much for your reviewing.
> >
> > > Please confirm that the fast path functions are still thread safe,
> > i.e. one EAL
> > > thread may be calling rte_eth_rx_burst() while another EAL thread is
> > calling
> > > rte_eth_tx_burst().
> > >
> > For the multiple threads safe, like we say in cover letter, current
> > direct-rearm support Rx and Tx in the same thread. If we consider
> > multiple threads like 'pipeline model', there need to add 'lock' in
> > the data path which can decrease the performance.
> > Thus, the first step we do is try to enable direct-rearm in the single
> > thread, and then we will consider to enable direct rearm in multiple
> > threads and improve the performance.
>
> OK, doing it in steps is a good idea for a feature like this - makes it easier to
> understand and review.
>
> When proceeding to add support for the "pipeline model", perhaps the
> lockless principles from the rte_ring can be used in this feature too.
>
> From a high level perspective, I'm somewhat worried that releasing a "work-
> in-progress" version of this feature in some DPDK version will cause API/ABI
> breakage discussions when progressing to the next steps of the
> implementation to make the feature more complete. Not only support for
> thread safety across simultaneous RX and TX, but also support for multiple
> mbuf pools per RX queue [1]. Marking the functions experimental should
> alleviate such discussions, but there is a risk of pushback to not break the
> API/ABI anyway.
>
> [1]:
> https://elixir.bootlin.com/dpdk/v22.11.1/source/lib/ethdev/rte_ethdev.h#L1
> 105
>
[Feifei] I think the subsequent upgrades will not significantly affect the stability
of the API we currently define.
For thread safety across simultaneous RX and TX, the future lockless changes
will happen in the PMD layer, such as CAS load/store for the PMD's Rx queue index.
Thus, this cannot affect the stability of the upper API.
For multiple mbuf pools per RX queue, direct-rearm just puts Tx buffers into the Rx ring, and
it does not care which mempool a buffer comes from.
Buffers from different mempools are eventually freed back into their respective pools in the
non-FAST_FREE path.
I think this is a mistake in the cover letter. The previous direct-rearm could only support FAST_FREE,
so it had the constraint that buffers must come from the same mempool. The latest version can also
support the non-FAST_FREE path, but we forgot to update the cover letter.
> [...]
>
> > > > --- a/lib/ethdev/ethdev_driver.h
> > > > +++ b/lib/ethdev/ethdev_driver.h
> > > > @@ -59,6 +59,10 @@ struct rte_eth_dev {
> > > > eth_rx_descriptor_status_t rx_descriptor_status;
> > > > /** Check the status of a Tx descriptor */
> > > > eth_tx_descriptor_status_t tx_descriptor_status;
> > > > + /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > > > + eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> > >
> > > What is "Rx sw-ring"? Please confirm that this is not an Intel PMD
> > specific
> > > term and/or implementation detail, e.g. by providing a conceptual
> > > implementation for a non-Intel PMD, e.g. mlx5.
> > Rx sw_ring is used to store mbufs in intel PMD. This is the same as
> > 'rxq->elts'
> > in mlx5.
>
> Sounds good.
>
> Then all we need is consensus on a generic name for this, unless "Rx sw-ring"
> already is the generic name. (I'm not a PMD developer, so I might be
> completely off track here.) Naming is often debatable, so I'll stop talking
> about it now - I only wanted to highlight that we should avoid vendor-
> specific terms in public APIs intended to be implemented by multiple vendors.
> On the other hand... if no other vendors raise their voices before merging
> into the DPDK main repository, they forfeit their right to complain about it. ;-)
>
> > Agree with that we need to providing a conceptual implementation for
> > all PMDs.
>
> My main point is that we should ensure that the feature is not too tightly
> coupled with the way Intel PMDs implement mbuf handling. Providing a
> conceptual implementation for a non-Intel PMD is one way of checking this.
>
> The actual implementation in other PMDs could be left up to the various NIC
> vendors.
Yes. And we will rename our API to make it suitable for all vendors:
rte_eth_direct_rearm -> rte_eth_buf_cycle (upper API for direct rearm)
rte_eth_tx_fill_sw_ring -> rte_eth_tx_buf_stash (Tx queue stashes buffers into the Rx ring)
rte_eth_rx_flush_descriptor -> rte_eth_rx_descriptors_refill (Rx queue refills its descriptors)
rte_eth_rxq_rearm_data {
void *rx_sw_ring;
uint16_t *rearm_start;
uint16_t *rearm_nb;
}
->
struct *rxq_recycle_info {
rte_mbuf **buf_ring;
uint16_t *offset = (uint16_t *)(&rq->ci);
uint16_t *end;
uint16_t ring_size;
}
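To make the renaming proposal a bit more concrete, here is a minimal sketch of how a PMD
could expose its Rx queue through the proposed structure (the structure layout follows the
proposal above; the callback and queue field names are only placeholders, not a final API):

struct rxq_recycle_info {
	struct rte_mbuf **buf_ring; /* Rx software ring of mbuf pointers */
	uint16_t *offset;           /* rearm-start index (consumer index) */
	uint16_t *end;              /* rearm-end index (producer index) */
	uint16_t ring_size;         /* ring size, a power of two */
};

/* Hypothetical per-PMD callback filling the structure from its Rx queue,
 * so the upper rte_eth_buf_cycle() API can recycle buffers into it.
 */
static void
pmd_rxq_recycle_info_get(struct pmd_rx_queue *rxq,
		struct rxq_recycle_info *info)
{
	info->buf_ring  = rxq->mbuf_ring;      /* placeholder field names */
	info->offset    = &rxq->consumer_idx;
	info->end       = &rxq->producer_idx;
	info->ring_size = rxq->nb_desc;
}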
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-02-02 14:33 ` Konstantin Ananyev
@ 2023-02-24 9:45 ` Feifei Wang
2023-02-27 19:31 ` Konstantin Ananyev
0 siblings, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2023-02-24 9:45 UTC (permalink / raw)
To: Konstantin Ananyev, thomas, Ferruh Yigit, Andrew Rybchenko
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd
Hi, Konstantin
Thanks for your review, and sorry for my delayed response.
For your comments, we put forward several improvement plans below.
Best Regards
Feifei
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Thursday, February 2, 2023 10:33 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; thomas@monjalon.net;
> Ferruh Yigit <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
>
> Hi Feifei,
>
> > Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> > mode for separate Rx and Tx Operation. And this can support different
> > multiple sources in direct rearm mode. For examples, Rx driver is
> > ixgbe, and Tx driver is i40e.
>
>
> Thanks for your effort and thanks for taking comments provided into
> consideration.
> That approach looks much better then previous ones.
> Few nits below.
> Konstantin
>
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> > lib/ethdev/ethdev_driver.h | 10 ++
> > lib/ethdev/ethdev_private.c | 2 +
> > lib/ethdev/rte_ethdev.c | 52 +++++++++++
> > lib/ethdev/rte_ethdev.h | 174
> +++++++++++++++++++++++++++++++++++
> > lib/ethdev/rte_ethdev_core.h | 11 +++
> > lib/ethdev/version.map | 6 ++
> > 6 files changed, 255 insertions(+)
> >
> > diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> > index 6a550cfc83..bc539ec862 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -59,6 +59,10 @@ struct rte_eth_dev {
> > eth_rx_descriptor_status_t rx_descriptor_status;
> > /** Check the status of a Tx descriptor */
> > eth_tx_descriptor_status_t tx_descriptor_status;
> > + /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > + eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> > + /** Flush Rx descriptor in direct rearm mode */
> > + eth_rx_flush_descriptor_t rx_flush_descriptor;
> >
> > /**
> > * Device data that is shared between primary and secondary
> > processes @@ -504,6 +508,10 @@ typedef void
> (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
> > typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
> > uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
> >
> > +/**< @internal Get rearm data for a receive queue of an Ethernet
> > +device. */ typedef void (*eth_rxq_rearm_data_get_t)(struct rte_eth_dev
> *dev,
> > + uint16_t tx_queue_id, struct rte_eth_rxq_rearm_data
> > +*rxq_rearm_data);
> > +
> > typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
> > uint16_t queue_id, struct rte_eth_burst_mode *mode);
> >
> > @@ -1215,6 +1223,8 @@ struct eth_dev_ops {
> > eth_rxq_info_get_t rxq_info_get;
> > /** Retrieve Tx queue information */
> > eth_txq_info_get_t txq_info_get;
> > + /** Get Rx queue rearm data */
> > + eth_rxq_rearm_data_get_t rxq_rearm_data_get;
> > eth_burst_mode_get_t rx_burst_mode_get; /**< Get Rx burst
> mode */
> > eth_burst_mode_get_t tx_burst_mode_get; /**< Get Tx burst
> mode */
> > eth_fw_version_get_t fw_version_get; /**< Get firmware version
> */
> > diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
> > index 48090c879a..c5dd5e30f6 100644
> > --- a/lib/ethdev/ethdev_private.c
> > +++ b/lib/ethdev/ethdev_private.c
> > @@ -276,6 +276,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
> > fpo->rx_queue_count = dev->rx_queue_count;
> > fpo->rx_descriptor_status = dev->rx_descriptor_status;
> > fpo->tx_descriptor_status = dev->tx_descriptor_status;
> > + fpo->tx_fill_sw_ring = dev->tx_fill_sw_ring;
> > + fpo->rx_flush_descriptor = dev->rx_flush_descriptor;
> >
> > fpo->rxq.data = dev->data->rx_queues;
> > fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > 5d5e18db1e..2af5cb42fe 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -3282,6 +3282,21 @@
> rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id, uint16_t
> rx_queue_id,
> > stat_idx, STAT_QMAP_RX));
> > }
> >
> > +int
> > +rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
> > + uint16_t tx_port_id, uint16_t tx_rx_queue_id,
> > + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > + int nb_rearm = 0;
> > +
> > + nb_rearm = rte_eth_tx_fill_sw_ring(tx_port_id, tx_rx_queue_id,
> > +rxq_rearm_data);
> > +
> > + if (nb_rearm > 0)
> > + return rte_eth_rx_flush_descriptor(rx_port_id, rx_queue_id,
> > +nb_rearm);
> > +
> > + return 0;
> > +}
> > +
> > int
> > rte_eth_dev_fw_version_get(uint16_t port_id, char *fw_version, size_t
> fw_size)
> > {
> > @@ -5323,6 +5338,43 @@ rte_eth_tx_queue_info_get(uint16_t port_id,
> uint16_t queue_id,
> > return 0;
> > }
> >
> > +int
> > +rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t queue_id,
> > + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > + struct rte_eth_dev *dev;
> > +
> > + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> > + dev = &rte_eth_devices[port_id];
> > +
> > + if (queue_id >= dev->data->nb_rx_queues) {
> > + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n",
> queue_id);
> > + return -EINVAL;
> > + }
> > +
> > + if (rxq_rearm_data == NULL) {
> > + RTE_ETHDEV_LOG(ERR, "Cannot get ethdev port %u Rx
> queue %u rearm data to NULL\n",
> > + port_id, queue_id);
> > + return -EINVAL;
> > + }
> > +
> > + if (dev->data->rx_queues == NULL ||
> > + dev->data->rx_queues[queue_id] == NULL) {
> > + RTE_ETHDEV_LOG(ERR,
> > + "Rx queue %"PRIu16" of device with port_id=%"
> > + PRIu16" has not been setup\n",
> > + queue_id, port_id);
> > + return -EINVAL;
> > + }
> > +
> > + if (*dev->dev_ops->rxq_rearm_data_get == NULL)
> > + return -ENOTSUP;
> > +
> > + dev->dev_ops->rxq_rearm_data_get(dev, queue_id,
> rxq_rearm_data);
> > +
> > + return 0;
> > +}
> > +
> > int
> > rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
> > struct rte_eth_burst_mode *mode) diff --git
> > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > c129ca1eaf..381c3d535f 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
> > uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
> > } __rte_cache_min_aligned;
> >
> > +/**
> > + * @internal
> > + * Structure used to hold pointers to internal ethdev Rx rearm data.
> > + * The main purpose is to load Rx queue rearm data in direct rearm mode.
> > + */
>
> I think this structure owes a lot more expalantion.
> What each fields suppose to do and what are the constraints, etc.
>
> In general, more doc will be needed to explain that feature.
You are right. We will add more explanation for it and also update the doc
to explain the feature.
> > +struct rte_eth_rxq_rearm_data {
> > + void *rx_sw_ring;
>
> That's misleading, we always suppose to store mbufs ptrs here, so why not
> be direct:
> struct rte_mbuf **rx_sw_ring;
>
Agree.
> > + uint16_t *rearm_start;
> > + uint16_t *rearm_nb;
>
> I know that for Intel NICs uint16_t is sufficient, wonder would it always be
> for other vendors?
> Another thing to consider the case when ring position wrapping?
> Again I know that it is not required for Intel NICs, but would it be sufficient
> for API that supposed to be general?
>
For this, we re-define this structure:
rte_eth_rxq_rearm_data {
void *rx_sw_ring;
uint16_t *rearm_start;
uint16_t *rearm_nb;
}
->
struct *rxq_recycle_info {
rte_mbuf **buf_ring;
uint16_t *offset = (uint16_t *)(&rq->ci);
uint16_t *end;
uint16_t ring_size;
}
For the new structure, *offset is a pointer to the rearm-start index of the
Rx buffer ring (consumer index), and *end is a pointer to the rearm-end index
of the Rx buffer ring (producer index).
1. We looked at different PMDs: some PMDs use 'uint16_t' as the index size, like the Intel PMDs,
and some use 'uint32_t', like the MLX5 or thunderx PMDs.
For a PMD using 'uint32_t', rearm starts at 'buf_ring[offset & (ring_size - 1)]', and 'uint16_t'
is enough for the ring size.
2. Good question. In the general path, there is a constraint that 'nb_rearm < ring_size - rq->ci',
which ensures no ring wrapping during rearm. Thus, in direct-rearm, we will rely on this to
solve ring wrapping.
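To illustrate constraint (2), a minimal sketch of how a rearm burst could be clamped so that
it never crosses the ring end ('offset' and 'ring_size' are the fields of the structure above;
the clamping is only an illustration, not the final implementation):

/* Slot where rearm starts; the mask handles PMDs whose index runs past
 * ring_size and is a no-op for PMDs that keep the index inside the ring.
 */
uint16_t start = *info->offset & (info->ring_size - 1);
/* Slots left before the physical end of the ring. */
uint16_t room = info->ring_size - start;

if (nb_rearm > room)
	nb_rearm = room;	/* recycle the remainder in a later burst */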
>
> > +} __rte_cache_min_aligned;
> > +
> > /* Generic Burst mode flag definition, values can be ORed. */
> >
> > /**
> > @@ -3184,6 +3195,34 @@ int
> rte_eth_dev_set_rx_queue_stats_mapping(uint16_t port_id,
> > uint16_t rx_queue_id,
> > uint8_t stat_idx);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Directly put Tx freed buffers into Rx sw-ring and flush desc.
> > + *
> > + * @param rx_port_id
> > + * Port identifying the receive side.
> > + * @param rx_queue_id
> > + * The index of the receive queue identifying the receive side.
> > + * The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + * to rte_eth_dev_configure().
> > + * @param tx_port_id
> > + * Port identifying the transmit side.
> > + * @param tx_queue_id
> > + * The index of the transmit queue identifying the transmit side.
> > + * The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + * to rte_eth_dev_configure().
> > + * @param rxq_rearm_data
> > + * A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> > + * @return
> > + * - 0: Success
> > + */
> > +__rte_experimental
> > +int rte_eth_dev_direct_rearm(uint16_t rx_port_id, uint16_t rx_queue_id,
> > + uint16_t tx_port_id, uint16_t tx_queue_id,
> > + struct rte_eth_rxq_rearm_data *rxq_rearm_data);
>
>
> I think we need one more parameter for that function 'uint16_t offset'
> or so.
> So _rearm_ will start to populate rx_sw_ring from *rearm_start + offset
> position. That way we can support populating from different sources.
> Or should 'offset' be part of truct rte_eth_rxq_rearm_data?
>
Agree, please see above, we will do the change in the structure.
> > +
> > /**
> > * Retrieve the Ethernet address of an Ethernet device.
> > *
> > @@ -4782,6 +4821,27 @@ int rte_eth_rx_queue_info_get(uint16_t
> port_id, uint16_t queue_id,
> > int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
> > struct rte_eth_txq_info *qinfo);
> >
> > +/**
> > + * Get rearm data about given ports's Rx queue.
> > + *
> > + * @param port_id
> > + * The port identifier of the Ethernet device.
> > + * @param queue_id
> > + * The Rx queue on the Ethernet device for which rearm data
> > + * will be got.
> > + * @param rxq_rearm_data
> > + * A pointer to a structure of type *rte_eth_txq_rearm_data* to be filled.
> > + *
> > + * @return
> > + * - 0: Success
> > + * - -ENODEV: If *port_id* is invalid.
> > + * - -ENOTSUP: routine is not supported by the device PMD.
> > + * - -EINVAL: The queue_id is out of range.
> > + */
> > +__rte_experimental
> > +int rte_eth_rx_queue_rearm_data_get(uint16_t port_id, uint16_t
> queue_id,
> > + struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> > +
> > /**
> > * Retrieve information about the Rx packet burst mode.
> > *
> > @@ -6103,6 +6163,120 @@ static inline int
> rte_eth_tx_descriptor_status(uint16_t port_id,
> > return (*p->tx_descriptor_status)(qd, offset);
> > }
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Fill Rx sw-ring with Tx buffers in direct rearm mode.
> > + *
> > + * @param port_id
> > + * The port identifier of the Ethernet device.
> > + * @param queue_id
> > + * The index of the transmit queue.
> > + * The value must be in the range [0, nb_tx_queue - 1] previously
> supplied
> > + * to rte_eth_dev_configure().
> > + * @param rxq_rearm_data
> > + * A pointer to a structure of type *rte_eth_rxq_rearm_data* to be filled
> with
> > + * the rearm data of a receive queue.
> > + * @return
> > + * - The number buffers correct to be filled in the Rx sw-ring.
> > + * - (-EINVAL) bad port or queue.
> > + * - (-ENODEV) bad port.
> > + * - (-ENOTSUP) if the device does not support this function.
> > + *
> > + */
> > +__rte_experimental
> > +static inline int rte_eth_tx_fill_sw_ring(uint16_t port_id,
> > + uint16_t queue_id, struct rte_eth_rxq_rearm_data *rxq_rearm_data)
> {
> > + struct rte_eth_fp_ops *p;
> > + void *qd;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > + if (port_id >= RTE_MAX_ETHPORTS ||
> > + queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > + RTE_ETHDEV_LOG(ERR,
> > + "Invalid port_id=%u or queue_id=%u\n",
> > + port_id, queue_id);
> > + return -EINVAL;
> > + }
> > +#endif
> > +
> > + p = &rte_eth_fp_ops[port_id];
> > + qd = p->txq.data[queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_TX
> > + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> > +
> > + if (qd == NULL) {
> > + RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for
> port_id=%u\n",
> > + queue_id, port_id);
> > + return -ENODEV;
> > + }
> > +#endif
> > +
> > + if (p->tx_fill_sw_ring == NULL)
> > + return -ENOTSUP;
> > +
> > + return p->tx_fill_sw_ring(qd, rxq_rearm_data); }
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> > +notice
> > + *
> > + * Flush Rx descriptor in direct rearm mode.
> > + *
> > + * @param port_id
> > + * The port identifier of the Ethernet device.
> > + * @param queue_id
> > + * The index of the receive queue.
> > + * The value must be in the range [0, nb_rx_queue - 1] previously
> supplied
> > + * to rte_eth_dev_configure().
> > + *@param nb_rearm
> > + * The number of Rx sw-ring buffers need to be flushed.
> > + * @return
> > + * - (0) if successful.
> > + * - (-EINVAL) bad port or queue.
> > + * - (-ENODEV) bad port.
> > + * - (-ENOTSUP) if the device does not support this function.
> > + */
> > +__rte_experimental
> > +static inline int rte_eth_rx_flush_descriptor(uint16_t port_id,
> > + uint16_t queue_id, uint16_t nb_rearm) {
> > + struct rte_eth_fp_ops *p;
> > + void *qd;
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > + if (port_id >= RTE_MAX_ETHPORTS ||
> > + queue_id >= RTE_MAX_QUEUES_PER_PORT) {
> > + RTE_ETHDEV_LOG(ERR,
> > + "Invalid port_id=%u or queue_id=%u\n",
> > + port_id, queue_id);
> > + return -EINVAL;
> > + }
> > +#endif
> > +
> > + p = &rte_eth_fp_ops[port_id];
> > + qd = p->rxq.data[queue_id];
> > +
> > +#ifdef RTE_ETHDEV_DEBUG_RX
> > + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
> > +
> > + if (qd == NULL) {
> > + RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for
> port_id=%u\n",
> > + queue_id, port_id);
> > + return -ENODEV;
> > + }
> > +#endif
> > +
> > + if (p->rx_flush_descriptor == NULL)
> > + return -ENOTSUP;
> > +
> > + return p->rx_flush_descriptor(qd, nb_rearm); }
> > +
> > /**
> > * @internal
> > * Helper routine for rte_eth_tx_burst().
> > diff --git a/lib/ethdev/rte_ethdev_core.h
> > b/lib/ethdev/rte_ethdev_core.h index dcf8adab92..5ecb57f6f0 100644
> > --- a/lib/ethdev/rte_ethdev_core.h
> > +++ b/lib/ethdev/rte_ethdev_core.h
> > @@ -56,6 +56,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq,
> uint16_t offset);
> > /** @internal Check the status of a Tx descriptor */
> > typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t
> > offset);
> >
> > +/** @internal Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > +typedef int (*eth_tx_fill_sw_ring_t)(void *txq,
> > + struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> > +
> > +/** @internal Flush Rx descriptor in direct rearm mode */ typedef int
> > +(*eth_rx_flush_descriptor_t)(void *rxq, uint16_t nb_rearm);
> > +
> > /**
> > * @internal
> > * Structure used to hold opaque pointers to internal ethdev Rx/Tx
> > @@ -90,6 +97,8 @@ struct rte_eth_fp_ops {
> > eth_rx_queue_count_t rx_queue_count;
> > /** Check the status of a Rx descriptor. */
> > eth_rx_descriptor_status_t rx_descriptor_status;
> > + /** Flush Rx descriptor in direct rearm mode */
> > + eth_rx_flush_descriptor_t rx_flush_descriptor;
> > /** Rx queues data. */
> > struct rte_ethdev_qdata rxq;
> > uintptr_t reserved1[3];
> > @@ -106,6 +115,8 @@ struct rte_eth_fp_ops {
> > eth_tx_prep_t tx_pkt_prepare;
> > /** Check the status of a Tx descriptor. */
> > eth_tx_descriptor_status_t tx_descriptor_status;
> > + /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
> > + eth_tx_fill_sw_ring_t tx_fill_sw_ring;
> > /** Tx queues data. */
> > struct rte_ethdev_qdata txq;
> > uintptr_t reserved2[3];
> > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > 17201fbe0f..f39f02a69b 100644
> > --- a/lib/ethdev/version.map
> > +++ b/lib/ethdev/version.map
> > @@ -298,6 +298,12 @@ EXPERIMENTAL {
> > rte_flow_get_q_aged_flows;
> > rte_mtr_meter_policy_get;
> > rte_mtr_meter_profile_get;
> > +
> > + # added in 23.03
> > + rte_eth_dev_direct_rearm;
> > + rte_eth_rx_flush_descriptor;
> > + rte_eth_rx_queue_rearm_data_get;
> > + rte_eth_tx_fill_sw_ring;
> > };
> >
> > INTERNAL {
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
2023-02-02 14:37 ` Konstantin Ananyev
@ 2023-02-24 9:50 ` Feifei Wang
2023-02-27 19:35 ` Konstantin Ananyev
0 siblings, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2023-02-24 9:50 UTC (permalink / raw)
To: Konstantin Ananyev, Yuying Zhang, Beilei Xing, Ruifeng Wang
Cc: dev, nd, Honnappa Nagarahalli, nd
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Thursday, February 2, 2023 10:38 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Yuying Zhang
> <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>; Ruifeng
> Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
>
> 04/01/2023 07:30, Feifei Wang пишет:
> > Add internal API to separate direct rearm operations between Rx and
> > Tx.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> > drivers/net/i40e/i40e_ethdev.c | 1 +
> > drivers/net/i40e/i40e_ethdev.h | 2 +
> > drivers/net/i40e/i40e_rxtx.c | 19 +++++++++
> > drivers/net/i40e/i40e_rxtx.h | 4 ++
> > drivers/net/i40e/i40e_rxtx_vec_common.h | 54
> +++++++++++++++++++++++++
> > drivers/net/i40e/i40e_rxtx_vec_neon.c | 42 +++++++++++++++++++
> > 6 files changed, 122 insertions(+)
> >
> > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > b/drivers/net/i40e/i40e_ethdev.c index 7726a89d99..29c1ce2470 100644
> > --- a/drivers/net/i40e/i40e_ethdev.c
> > +++ b/drivers/net/i40e/i40e_ethdev.c
> > @@ -497,6 +497,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops
> = {
> > .flow_ops_get = i40e_dev_flow_ops_get,
> > .rxq_info_get = i40e_rxq_info_get,
> > .txq_info_get = i40e_txq_info_get,
> > + .rxq_rearm_data_get = i40e_rxq_rearm_data_get,
> > .rx_burst_mode_get = i40e_rx_burst_mode_get,
> > .tx_burst_mode_get = i40e_tx_burst_mode_get,
> > .timesync_enable = i40e_timesync_enable,
> > diff --git a/drivers/net/i40e/i40e_ethdev.h
> > b/drivers/net/i40e/i40e_ethdev.h index fe943a45ff..6a6a2a6d3c 100644
> > --- a/drivers/net/i40e/i40e_ethdev.h
> > +++ b/drivers/net/i40e/i40e_ethdev.h
> > @@ -1352,6 +1352,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev,
> uint16_t queue_id,
> > struct rte_eth_rxq_info *qinfo);
> > void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > struct rte_eth_txq_info *qinfo);
> > +void i40e_rxq_rearm_data_get(struct rte_eth_dev *dev, uint16_t
> queue_id,
> > + struct rte_eth_rxq_rearm_data *rxq_rearm_data);
> > int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > struct rte_eth_burst_mode *mode);
> > int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t
> > queue_id, diff --git a/drivers/net/i40e/i40e_rxtx.c
> > b/drivers/net/i40e/i40e_rxtx.c index 788ffb51c2..d8d801acaf 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -3197,6 +3197,19 @@ i40e_txq_info_get(struct rte_eth_dev *dev,
> uint16_t queue_id,
> > qinfo->conf.offloads = txq->offloads;
> > }
> >
> > +void
> > +i40e_rxq_rearm_data_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > + struct i40e_rx_queue *rxq;
> > +
> > + rxq = dev->data->rx_queues[queue_id];
> > +
> > + rxq_rearm_data->rx_sw_ring = rxq->sw_ring;
> > + rxq_rearm_data->rearm_start = &rxq->rxrearm_start;
> > + rxq_rearm_data->rearm_nb = &rxq->rxrearm_nb; }
> > +
> > #ifdef RTE_ARCH_X86
> > static inline bool
> > get_avx_supported(bool request_avx512) @@ -3321,6 +3334,9 @@
> > i40e_set_rx_function(struct rte_eth_dev *dev)
> > PMD_INIT_LOG(DEBUG, "Using Vector Rx (port %d).",
> > dev->data->port_id);
> > dev->rx_pkt_burst = i40e_recv_pkts_vec;
> > +#ifdef RTE_ARCH_ARM64
> > + dev->rx_flush_descriptor =
> i40e_rx_flush_descriptor_vec; #endif
> > }
> > #endif /* RTE_ARCH_X86 */
> > } else if (!dev->data->scattered_rx && ad->rx_bulk_alloc_allowed) {
> > @@ -3484,6 +3500,9 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
> > PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
> > dev->data->port_id);
> > dev->tx_pkt_burst = i40e_xmit_pkts_vec;
> > +#ifdef RTE_ARCH_ARM64
> > + dev->tx_fill_sw_ring = i40e_tx_fill_sw_ring; #endif
>
> As I can see tx_fill_sw_ring() is non ARM specific, any reason to guard it with
> #ifdef ARM?
> Actually same ask for rx_flush_descriptor() - can we have generic version too?
Here we assumed direct-rearm is not enabled on other architectures. Agreed that
we need a generic version to avoid this; I will update it in the next version.
>
> > #endif /* RTE_ARCH_X86 */
> > } else {
> > PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
> diff --git
> > a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h index
> > 5e6eecc501..8a29bd89df 100644
> > --- a/drivers/net/i40e/i40e_rxtx.h
> > +++ b/drivers/net/i40e/i40e_rxtx.h
> > @@ -233,6 +233,10 @@ uint32_t i40e_dev_rx_queue_count(void
> *rx_queue);
> > int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
> > int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
> >
> > +int i40e_tx_fill_sw_ring(void *tx_queue,
> > + struct rte_eth_rxq_rearm_data *rxq_rearm_data); int
> > +i40e_rx_flush_descriptor_vec(void *rx_queue, uint16_t nb_rearm);
> > +
> > uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
> > uint16_t nb_pkts);
> > uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue, diff --git
> > a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > index fe1a6ec75e..eb96301a43 100644
> > --- a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > +++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > @@ -146,6 +146,60 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
> > return txq->tx_rs_thresh;
> > }
> >
> > +int
> > +i40e_tx_fill_sw_ring(void *tx_queue,
> > + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > + struct i40e_tx_queue *txq = tx_queue;
> > + struct i40e_tx_entry *txep;
> > + void **rxep;
> > + struct rte_mbuf *m;
> > + int i, n;
> > + int nb_rearm = 0;
> > +
> > + if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
> > + txq->nb_tx_free > txq->tx_free_thresh)
> > + return 0;
> > +
> > + /* check DD bits on threshold descriptor */
> > + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> > + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> > +
> rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> > + return 0;
> > +
> > + n = txq->tx_rs_thresh;
> > +
> > + /* first buffer to free from S/W ring is at index
> > + * tx_next_dd - (tx_rs_thresh-1)
> > + */
> > + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> > + rxep = rxq_rearm_data->rx_sw_ring;
> > + rxep += *rxq_rearm_data->rearm_start;
> > +
> > + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > + /* directly put mbufs from Tx to Rx */
> > + for (i = 0; i < n; i++, rxep++, txep++)
> > + *rxep = txep[0].mbuf;
> > + } else {
> > + for (i = 0; i < n; i++, rxep++) {
> > + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > + if (m != NULL) {
> > + *rxep = m;
> > + nb_rearm++;
> > + }
> > + }
> > + n = nb_rearm;
> > + }
> > +
> > + /* update counters for Tx */
> > + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> > + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> > + if (txq->tx_next_dd >= txq->nb_tx_desc)
> > + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> > +
> > + return n;
> > +}
> > +
> > static __rte_always_inline void
> > tx_backlog_entry(struct i40e_tx_entry *txep,
> > struct rte_mbuf **tx_pkts, uint16_t nb_pkts) diff --git
> > a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > index 12e6f1cbcb..1509d3223b 100644
> > --- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > +++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
> > @@ -739,6 +739,48 @@ i40e_xmit_fixed_burst_vec(void *__rte_restrict
> tx_queue,
> > return nb_pkts;
> > }
> >
> > +int
> > +i40e_rx_flush_descriptor_vec(void *rx_queue, uint16_t nb_rearm) {
> > + struct i40e_rx_queue *rxq = rx_queue;
> > + struct i40e_rx_entry *rxep;
> > + volatile union i40e_rx_desc *rxdp;
> > + uint16_t rx_id;
> > + uint64x2_t dma_addr;
> > + uint64_t paddr;
> > + uint16_t i;
> > +
> > + rxdp = rxq->rx_ring + rxq->rxrearm_start;
> > + rxep = &rxq->sw_ring[rxq->rxrearm_start];
> > +
> > + for (i = 0; i < nb_rearm; i++) {
> > + /* Initialize rxdp descs */
> > + paddr = (rxep[i].mbuf)->buf_iova +
> RTE_PKTMBUF_HEADROOM;
> > + dma_addr = vdupq_n_u64(paddr);
> > + /* flush desc with pa dma_addr */
> > + vst1q_u64((uint64_t *)&rxdp++->read, dma_addr);
> > + }
> > +
> > + /* Update the descriptor initializer index */
> > + rxq->rxrearm_start += nb_rearm;
> > + rx_id = rxq->rxrearm_start - 1;
> > +
> > + if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
> > + rxq->rxrearm_start = rxq->rxrearm_start - rxq->nb_rx_desc;
> > + if (!rxq->rxrearm_start)
> > + rx_id = rxq->nb_rx_desc - 1;
> > + else
> > + rx_id = rxq->rxrearm_start - 1;
> > + }
> > + rxq->rxrearm_nb -= nb_rearm;
> > +
> > + rte_io_wmb();
> > + /* Update the tail pointer on the NIC */
> > + I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
> > +
> > + return 0;
> > +}
> > +
> > void __rte_cold
> > i40e_rx_queue_release_mbufs_vec(struct i40e_rx_queue *rxq)
> > {
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-02-24 9:45 ` RE: " Feifei Wang
@ 2023-02-27 19:31 ` Konstantin Ananyev
2023-02-28 2:16 ` RE: " Feifei Wang
2023-02-28 8:09 ` Morten Brørup
0 siblings, 2 replies; 145+ messages in thread
From: Konstantin Ananyev @ 2023-02-27 19:31 UTC (permalink / raw)
To: Feifei Wang, Konstantin Ananyev, thomas, Ferruh Yigit, Andrew Rybchenko
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd
Hi Feifei ,
> > > + uint16_t *rearm_start;
> > > + uint16_t *rearm_nb;
> >
> > I know that for Intel NICs uint16_t is sufficient, wonder would it always be
> > for other vendors?
> > Another thing to consider the case when ring position wrapping?
> > Again I know that it is not required for Intel NICs, but would it be sufficient
> > for API that supposed to be general?
> >
> For this, we re-define this structure:
> rte_eth_rxq_rearm_data {
> void *rx_sw_ring;
> uint16_t *rearm_start;
> uint16_t *rearm_nb;
> }
> ->
> struct *rxq_recycle_info {
> rte_mbuf **buf_ring;
> uint16_t *offset = (uint16 *)(&rq->ci);
> uint16_t *end;
> uint16_t ring_size;
>
> }
> For the new structure, *offset is a pointer for rearm-start index of
> Rx buffer ring (consumer index). *end is a pointer for rearm-end index
> Of Rx buffer ring (producer index).
>
> 1. we look up different pmds, some pmds using 'uint_16t' as index size like intel PMD,
> some pmds using 'uint32_t' as index size like MLX5 or thunderx PMD.
> For pmd using 'uint32_t', rearm starts at 'buf_ring[offset & (ring_size -1)]', and 'uint16_t'
> is enough for ring size.
Sounds like a smart idea to me.
>
> 2. Good question. In general path, there is a constraint that 'nb_rearm < ring_size - rq->ci',
> This can ensure no ring wrapping in rearm. Thus in direct-rearm, we will refer to this to
> solve ring wrapping.
Should work, I think...
Just need not to forget to document it :)
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
2023-02-24 9:50 ` RE: " Feifei Wang
@ 2023-02-27 19:35 ` Konstantin Ananyev
2023-02-28 2:15 ` RE: " Feifei Wang
0 siblings, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2023-02-27 19:35 UTC (permalink / raw)
To: Feifei Wang, Konstantin Ananyev, Yuying Zhang, Beilei Xing, Ruifeng Wang
Cc: dev, nd, Honnappa Nagarahalli, nd
> > > +int
> > > +i40e_tx_fill_sw_ring(void *tx_queue,
> > > + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > > + struct i40e_tx_queue *txq = tx_queue;
> > > + struct i40e_tx_entry *txep;
> > > + void **rxep;
> > > + struct rte_mbuf *m;
> > > + int i, n;
> > > + int nb_rearm = 0;
> > > +
> > > + if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
> > > + txq->nb_tx_free > txq->tx_free_thresh)
> > > + return 0;
> > > +
> > > + /* check DD bits on threshold descriptor */
> > > + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> > > + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> > > +
> > rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> > > + return 0;
> > > +
> > > + n = txq->tx_rs_thresh;
> > > +
> > > + /* first buffer to free from S/W ring is at index
> > > + * tx_next_dd - (tx_rs_thresh-1)
> > > + */
> > > + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> > > + rxep = rxq_rearm_data->rx_sw_ring;
> > > + rxep += *rxq_rearm_data->rearm_start;
> > > +
> > > + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > > + /* directly put mbufs from Tx to Rx */
> > > + for (i = 0; i < n; i++, rxep++, txep++)
> > > + *rxep = txep[0].mbuf;
> > > + } else {
> > > + for (i = 0; i < n; i++, rxep++) {
> > > + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
One thing I forgot to ask:
What would happen if this mbuf belongs to different mempool
(not one that we specify at rx_queue_setup())?
Do we need to check it here?
Or would it be upper layer constraint?
Or...?
> > > + if (m != NULL) {
> > > + *rxep = m;
> > > + nb_rearm++;
> > > + }
> > > + }
> > > + n = nb_rearm;
> > > + }
> > > +
> > > + /* update counters for Tx */
> > > + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> > > + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> > > + if (txq->tx_next_dd >= txq->nb_tx_desc)
> > > + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> > > +
> > > + return n;
> > > +}
> > > +
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
2023-02-27 19:35 ` Konstantin Ananyev
@ 2023-02-28 2:15 ` Feifei Wang
2023-03-07 11:01 ` Konstantin Ananyev
0 siblings, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2023-02-28 2:15 UTC (permalink / raw)
To: Konstantin Ananyev, Konstantin Ananyev, Yuying Zhang,
Beilei Xing, Ruifeng Wang
Cc: dev, nd, Honnappa Nagarahalli, nd, nd
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Tuesday, February 28, 2023 3:36 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>; Yuying Zhang
> <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>; Ruifeng
> Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
>
>
>
> > > > +int
> > > > +i40e_tx_fill_sw_ring(void *tx_queue,
> > > > + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > > > + struct i40e_tx_queue *txq = tx_queue;
> > > > + struct i40e_tx_entry *txep;
> > > > + void **rxep;
> > > > + struct rte_mbuf *m;
> > > > + int i, n;
> > > > + int nb_rearm = 0;
> > > > +
> > > > + if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
> > > > + txq->nb_tx_free > txq->tx_free_thresh)
> > > > + return 0;
> > > > +
> > > > + /* check DD bits on threshold descriptor */
> > > > + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> > > > + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> > > > +
> > > rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> > > > + return 0;
> > > > +
> > > > + n = txq->tx_rs_thresh;
> > > > +
> > > > + /* first buffer to free from S/W ring is at index
> > > > + * tx_next_dd - (tx_rs_thresh-1)
> > > > + */
> > > > + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> > > > + rxep = rxq_rearm_data->rx_sw_ring;
> > > > + rxep += *rxq_rearm_data->rearm_start;
> > > > +
> > > > + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > > > + /* directly put mbufs from Tx to Rx */
> > > > + for (i = 0; i < n; i++, rxep++, txep++)
> > > > + *rxep = txep[0].mbuf;
> > > > + } else {
> > > > + for (i = 0; i < n; i++, rxep++) {
> > > > + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
>
> One thing I forgot to ask:
> What would happen if this mbuf belongs to different mempool (not one that
> we specify at rx_queue_setup())?
> Do we need to check it here?
> Or would it be upper layer constraint?
> Or...?
>
First, 'different mempools' only matters for the non-FAST_FREE path in tx_free_buffers.
If buffers belong to different mempools, here is an example:
Buffer 1 from mempool 1, its recycle path is:
-----------------------------------------------------------------------------------------
1. queue_setup: rearm from mempool 1 into Rx sw-ring
2. rte_eth_Rx_burst: used by user app (Rx)
3. rte_eth_Tx_burst: mount on Tx sw-ring
4. rte_eth_direct_rearm: free into Rx sw-ring
or
tx_free_buffers: free into mempool 1 (no fast_free path)
-----------------------------------------------------------------------------------------
Buffer 2 from mempool 2, its recycle path is:
-----------------------------------------------------------------------------------------
1. queue_setup: rearm from mempool 2 into Rx sw-ring
2. rte_eth_Rx_burst: used by user app (Rx)
3. rte_eth_Tx_burst: mount on Tx sw-ring
4. rte_eth_direct_rearm: free into Rx sw-ring
or
tx_free_buffers: free into mempool 2 (no fast_free_path)
-----------------------------------------------------------------------------------------
Thus, buffers from different Tx mempools look the same to Rx. The only difference
is that they will be freed into different mempools if the thread uses the generic free-buffers path.
I think this does not affect direct-rearm mode, and we do not need to check it here.
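For illustration, a rough sketch of the run-to-completion loop behind the two recycle paths
above, using the v3 API names from this patch set (port/queue mapping, burst size and error
handling are simplified):

uint16_t rx_port = 0, rx_queue = 0, tx_port = 0, tx_queue = 0;
struct rte_eth_rxq_rearm_data rearm_data;
struct rte_mbuf *pkts[32];
uint16_t nb_rx, nb_tx;

/* Load the Rx queue rearm data once, after queue setup. */
rte_eth_rx_queue_rearm_data_get(rx_port, rx_queue, &rearm_data);

for (;;) {
	nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, 32);
	if (nb_rx == 0)
		continue;
	/* ... application processing ... */
	nb_tx = rte_eth_tx_burst(tx_port, tx_queue, pkts, nb_rx);
	while (nb_tx < nb_rx)
		rte_pktmbuf_free(pkts[nb_tx++]);
	/* Step 4 above: stash freed Tx buffers straight into the Rx sw-ring,
	 * avoiding the round trip through the mempool.
	 */
	rte_eth_dev_direct_rearm(rx_port, rx_queue, tx_port, tx_queue,
			&rearm_data);
}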
> > > > + if (m != NULL) {
> > > > + *rxep = m;
> > > > + nb_rearm++;
> > > > + }
> > > > + }
> > > > + n = nb_rearm;
> > > > + }
> > > > +
> > > > + /* update counters for Tx */
> > > > + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> > > > + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> > > > + if (txq->tx_next_dd >= txq->nb_tx_desc)
> > > > + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> > > > +
> > > > + return n;
> > > > +}
> > > > +
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-02-27 19:31 ` Konstantin Ananyev
@ 2023-02-28 2:16 ` Feifei Wang
2023-02-28 8:09 ` Morten Brørup
1 sibling, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-02-28 2:16 UTC (permalink / raw)
To: Konstantin Ananyev, Konstantin Ananyev, thomas, Ferruh Yigit,
Andrew Rybchenko
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd, nd
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Tuesday, February 28, 2023 3:32 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>; thomas@monjalon.net; Ferruh Yigit
> <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
>
>
> Hi Feifei ,
>
>
> > > > + uint16_t *rearm_start;
> > > > + uint16_t *rearm_nb;
> > >
> > > I know that for Intel NICs uint16_t is sufficient, wonder would it
> > > always be for other vendors?
> > > Another thing to consider the case when ring position wrapping?
> > > Again I know that it is not required for Intel NICs, but would it be
> > > sufficient for API that supposed to be general?
> > >
> > For this, we re-define this structure:
> > rte_eth_rxq_rearm_data {
> > void *rx_sw_ring;
> > uint16_t *rearm_start;
> > uint16_t *rearm_nb;
> > }
> > ->
> > struct *rxq_recycle_info {
> > rte_mbuf **buf_ring;
> > uint16_t *offset = (uint16 *)(&rq->ci);
> > uint16_t *end;
> > uint16_t ring_size;
> >
> > }
> > For the new structure, *offset is a pointer for rearm-start index of
> > Rx buffer ring (consumer index). *end is a pointer for rearm-end index
> > Of Rx buffer ring (producer index).
> >
> > 1. we look up different pmds, some pmds using 'uint_16t' as index
> > size like intel PMD, some pmds using 'uint32_t' as index size like MLX5 or
> thunderx PMD.
> > For pmd using 'uint32_t', rearm starts at 'buf_ring[offset & (ring_size -1)]',
> and 'uint16_t'
> > is enough for ring size.
>
> Sounds like a smart idea to me.
>
>
> >
> > 2. Good question. In general path, there is a constraint that
> > 'nb_rearm < ring_size - rq->ci', This can ensure no ring wrapping in
> > rearm. Thus in direct-rearm, we will refer to this to solve ring wrapping.
>
> Should work, I think...
> Just need not to forget to document it :)
Agreed, we need to document this.
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-02-27 19:31 ` Konstantin Ananyev
2023-02-28 2:16 ` RE: " Feifei Wang
@ 2023-02-28 8:09 ` Morten Brørup
2023-03-01 7:34 ` RE: " Feifei Wang
1 sibling, 1 reply; 145+ messages in thread
From: Morten Brørup @ 2023-02-28 8:09 UTC (permalink / raw)
To: Konstantin Ananyev, Feifei Wang, Konstantin Ananyev, thomas,
Ferruh Yigit, Andrew Rybchenko
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd
Hi Feifei,
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Monday, 27 February 2023 20.32
>
> Hi Feifei ,
>
>
> > > > + uint16_t *rearm_start;
> > > > + uint16_t *rearm_nb;
> > >
> > > I know that for Intel NICs uint16_t is sufficient, wonder would it always
> be
> > > for other vendors?
> > > Another thing to consider the case when ring position wrapping?
> > > Again I know that it is not required for Intel NICs, but would it be
> sufficient
> > > for API that supposed to be general?
> > >
> > For this, we re-define this structure:
> > rte_eth_rxq_rearm_data {
> > void *rx_sw_ring;
> > uint16_t *rearm_start;
> > uint16_t *rearm_nb;
> > }
> > ->
> > struct *rxq_recycle_info {
> > rte_mbuf **buf_ring;
> > uint16_t *offset = (uint16 *)(&rq->ci);
> > uint16_t *end;
> > uint16_t ring_size;
> >
> > }
> > For the new structure, *offset is a pointer for rearm-start index of
> > Rx buffer ring (consumer index). *end is a pointer for rearm-end index
> > Of Rx buffer ring (producer index).
> >
> > 1. we look up different pmds, some pmds using 'uint_16t' as index size like
> intel PMD,
> > some pmds using 'uint32_t' as index size like MLX5 or thunderx PMD.
> > For pmd using 'uint32_t', rearm starts at 'buf_ring[offset & (ring_size -
> 1)]', and 'uint16_t'
> > is enough for ring size.
>
> Sounds like a smart idea to me.
When configuring an Ethernet device queue, the nb_rx/tx_desc parameter to rte_eth_rx/tx_queue_setup() is uint16_t, so I agree that uint16_t should suffice here too.
I had the following thought, but am not sure. So please take this comment for consideration only:
I think the "& (ring_size -1)" is superfluous, unless a PMD allows its index pointer to exceed the ring size, and performs the same "& (ring_size -1)" when using the index pointer to access its ring.
And if a PMD uses the index pointer like that (i.e. exceeding the ring size), you would need the same wrap protection for a 16 bit index pointer.
>
>
> >
> > 2. Good question. In general path, there is a constraint that 'nb_rearm <
> ring_size - rq->ci',
> > This can ensure no ring wrapping in rearm. Thus in direct-rearm, we will
> refer to this to
> > solve ring wrapping.
>
> Should work, I think...
> Just need not to forget to document it :)
It is this constraint (the guarantee that there is no ring wrapping in a rearm burst) that makes me think that the "& (ring_size -1)" is superfluous.
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-02-28 8:09 ` Morten Brørup
@ 2023-03-01 7:34 ` Feifei Wang
0 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-03-01 7:34 UTC (permalink / raw)
To: Morten Brørup, Konstantin Ananyev, Konstantin Ananyev,
thomas, Ferruh Yigit, Andrew Rybchenko
Cc: dev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd, nd
Hi, Morten
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Tuesday, February 28, 2023 4:09 PM
> To: Konstantin Ananyev <konstantin.ananyev@huawei.com>; Feifei
> Wang <Feifei.Wang2@arm.com>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>; thomas@monjalon.net; Ferruh Yigit
> <ferruh.yigit@amd.com>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
>
> Hi Feifei,
>
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Monday, 27 February 2023 20.32
> >
> > Hi Feifei ,
> >
> >
> > > > > + uint16_t *rearm_start;
> > > > > + uint16_t *rearm_nb;
> > > >
> > > > I know that for Intel NICs uint16_t is sufficient, wonder would it
> > > > always
> > be
> > > > for other vendors?
> > > > Another thing to consider the case when ring position wrapping?
> > > > Again I know that it is not required for Intel NICs, but would it
> > > > be
> > sufficient
> > > > for API that supposed to be general?
> > > >
> > > For this, we re-define this structure:
> > > rte_eth_rxq_rearm_data {
> > > void *rx_sw_ring;
> > > uint16_t *rearm_start;
> > > uint16_t *rearm_nb;
> > > }
> > > ->
> > > struct *rxq_recycle_info {
> > > rte_mbuf **buf_ring;
> > > uint16_t *offset = (uint16 *)(&rq->ci);
> > > uint16_t *end;
> > > uint16_t ring_size;
> > >
> > > }
> > > For the new structure, *offset is a pointer for rearm-start index of
> > > Rx buffer ring (consumer index). *end is a pointer for rearm-end
> > > index Of Rx buffer ring (producer index).
> > >
> > > 1. we look up different pmds, some pmds using 'uint_16t' as index
> > > size like
> > intel PMD,
> > > some pmds using 'uint32_t' as index size like MLX5 or thunderx PMD.
> > > For pmd using 'uint32_t', rearm starts at 'buf_ring[offset &
> > > (ring_size -
> > 1)]', and 'uint16_t'
> > > is enough for ring size.
> >
> > Sounds like a smart idea to me.
>
> When configuring an Ethernet device queue, the nb_rx/tx_desc parameter
> to rte_eth_rx/tx_queue_setup() is uint16_t, so I agree that uint16_t should
> suffice here too.
>
> I had the following thought, but am not sure. So please take this comment
> for consideration only:
>
> I think the "& (ring_size -1)" is superfluous, unless a PMD allows its index
> pointer to exceed the ring size, and performs the same "& (ring_size -1)"
> when using the index pointer to access its ring.
>
> And if a PMD uses the index pointer like that (i.e. exceeding the ring size),
> you would need the same wrap protection for a 16 bit index pointer.
>
> >
> >
> > >
> > > 2. Good question. In general path, there is a constraint that
> > > 'nb_rearm <
> > ring_size - rq->ci',
> > > This can ensure no ring wrapping in rearm. Thus in direct-rearm, we
> > > will
> > refer to this to
> > > solve ring wrapping.
> >
> > Should work, I think...
> > Just need not to forget to document it :)
>
> It is this constraint (the guarantee that there is no ring wrapping in a rearm
> burst) that makes me think that the "& (ring_size -1)" is superfluous.
Actually this is a bit misleading; let me explain.
The '& (ring_size - 1)' is for PMDs whose index can exceed the ring_size, such as MLX5:
uint16_t *offset = &mlx5_rxq->rq_ci, where rq_ci is an index that can exceed the ring_size.
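A tiny sketch of that case (names are placeholders; ring_size is assumed to be a power of
two, as in the PMDs discussed here):

/* rq_ci grows monotonically and may exceed ring_size, so the actual
 * ring slot is always recovered by masking.
 */
uint16_t slot = (uint16_t)(*offset & (ring_size - 1));
buf_ring[slot] = recycled_mbuf;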
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-01-04 8:21 ` Morten Brørup
2023-01-04 8:51 ` RE: " Feifei Wang
@ 2023-03-06 12:49 ` Ferruh Yigit
2023-03-06 13:26 ` Morten Brørup
1 sibling, 1 reply; 145+ messages in thread
From: Ferruh Yigit @ 2023-03-06 12:49 UTC (permalink / raw)
To: Morten Brørup, Feifei Wang, Thomas Monjalon,
Andrew Rybchenko, techboard
Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang
On 1/4/2023 8:21 AM, Morten Brørup wrote:
>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
>> Sent: Wednesday, 4 January 2023 08.31
>>
>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
>> mode for separate Rx and Tx Operation. And this can support different
>> multiple sources in direct rearm mode. For examples, Rx driver is
>> ixgbe,
>> and Tx driver is i40e.
>>
>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>> ---
>
> This feature looks very promising for performance. I am pleased to see progress on it.
>
Hi Morten,
Yes it brings some performance, but not to the generic use case, only to a
specific and constrained use case.
And the changes are relatively invasive compared to the use case they support;
they add two new inline datapath functions and a new dev_ops.
I am worried that the unnecessary complexity and possible regressions in the
fundamental and simple parts of the project, with the good intention of
gaining a few percent of performance in a specific use case, can hurt the
project.
I can see this is compared to the MBUF_FAST_FREE feature, but MBUF_FAST_FREE
is just an offload benefiting from the existing offload infrastructure;
it requires a very small update and logical change in the application and is
simple to implement in the drivers. So, they are not the same from a
complexity perspective.
Briefly, I am not comfortable with this change, and I would like to see an
explicit approval and code review from the techboard to proceed.
> Please confirm that the fast path functions are still thread safe, i.e. one EAL thread may be calling rte_eth_rx_burst() while another EAL thread is calling rte_eth_tx_burst().
>
> A few more comments below.
>
>> lib/ethdev/ethdev_driver.h | 10 ++
>> lib/ethdev/ethdev_private.c | 2 +
>> lib/ethdev/rte_ethdev.c | 52 +++++++++++
>> lib/ethdev/rte_ethdev.h | 174 +++++++++++++++++++++++++++++++++++
>> lib/ethdev/rte_ethdev_core.h | 11 +++
>> lib/ethdev/version.map | 6 ++
>> 6 files changed, 255 insertions(+)
>>
>> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
>> index 6a550cfc83..bc539ec862 100644
>> --- a/lib/ethdev/ethdev_driver.h
>> +++ b/lib/ethdev/ethdev_driver.h
>> @@ -59,6 +59,10 @@ struct rte_eth_dev {
>> eth_rx_descriptor_status_t rx_descriptor_status;
>> /** Check the status of a Tx descriptor */
>> eth_tx_descriptor_status_t tx_descriptor_status;
>> + /** Fill Rx sw-ring with Tx buffers in direct rearm mode */
>> + eth_tx_fill_sw_ring_t tx_fill_sw_ring;
>
> What is "Rx sw-ring"? Please confirm that this is not an Intel PMD specific term and/or implementation detail, e.g. by providing a conceptual implementation for a non-Intel PMD, e.g. mlx5.
>
> Please note: I do not request the ability to rearm between drivers from different vendors, I only request that the public ethdev API uses generic terms and concepts, so any NIC vendor can implement the direct-rearm functions in their PMDs.
>
>> + /** Flush Rx descriptor in direct rearm mode */
>> + eth_rx_flush_descriptor_t rx_flush_descriptor;
>
> descriptor -> descriptors. There are more than one. Both in comment and function name.
>
> [...]
>
>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
>> index c129ca1eaf..381c3d535f 100644
>> --- a/lib/ethdev/rte_ethdev.h
>> +++ b/lib/ethdev/rte_ethdev.h
>> @@ -1818,6 +1818,17 @@ struct rte_eth_txq_info {
>> uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
>> } __rte_cache_min_aligned;
>>
>> +/**
>> + * @internal
>> + * Structure used to hold pointers to internal ethdev Rx rearm data.
>> + * The main purpose is to load Rx queue rearm data in direct rearm
>> mode.
>> + */
>> +struct rte_eth_rxq_rearm_data {
>> + void *rx_sw_ring;
>> + uint16_t *rearm_start;
>> + uint16_t *rearm_nb;
>> +} __rte_cache_min_aligned;
>
> Please add descriptions to the fields in this structure.
>
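For reference, a commented sketch of the structure with the kind of field
descriptions requested above; the descriptions are inferred from how
i40e_tx_fill_sw_ring() uses the fields later in this thread and are
assumptions, not the patch's actual documentation:

        /* sketch only: possible per-field descriptions */
        struct rte_eth_rxq_rearm_data {
                /* base of the Rx queue software ring (array of mbuf pointers) */
                void *rx_sw_ring;
                /* pointer to the index in the sw ring where rearming starts */
                uint16_t *rearm_start;
                /* pointer to the number of descriptors waiting to be rearmed */
                uint16_t *rearm_nb;
        } __rte_cache_min_aligned;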
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-03-06 12:49 ` Ferruh Yigit
@ 2023-03-06 13:26 ` Morten Brørup
2023-03-06 14:53 ` Re: " Feifei Wang
2023-03-06 15:02 ` Ferruh Yigit
0 siblings, 2 replies; 145+ messages in thread
From: Morten Brørup @ 2023-03-06 13:26 UTC (permalink / raw)
To: Ferruh Yigit, Feifei Wang, Thomas Monjalon, Andrew Rybchenko, techboard
Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang
> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> Sent: Monday, 6 March 2023 13.49
>
> On 1/4/2023 8:21 AM, Morten Brørup wrote:
> >> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> >> Sent: Wednesday, 4 January 2023 08.31
> >>
> >> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
> >> mode for separate Rx and Tx Operation. And this can support different
> >> multiple sources in direct rearm mode. For examples, Rx driver is
> >> ixgbe,
> >> and Tx driver is i40e.
> >>
> >> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >> ---
> >
> > This feature looks very promising for performance. I am pleased to see
> progress on it.
> >
>
> Hi Morten,
>
> Yes it brings some performance, but not to generic use case, only to
> specific and constraint use case.
I got the impression that the supported use case is a prominent and important use case.
This is the primary argument for considering such a complex non-generic feature.
>
> And changes are relatively invasive comparing the usecase it supports,
> like it adds new two inline datapath functions and a new dev_ops.
>
> I am worried the unnecessary complexity and possible regressions in the
> fundamental and simple parts of the project, with a good intention to
> gain a few percentage performance in a specific usecase, can hurt the
> project.
>
>
> I can see this is compared to MBUF_FAST_FREE feature, but MBUF_FAST_FREE
> is just an offload benefiting from existing offload infrastructure,
> which requires very small update and logically change in application and
> simple to implement in the drivers. So, they are not same from
> complexity perspective.
>
> Briefly, I am not comfortable with this change, I would like to see an
> explicit approval and code review from techboard to proceed.
I agree that the complexity is very high, and thus requires extra consideration. Your suggested techboard review and approval process seems like a good solution.
And the performance benefit of direct rearm should be compared to the performance using the new zero-copy mempool API.
-Morten
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-03-06 13:26 ` Morten Brørup
@ 2023-03-06 14:53 ` Feifei Wang
2023-03-06 15:02 ` Ferruh Yigit
1 sibling, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-03-06 14:53 UTC (permalink / raw)
To: Morten Brørup, Ferruh Yigit, thomas, Andrew Rybchenko, techboard
Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang, nd
Hi, Morten, Ferruh
Thanks very much for your reviews.
Whether they are worries about direct rearm or comments on how to improve it, they push
us to learn more and think more. I think this exploration has been beneficial and a good achievement.
I will update to the latest version for the techboard code review. I still need
some time to run performance tests on the latest version,
so it will not be ready for this week's meeting; it will be done before the techboard
meeting in two weeks.
Best Regards
Feifei
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Monday, March 6, 2023 9:26 PM
> To: Ferruh Yigit <ferruh.yigit@amd.com>; Feifei Wang
> <Feifei.Wang2@arm.com>; thomas@monjalon.net; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; techboard@dpdk.org
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Subject: RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
>
> > From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> > Sent: Monday, 6 March 2023 13.49
> >
> > On 1/4/2023 8:21 AM, Morten Brørup wrote:
> > >> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > >> Sent: Wednesday, 4 January 2023 08.31
> > >>
> > >> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> > >> rearm mode for separate Rx and Tx Operation. And this can support
> > >> different multiple sources in direct rearm mode. For examples, Rx
> > >> driver is ixgbe, and Tx driver is i40e.
> > >>
> > >> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > >> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >> ---
> > >
> > > This feature looks very promising for performance. I am pleased to
> > > see
> > progress on it.
> > >
> >
> > Hi Morten,
> >
> > Yes it brings some performance, but not to generic use case, only to
> > specific and constraint use case.
>
> I got the impression that the supported use case is a prominent and
> important use case.
>
> This is the primary argument for considering such a complex non-generic
> feature.
>
> >
> > And changes are relatively invasive comparing the usecase it supports,
> > like it adds new two inline datapath functions and a new dev_ops.
> >
> > I am worried the unnecessary complexity and possible regressions in
> > the fundamental and simple parts of the project, with a good intention
> > to gain a few percentage performance in a specific usecase, can hurt
> > the project.
> >
> >
> > I can see this is compared to MBUF_FAST_FREE feature, but
> > MBUF_FAST_FREE is just an offload benefiting from existing offload
> > infrastructure, which requires very small update and logically change
> > in application and simple to implement in the drivers. So, they are
> > not same from complexity perspective.
> >
> > Briefly, I am not comfortable with this change, I would like to see an
> > explicit approval and code review from techboard to proceed.
>
> I agree that the complexity is very high, and thus requires extra consideration.
> Your suggested techboard review and approval process seems like a good
> solution.
>
> And the performance benefit of direct rearm should be compared to the
> performance using the new zero-copy mempool API.
>
> -Morten
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-03-06 13:26 ` Morten Brørup
2023-03-06 14:53 ` Re: " Feifei Wang
@ 2023-03-06 15:02 ` Ferruh Yigit
2023-03-07 6:12 ` Honnappa Nagarahalli
1 sibling, 1 reply; 145+ messages in thread
From: Ferruh Yigit @ 2023-03-06 15:02 UTC (permalink / raw)
To: Morten Brørup, Feifei Wang, Thomas Monjalon,
Andrew Rybchenko, techboard
Cc: dev, konstantin.v.ananyev, nd, Honnappa Nagarahalli, Ruifeng Wang
On 3/6/2023 1:26 PM, Morten Brørup wrote:
>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
>> Sent: Monday, 6 March 2023 13.49
>>
>> On 1/4/2023 8:21 AM, Morten Brørup wrote:
>>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
>>>> Sent: Wednesday, 4 January 2023 08.31
>>>>
>>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct rearm
>>>> mode for separate Rx and Tx Operation. And this can support different
>>>> multiple sources in direct rearm mode. For examples, Rx driver is
>>>> ixgbe,
>>>> and Tx driver is i40e.
>>>>
>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>> ---
>>>
>>> This feature looks very promising for performance. I am pleased to see
>> progress on it.
>>>
>>
>> Hi Morten,
>>
>> Yes it brings some performance, but not to generic use case, only to
>> specific and constraint use case.
>
> I got the impression that the supported use case is a prominent and important use case.
>
Can you please give real-life examples of this use case, other than just
showing better performance numbers on the test bench? This helps to
understand the reasoning better.
> This is the primary argument for considering such a complex non-generic feature.
>
>>
>> And changes are relatively invasive comparing the usecase it supports,
>> like it adds new two inline datapath functions and a new dev_ops.
>>
>> I am worried the unnecessary complexity and possible regressions in the
>> fundamental and simple parts of the project, with a good intention to
>> gain a few percentage performance in a specific usecase, can hurt the
>> project.
>>
>>
>> I can see this is compared to MBUF_FAST_FREE feature, but MBUF_FAST_FREE
>> is just an offload benefiting from existing offload infrastructure,
>> which requires very small update and logically change in application and
>> simple to implement in the drivers. So, they are not same from
>> complexity perspective.
>>
>> Briefly, I am not comfortable with this change, I would like to see an
>> explicit approval and code review from techboard to proceed.
>
> I agree that the complexity is very high, and thus requires extra consideration. Your suggested techboard review and approval process seems like a good solution.
>
> And the performance benefit of direct rearm should be compared to the performance using the new zero-copy mempool API.
>
> -Morten
>
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-03-06 15:02 ` Ferruh Yigit
@ 2023-03-07 6:12 ` Honnappa Nagarahalli
2023-03-07 10:52 ` Konstantin Ananyev
2023-03-07 20:41 ` Ferruh Yigit
0 siblings, 2 replies; 145+ messages in thread
From: Honnappa Nagarahalli @ 2023-03-07 6:12 UTC (permalink / raw)
To: Ferruh Yigit, Morten Brørup, Feifei Wang, thomas,
Andrew Rybchenko, techboard
Cc: dev, konstantin.v.ananyev, nd, Ruifeng Wang, nd
<snip>
>
> On 3/6/2023 1:26 PM, Morten Brørup wrote:
> >> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> >> Sent: Monday, 6 March 2023 13.49
> >>
> >> On 1/4/2023 8:21 AM, Morten Brørup wrote:
> >>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> >>>> Sent: Wednesday, 4 January 2023 08.31
> >>>>
> >>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> >>>> rearm mode for separate Rx and Tx Operation. And this can support
> >>>> different multiple sources in direct rearm mode. For examples, Rx
> >>>> driver is ixgbe, and Tx driver is i40e.
> >>>>
> >>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >>>> ---
> >>>
> >>> This feature looks very promising for performance. I am pleased to
> >>> see
> >> progress on it.
> >>>
> >>
> >> Hi Morten,
> >>
> >> Yes it brings some performance, but not to generic use case, only to
> >> specific and constraint use case.
> >
> > I got the impression that the supported use case is a prominent and important
> use case.
> >
>
> Can you please give real life samples for this use case, other than just showing
> better performance number in the test bench? This helps to understand the
> reasoning better.
The very first patch started off with a constrained but prominent use case. DPU-based PCIe cards running DPDK applications, with 1 or at most 2 ports in use, are deployed in tons of data centers; that is not a secret anymore and not a small use case that can be ignored.
However, the design of the patch has changed significantly since then. Now the solution can be applied to any generic use case that uses the run-to-completion model of DPDK, i.e. the mapping of the Rx and Tx ports can be done dynamically in the data plane threads. There is no need for static configuration from the control plane.
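As a rough sketch of what this dynamic, data-plane mapping could look like in a
run-to-completion loop (the API names are taken from the v3 cover letter quoted
later in this thread; the exact signatures, BURST_SIZE, and the already
initialised rx_port/rx_queue/tx_port/tx_queue variables are assumptions for
illustration only, assuming the usual rte_ethdev.h / rte_mbuf.h includes):

        #define BURST_SIZE 32

        struct rte_eth_rxq_rearm_data rearm_data;
        struct rte_mbuf *pkts[BURST_SIZE];
        uint16_t nb_rx;

        /* fetch the Rx queue's rearm data once, control-plane style */
        rte_eth_rx_queue_rearm_data_get(rx_port, rx_queue, &rearm_data); /* assumed signature */

        /* per-lcore run-to-completion loop: the Tx queue that refills the
         * Rx software ring is chosen here, in the data plane
         */
        for (;;) {
                /* free Tx buffers straight into the Rx software ring and
                 * flush the corresponding Rx descriptors
                 */
                rte_eth_dev_direct_rearm(rx_port, rx_queue, tx_port, tx_queue,
                                &rearm_data); /* assumed signature */

                nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, BURST_SIZE);
                /* ... application processing, which may pick tx_port/tx_queue ... */
                rte_eth_tx_burst(tx_port, tx_queue, pkts, nb_rx);
        }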
On the test bench argument, we need to make up our minds. When we see improvements, we say it is just a test bench. On other occasions, when the test bench does not show any improvements (but improvements are shown by other metrics), we say the test bench does not show any improvements.
>
> > This is the primary argument for considering such a complex non-generic
> feature.
I am not sure what the complexity is here; can you please elaborate?
I see other patches/designs (e.g. proactive error recovery) which are far more complex to understand and comprehend.
> >
> >>
> >> And changes are relatively invasive comparing the usecase it
> >> supports, like it adds new two inline datapath functions and a new dev_ops.
> >>
> >> I am worried the unnecessary complexity and possible regressions in
> >> the fundamental and simple parts of the project, with a good
> >> intention to gain a few percentage performance in a specific usecase,
> >> can hurt the project.
I agree that we are touching some fundamental parts of the project. But we also need to realize that those fundamental parts were developed before some of the architectures joined the project, without them in mind. Similarly, the use cases have evolved significantly from the originally intended ones. We cannot hold on to those fundamental designs if they hurt performance on other architectures while prominent new use cases are being addressed.
Please note that this patch does not break any existing feature or affect its performance in any negative way. The generic and originally intended use cases can also benefit from this feature.
> >>
> >>
> >> I can see this is compared to MBUF_FAST_FREE feature, but
> >> MBUF_FAST_FREE is just an offload benefiting from existing offload
> >> infrastructure, which requires very small update and logically change
> >> in application and simple to implement in the drivers. So, they are
> >> not same from complexity perspective.
> >>
> >> Briefly, I am not comfortable with this change, I would like to see
> >> an explicit approval and code review from techboard to proceed.
> >
> > I agree that the complexity is very high, and thus requires extra consideration.
> Your suggested techboard review and approval process seems like a good
> solution.
We can add it to the agenda for the next Techboard meeting.
> >
> > And the performance benefit of direct rearm should be compared to the
> performance using the new zero-copy mempool API.
> >
> > -Morten
> >
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-03-07 6:12 ` Honnappa Nagarahalli
@ 2023-03-07 10:52 ` Konstantin Ananyev
2023-03-07 20:41 ` Ferruh Yigit
1 sibling, 0 replies; 145+ messages in thread
From: Konstantin Ananyev @ 2023-03-07 10:52 UTC (permalink / raw)
To: Honnappa Nagarahalli, Ferruh Yigit, Morten Brørup,
Feifei Wang, thomas, Andrew Rybchenko, techboard
Cc: dev, konstantin.v.ananyev, nd, Ruifeng Wang, nd
> > On 3/6/2023 1:26 PM, Morten Brørup wrote:
> > >> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> > >> Sent: Monday, 6 March 2023 13.49
> > >>
> > >> On 1/4/2023 8:21 AM, Morten Brørup wrote:
> > >>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > >>>> Sent: Wednesday, 4 January 2023 08.31
> > >>>>
> > >>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> > >>>> rearm mode for separate Rx and Tx Operation. And this can support
> > >>>> different multiple sources in direct rearm mode. For examples, Rx
> > >>>> driver is ixgbe, and Tx driver is i40e.
> > >>>>
> > >>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > >>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >>>> ---
> > >>>
> > >>> This feature looks very promising for performance. I am pleased to
> > >>> see
> > >> progress on it.
> > >>>
> > >>
> > >> Hi Morten,
> > >>
> > >> Yes it brings some performance, but not to generic use case, only to
> > >> specific and constraint use case.
> > >
> > > I got the impression that the supported use case is a prominent and important
> > use case.
> > >
> >
> > Can you please give real life samples for this use case, other than just showing
> > better performance number in the test bench? This helps to understand the
> > reasoning better.
> The very first patch started off with a constrained but prominent use case. Though, DPU based PCIe cards running DPDK applications
> with 1 or max 2 ports being used in tons of data centers is not a secret anymore and not a small use case that can be ignored.
> However, the design of the patch has changed significantly from then. Now the solution can be applied to any generic use case that
> uses run-to-completion model of DPDK. i.e. the mapping of the RX and TX ports can be done dynamically in the data plane threads.
> There is no need of static configuration from control plane.
+1 to this statement.
I think the authors did a good job of making it generic enough that it can be used for many different cases,
plus, AFAIU, it doesn't introduce new implicit restrictions for the user.
Again, this feature is totally optional, so users are free to ignore it.
Personally, I do not see any good reason why we shouldn't accept this feature into DPDK.
Of course, with more code reviews, extra testing, docs updates, etc.
>
> On the test bench, we need to make up our mind. When we see improvements, we say it is just a test bench. On other occasions
> when the test bench does not show any improvements (but improvements are shown by other metrics), we say the test bench does
> not show any improvements.
>
> >
> > > This is the primary argument for considering such a complex non-generic
> > feature.
> I am not sure what is the complexity here, can you please elaborate?
> I see other patches/designs (ex: proactive error recovery) which are way more complex to understand and comprehend.
>
> > >
> > >>
> > >> And changes are relatively invasive comparing the usecase it
> > >> supports, like it adds new two inline datapath functions and a new dev_ops.
> > >>
> > >> I am worried the unnecessary complexity and possible regressions in
> > >> the fundamental and simple parts of the project, with a good
> > >> intention to gain a few percentage performance in a specific usecase,
> > >> can hurt the project.
> I agree that we are touching some fundamental parts of the project. But, we also need to realize that those fundamental parts were
> not developed on architectures that have joined the project way later. Similarly, the use cases have evolved significantly from the
> original intended use cases. We cannot hold on to those fundamental designs if they affect the performance on other architectures
> while addressing prominent new use cases.
> Please note that this patch does not break any existing features or affect their performance in any negative way. The generic and
> originally intended use cases can benefit from this feature.
>
> > >>
> > >>
> > >> I can see this is compared to MBUF_FAST_FREE feature, but
> > >> MBUF_FAST_FREE is just an offload benefiting from existing offload
> > >> infrastructure, which requires very small update and logically change
> > >> in application and simple to implement in the drivers. So, they are
> > >> not same from complexity perspective.
> > >>
> > >> Briefly, I am not comfortable with this change, I would like to see
> > >> an explicit approval and code review from techboard to proceed.
> > >
> > > I agree that the complexity is very high, and thus requires extra consideration.
> > Your suggested techboard review and approval process seems like a good
> > solution.
> We can add to the agenda for the next Techboard meeting.
>
> > >
> > > And the performance benefit of direct rearm should be compared to the
> > performance using the new zero-copy mempool API.
> > >
> > > -Morten
> > >
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
2023-02-28 2:15 ` Re: " Feifei Wang
@ 2023-03-07 11:01 ` Konstantin Ananyev
2023-03-14 6:07 ` Re: " Feifei Wang
0 siblings, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2023-03-07 11:01 UTC (permalink / raw)
To: Feifei Wang, Konstantin Ananyev, Yuying Zhang, Beilei Xing, Ruifeng Wang
Cc: dev, nd, Honnappa Nagarahalli, nd, nd
> > -----Original Message-----
> > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > Sent: Tuesday, February 28, 2023 3:36 AM
> > To: Feifei Wang <Feifei.Wang2@arm.com>; Konstantin Ananyev
> > <konstantin.v.ananyev@yandex.ru>; Yuying Zhang
> > <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>; Ruifeng
> > Wang <Ruifeng.Wang@arm.com>
> > Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > Subject: RE: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
> >
> >
> >
> > > > > +int
> > > > > +i40e_tx_fill_sw_ring(void *tx_queue,
> > > > > + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > > > > + struct i40e_tx_queue *txq = tx_queue;
> > > > > + struct i40e_tx_entry *txep;
> > > > > + void **rxep;
> > > > > + struct rte_mbuf *m;
> > > > > + int i, n;
> > > > > + int nb_rearm = 0;
> > > > > +
> > > > > + if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
> > > > > + txq->nb_tx_free > txq->tx_free_thresh)
> > > > > + return 0;
> > > > > +
> > > > > + /* check DD bits on threshold descriptor */
> > > > > + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> > > > > + rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> > > > > +
> > > > rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> > > > > + return 0;
> > > > > +
> > > > > + n = txq->tx_rs_thresh;
> > > > > +
> > > > > + /* first buffer to free from S/W ring is at index
> > > > > + * tx_next_dd - (tx_rs_thresh-1)
> > > > > + */
> > > > > + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> > > > > + rxep = rxq_rearm_data->rx_sw_ring;
> > > > > + rxep += *rxq_rearm_data->rearm_start;
> > > > > +
> > > > > + if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > > > > + /* directly put mbufs from Tx to Rx */
> > > > > + for (i = 0; i < n; i++, rxep++, txep++)
> > > > > + *rxep = txep[0].mbuf;
> > > > > + } else {
> > > > > + for (i = 0; i < n; i++, rxep++) {
> > > > > + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> >
> > One thing I forgot to ask:
> > What would happen if this mbuf belongs to different mempool (not one that
> > we specify at rx_queue_setup())?
> > Do we need to check it here?
> > Or would it be upper layer constraint?
> > Or...?
> >
>
> First, 'different mempool' is valid for no FAST_FREE path in tx_free_buffers.
>
> If buffers belong to different mempool, we can have an example here:
> Buffer 1 from mempool 1, its recycle path is:
> -----------------------------------------------------------------------------------------
> 1. queue_setup: rearm from mempool 1 into Rx sw-ring
> 2. rte_eth_Rx_burst: used by user app (Rx)
> 3. rte_eth_Tx_burst: mount on Tx sw-ring
> 4. rte_eth_direct_rearm: free into Rx sw-ring:
> or
> tx_free_buffers: free into mempool 1 (no fast_free path)
> -----------------------------------------------------------------------------------------
>
> Buffer 2 from mempool 2, its recycle path is:
> -----------------------------------------------------------------------------------------
> 1. queue_setup: rearm from mempool 2 into Rx sw-ring
> 2. rte_eth_Rx_burst: used by user app (Rx)
> 3. rte_eth_Tx_burst: mount on Tx sw-ring
> 4. rte_eth_direct_rearm: free into Rx sw-ring
> or
> tx_free_buffers: free into mempool 2 (no fast_free_path)
> -----------------------------------------------------------------------------------------
>
> Thus, buffers from Tx different mempools are the same for Rx. The difference point
> is that they will be freed into different mempool if the thread uses generic free buffers.
> I think this cannot affect direct-rearm mode, and we do not need to check this.
I understand that it should work even with multiple mempools.
What I am trying to say is that the user may not want to use mbufs from a particular mempool for Rx
(while it is still ok to use them for Tx).
Let's say the user has a separate mempool with small data buffers (smaller than the normal MTU)
to send some 'special' packets, or even uses this mempool with small buffers for zero-copy
updating of packet L2/L3 headers, etc.
Or it could be some 'special' user-provided mempool.
That's why I wonder whether we should allow only mbufs from the mempool that is assigned to that Rx queue.
>
> > > > > + if (m != NULL) {
> > > > > + *rxep = m;
> > > > > + nb_rearm++;
> > > > > + }
> > > > > + }
> > > > > + n = nb_rearm;
> > > > > + }
> > > > > +
> > > > > + /* update counters for Tx */
> > > > > + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
> > > > > + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
> > > > > + if (txq->tx_next_dd >= txq->nb_tx_desc)
> > > > > + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> > > > > +
> > > > > + return n;
> > > > > +}
> > > > > +
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-03-07 6:12 ` Honnappa Nagarahalli
2023-03-07 10:52 ` Konstantin Ananyev
@ 2023-03-07 20:41 ` Ferruh Yigit
2023-03-22 14:43 ` Honnappa Nagarahalli
1 sibling, 1 reply; 145+ messages in thread
From: Ferruh Yigit @ 2023-03-07 20:41 UTC (permalink / raw)
To: Honnappa Nagarahalli, Morten Brørup, Feifei Wang, thomas,
Andrew Rybchenko, techboard
Cc: dev, konstantin.v.ananyev, nd, Ruifeng Wang
On 3/7/2023 6:12 AM, Honnappa Nagarahalli wrote:
> <snip>
>
>>
>> On 3/6/2023 1:26 PM, Morten Brørup wrote:
>>>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
>>>> Sent: Monday, 6 March 2023 13.49
>>>>
>>>> On 1/4/2023 8:21 AM, Morten Brørup wrote:
>>>>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
>>>>>> Sent: Wednesday, 4 January 2023 08.31
>>>>>>
>>>>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
>>>>>> rearm mode for separate Rx and Tx Operation. And this can support
>>>>>> different multiple sources in direct rearm mode. For examples, Rx
>>>>>> driver is ixgbe, and Tx driver is i40e.
>>>>>>
>>>>>> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
>>>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>>>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>>>>>> ---
>>>>>
>>>>> This feature looks very promising for performance. I am pleased to
>>>>> see
>>>> progress on it.
>>>>>
>>>>
>>>> Hi Morten,
>>>>
>>>> Yes it brings some performance, but not to generic use case, only to
>>>> specific and constraint use case.
>>>
>>> I got the impression that the supported use case is a prominent and important
>> use case.
>>>
>>
>> Can you please give real life samples for this use case, other than just showing
>> better performance number in the test bench? This helps to understand the
>> reasoning better.
> The very first patch started off with a constrained but prominent use case. Though, DPU based PCIe cards running DPDK applications with 1 or max 2 ports being used in tons of data centers is not a secret anymore and not a small use case that can be ignored.
> However, the design of the patch has changed significantly from then. Now the solution can be applied to any generic use case that uses run-to-completion model of DPDK. i.e. the mapping of the RX and TX ports can be done dynamically in the data plane threads. There is no need of static configuration from control plane.
>
> On the test bench, we need to make up our mind. When we see improvements, we say it is just a test bench. On other occasions when the test bench does not show any improvements (but improvements are shown by other metrics), we say the test bench does not show any improvements.
>
>>
>>> This is the primary argument for considering such a complex non-generic
>> feature.
> I am not sure what is the complexity here, can you please elaborate?
I am considering this from the user perspective.
OK, DPDK is already low level, but ethdev has only a handful of datapath
APIs (6 of them), and the main ones are easy to comprehend:
rte_eth_rx_burst(port_id, queue_id, rx_pkts, nb_pkts);
rte_eth_tx_burst(port_id, queue_id, tx_pkts, nb_pkts);
They (magically) receive/transmit buffers; easy to grasp.
Maybe rte_eth_tx_prepare() is a little less obvious (why/when to use
it), but I still believe it is simple.
Whoever looks at these APIs can figure out how to use them in the application.
The other three are related to descriptors, and I am not sure about
their use case; I assume they are mostly good for debugging.
But now we are adding new datapath APIs:
rte_eth_tx_fill_sw_ring(port_id, queue_id, rxq_rearm_data);
rte_eth_rx_flush_descriptor(port_id, queue_id, nb_rearm);
When you talk about the SW ring and re-arming descriptors, I believe you will
already lose most of the users; driver developers will know what it is,
you will know what it is, but people who are not close to the Ethernet
HW won't.
And these APIs will be very visible, not like one of many control plane
dev_ops. So this can confuse users who are not familiar with the details.
Usage of these APIs comes with restrictions; it is possible that some
percentage of users will miss these restrictions or misunderstand them
and will have issues.
Or many may be intimidated by them and stay away from using these APIs,
leaving them as a burden to maintain, test and fix. That is why I
think a real-life use case is needed; in that case at least we will know
some consumers will fix them, or let us know, when they get broken.
It may be possible to hide the details in the driver, with the user only
setting an offload flag, similar to FAST_FREE, but in that case the feature
will lose flexibility and become even more specific, perhaps making it less
useful.
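Purely for illustration, a minimal sketch of what such an offload-flag style
configuration could look like; the flag name below is hypothetical and does
not exist in DPDK, and the pairing behaviour described in the comment is an
assumption:

        /* hypothetical flag, for illustration only - not a real DPDK offload */
        #define RTE_ETH_TX_OFFLOAD_DIRECT_REARM_HYPOTHETICAL RTE_BIT64(63)

        struct rte_eth_txconf txconf = dev_info.default_txconf;

        txconf.offloads |= RTE_ETH_TX_OFFLOAD_DIRECT_REARM_HYPOTHETICAL;
        /* with such a flag the PMD would pair this Tx queue with an Rx queue
         * internally, and the application would keep using only
         * rte_eth_rx_burst()/rte_eth_tx_burst() as before
         */
        ret = rte_eth_tx_queue_setup(port_id, tx_queue_id, nb_txd,
                        rte_eth_dev_socket_id(port_id), &txconf);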
> I see other patches/designs (ex: proactive error recovery) which are way more complex to understand and comprehend.
>
>>>
>>>>
>>>> And changes are relatively invasive comparing the usecase it
>>>> supports, like it adds new two inline datapath functions and a new dev_ops.
>>>>
>>>> I am worried the unnecessary complexity and possible regressions in
>>>> the fundamental and simple parts of the project, with a good
>>>> intention to gain a few percentage performance in a specific usecase,
>>>> can hurt the project.
> I agree that we are touching some fundamental parts of the project. But, we also need to realize that those fundamental parts were not developed on architectures that have joined the project way later. Similarly, the use cases have evolved significantly from the original intended use cases. We cannot hold on to those fundamental designs if they affect the performance on other architectures while addressing prominent new use cases.
> Please note that this patch does not break any existing features or affect their performance in any negative way. The generic and originally intended use cases can benefit from this feature.
>
>>>>
>>>>
>>>> I can see this is compared to MBUF_FAST_FREE feature, but
>>>> MBUF_FAST_FREE is just an offload benefiting from existing offload
>>>> infrastructure, which requires very small update and logically change
>>>> in application and simple to implement in the drivers. So, they are
>>>> not same from complexity perspective.
>>>>
>>>> Briefly, I am not comfortable with this change, I would like to see
>>>> an explicit approval and code review from techboard to proceed.
>>>
>>> I agree that the complexity is very high, and thus requires extra consideration.
>> Your suggested techboard review and approval process seems like a good
>> solution.
> We can add to the agenda for the next Techboard meeting.
>
>>>
>>> And the performance benefit of direct rearm should be compared to the
>> performance using the new zero-copy mempool API.
>>>
>>> -Morten
>>>
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
2023-03-07 11:01 ` Konstantin Ananyev
@ 2023-03-14 6:07 ` Feifei Wang
2023-03-19 16:11 ` Konstantin Ananyev
0 siblings, 1 reply; 145+ messages in thread
From: Feifei Wang @ 2023-03-14 6:07 UTC (permalink / raw)
To: Konstantin Ananyev, Konstantin Ananyev, Yuying Zhang,
Beilei Xing, Ruifeng Wang
Cc: dev, nd, Honnappa Nagarahalli, nd, nd, nd
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Tuesday, March 7, 2023 7:01 PM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>; Yuying Zhang
> <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>; Ruifeng
> Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
>
>
>
> > > -----Original Message-----
> > > From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > > Sent: Tuesday, February 28, 2023 3:36 AM
> > > To: Feifei Wang <Feifei.Wang2@arm.com>; Konstantin Ananyev
> > > <konstantin.v.ananyev@yandex.ru>; Yuying Zhang
> > > <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>;
> > > Ruifeng Wang <Ruifeng.Wang@arm.com>
> > > Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > > Subject: RE: [PATCH v3 2/3] net/i40e: enable direct rearm with separate
> > > API
> > >
> > >
> > >
> > > > > > +int
> > > > > > +i40e_tx_fill_sw_ring(void *tx_queue,
> > > > > > + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> > > > > > + struct i40e_tx_queue *txq = tx_queue;
> > > > > > + struct i40e_tx_entry *txep;
> > > > > > + void **rxep;
> > > > > > + struct rte_mbuf *m;
> > > > > > + int i, n;
> > > > > > + int nb_rearm = 0;
> > > > > > +
> > > > > > + if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
> > > > > > + txq->nb_tx_free > txq->tx_free_thresh)
> > > > > > + return 0;
> > > > > > +
> > > > > > + /* check DD bits on threshold descriptor */
> > > > > > + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> > > > > > +
> rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> > > > > > +
> > > > > rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> > > > > > + return 0;
> > > > > > +
> > > > > > + n = txq->tx_rs_thresh;
> > > > > > +
> > > > > > + /* first buffer to free from S/W ring is at index
> > > > > > + * tx_next_dd - (tx_rs_thresh-1)
> > > > > > + */
> > > > > > + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> > > > > > + rxep = rxq_rearm_data->rx_sw_ring;
> > > > > > + rxep += *rxq_rearm_data->rearm_start;
> > > > > > +
> > > > > > + if (txq->offloads &
> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > > > > > + /* directly put mbufs from Tx to Rx */
> > > > > > + for (i = 0; i < n; i++, rxep++, txep++)
> > > > > > + *rxep = txep[0].mbuf;
> > > > > > + } else {
> > > > > > + for (i = 0; i < n; i++, rxep++) {
> > > > > > + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > >
> > > One thing I forgot to ask:
> > > What would happen if this mbuf belongs to different mempool (not one
> > > that we specify at rx_queue_setup())?
> > > Do we need to check it here?
> > > Or would it be upper layer constraint?
> > > Or...?
> > >
> >
> > First, 'different mempool' is valid for no FAST_FREE path in tx_free_buffers.
> >
> > If buffers belong to different mempool, we can have an example here:
> > Buffer 1 from mempool 1, its recycle path is:
> > ----------------------------------------------------------------------
> > ------------------- 1. queue_setup: rearm from mempool 1 into Rx
> > sw-ring 2. rte_eth_Rx_burst: used by user app (Rx) 3.
> > rte_eth_Tx_burst: mount on Tx sw-ring 4. rte_eth_direct_rearm: free
> > into Rx sw-ring:
> > or
> > tx_free_buffers: free into mempool 1 (no fast_free path)
> > ----------------------------------------------------------------------
> > -------------------
> >
> > Buffer 2 from mempool 2, its recycle path is:
> > ----------------------------------------------------------------------
> > ------------------- 1. queue_setup: rearm from mempool 2 into Rx
> > sw-ring 2. rte_eth_Rx_burst: used by user app (Rx) 3.
> > rte_eth_Tx_burst: mount on Tx sw-ring 4. rte_eth_direct_rearm: free
> > into Rx sw-ring
> > or
> > tx_free_buffers: free into mempool 2 (no fast_free_path)
> > ----------------------------------------------------------------------
> > -------------------
> >
> > Thus, buffers from Tx different mempools are the same for Rx. The
> > difference point is that they will be freed into different mempool if the
> thread uses generic free buffers.
> > I think this cannot affect direct-rearm mode, and we do not need to check
> this.
>
> I understand that it should work even with multiple mempools.
> What I am trying to say - user may not want to use mbufs from particular
> mempool for RX (while it is still ok to use it for TX).
> Let say user can have a separate mempool with small data-buffers (less then
> normal MTU) to send some 'special' paclets, or even use this memppol with
> small buffers for zero-copy updating of packet L2/L3 headers, etc.
> Or it could be some 'special' user provided mempool.
> That's why I wonder should we allow only mbufs from mempool that is
> assigned to that RX queue.
Sorry for being misleading. If I understand correctly this time, you mean a special
mempool; maybe its buffer size is very small and the Tx buffer is generated by the control plane.
However, if we recycle such a Tx buffer into the Rx buffer ring, there may be errors because its
size is too small.
Thus we can only allow general buffers which are valid for the Rx buffer ring. Furthermore, it should be the
user's responsibility to ensure that the recycled Tx buffers are valid; if we check this in the data plane,
it will cost a lot of CPU cycles. In the end, what we can do is add a constraint to the notes to remind users, for example as sketched below.
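A minimal sketch of how such a constraint note could read in the API
documentation; the wording is an assumption, not text from the actual patch:

        /**
         * @note Direct-rearm mode: it is the application's responsibility to
         *       ensure that every mbuf recycled from the Tx queue into the Rx
         *       software ring is a general-purpose mbuf that is valid for that
         *       Rx queue (in particular, with a data room at least as large as
         *       the Rx buffer size the queue was configured with). The data
         *       plane does not validate this.
         */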
>
> >
> > > > > > + if (m != NULL) {
> > > > > > + *rxep = m;
> > > > > > + nb_rearm++;
> > > > > > + }
> > > > > > + }
> > > > > > + n = nb_rearm;
> > > > > > + }
> > > > > > +
> > > > > > + /* update counters for Tx */
> > > > > > + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq-
> >tx_rs_thresh);
> > > > > > + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq-
> >tx_rs_thresh);
> > > > > > + if (txq->tx_next_dd >= txq->nb_tx_desc)
> > > > > > + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> > > > > > +
> > > > > > + return n;
> > > > > > +}
> > > > > > +
^ permalink raw reply [flat|nested] 145+ messages in thread
* Re: Re: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
2023-03-14 6:07 ` Re: " Feifei Wang
@ 2023-03-19 16:11 ` Konstantin Ananyev
2023-03-23 10:49 ` Feifei Wang
0 siblings, 1 reply; 145+ messages in thread
From: Konstantin Ananyev @ 2023-03-19 16:11 UTC (permalink / raw)
To: Feifei Wang, Konstantin Ananyev, Yuying Zhang, Beilei Xing, Ruifeng Wang
Cc: dev, nd, Honnappa Nagarahalli
>>>>>>> +int
>>>>>>> +i40e_tx_fill_sw_ring(void *tx_queue,
>>>>>>> + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
>>>>>>> + struct i40e_tx_queue *txq = tx_queue;
>>>>>>> + struct i40e_tx_entry *txep;
>>>>>>> + void **rxep;
>>>>>>> + struct rte_mbuf *m;
>>>>>>> + int i, n;
>>>>>>> + int nb_rearm = 0;
>>>>>>> +
>>>>>>> + if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
>>>>>>> + txq->nb_tx_free > txq->tx_free_thresh)
>>>>>>> + return 0;
>>>>>>> +
>>>>>>> + /* check DD bits on threshold descriptor */
>>>>>>> + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
>>>>>>> +
>> rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
>>>>>>> +
>>>>>> rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
>>>>>>> + return 0;
>>>>>>> +
>>>>>>> + n = txq->tx_rs_thresh;
>>>>>>> +
>>>>>>> + /* first buffer to free from S/W ring is at index
>>>>>>> + * tx_next_dd - (tx_rs_thresh-1)
>>>>>>> + */
>>>>>>> + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
>>>>>>> + rxep = rxq_rearm_data->rx_sw_ring;
>>>>>>> + rxep += *rxq_rearm_data->rearm_start;
>>>>>>> +
>>>>>>> + if (txq->offloads &
>> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
>>>>>>> + /* directly put mbufs from Tx to Rx */
>>>>>>> + for (i = 0; i < n; i++, rxep++, txep++)
>>>>>>> + *rxep = txep[0].mbuf;
>>>>>>> + } else {
>>>>>>> + for (i = 0; i < n; i++, rxep++) {
>>>>>>> + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
>>>>
>>>> One thing I forgot to ask:
>>>> What would happen if this mbuf belongs to different mempool (not one
>>>> that we specify at rx_queue_setup())?
>>>> Do we need to check it here?
>>>> Or would it be upper layer constraint?
>>>> Or...?
>>>>
>>>
>>> First, 'different mempool' is valid for no FAST_FREE path in tx_free_buffers.
>>>
>>> If buffers belong to different mempool, we can have an example here:
>>> Buffer 1 from mempool 1, its recycle path is:
>>> ----------------------------------------------------------------------
>>> ------------------- 1. queue_setup: rearm from mempool 1 into Rx
>>> sw-ring 2. rte_eth_Rx_burst: used by user app (Rx) 3.
>>> rte_eth_Tx_burst: mount on Tx sw-ring 4. rte_eth_direct_rearm: free
>>> into Rx sw-ring:
>>> or
>>> tx_free_buffers: free into mempool 1 (no fast_free path)
>>> ----------------------------------------------------------------------
>>> -------------------
>>>
>>> Buffer 2 from mempool 2, its recycle path is:
>>> ----------------------------------------------------------------------
>>> ------------------- 1. queue_setup: rearm from mempool 2 into Rx
>>> sw-ring 2. rte_eth_Rx_burst: used by user app (Rx) 3.
>>> rte_eth_Tx_burst: mount on Tx sw-ring 4. rte_eth_direct_rearm: free
>>> into Rx sw-ring
>>> or
>>> tx_free_buffers: free into mempool 2 (no fast_free_path)
>>> ----------------------------------------------------------------------
>>> -------------------
>>>
>>> Thus, buffers from Tx different mempools are the same for Rx. The
>>> difference point is that they will be freed into different mempool if the
>> thread uses generic free buffers.
>>> I think this cannot affect direct-rearm mode, and we do not need to check
>> this.
>>
>> I understand that it should work even with multiple mempools.
>> What I am trying to say - user may not want to use mbufs from particular
>> mempool for RX (while it is still ok to use it for TX).
>> Let say user can have a separate mempool with small data-buffers (less then
>> normal MTU) to send some 'special' paclets, or even use this memppol with
>> small buffers for zero-copy updating of packet L2/L3 headers, etc.
>> Or it could be some 'special' user provided mempool.
>> That's why I wonder should we allow only mbufs from mempool that is
>> assigned to that RX queue.
>
> Sorry for my misleading. If I understand correctly this time, you means a special
> mempool. Maybe its buffer size is very small and this Tx buffer is generated from control plane.
>
> However, if we recycle this Tx buffer into Rx buffer ring, there maybe some error due to its
> size is so small.
>
> Thus we can only allow general buffers which is valid for Rx buffer ring. Furthermore, this should be
> user's responsibility to ensure the Tx recycling buffers should be valid. If we check this in the data plane,
> it will cost a lot of CPU cycles. At last, what we can do is to add constraint in the notes to remind users.
As I thought: in theory we can add 'struct rte_mempool *mp'
into rte_eth_rxq_rearm_data.
And then:
if (mbuf->pool == rxq_rearm_data->mp)
/* put mbuf into rearm buffer */
else
/* free mbuf */
For the 'proper' config (when the txq contains mbufs from the expected mempool)
the overhead will be minimal.
In other cases it might be higher, but it would still work, with no need for
extra limitations.
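A minimal sketch of how that check could fit into the non-FAST_FREE loop
quoted above, assuming 'mp' is the suggested (not yet existing) mempool
field of rte_eth_rxq_rearm_data:

        /* sketch: recycle only mbufs that come from the Rx queue's mempool,
         * fall back to a normal free for mbufs from any other mempool
         */
        for (i = 0; i < n; i++) {
                m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
                if (m == NULL)
                        continue;
                if (m->pool == rxq_rearm_data->mp)
                        rxep[nb_rearm++] = m;           /* put mbuf into rearm buffer */
                else
                        rte_mempool_put(m->pool, m);    /* free mbuf */
        }
        n = nb_rearm;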
>>
>>>
>>>>>>> + if (m != NULL) {
>>>>>>> + *rxep = m;
>>>>>>> + nb_rearm++;
>>>>>>> + }
>>>>>>> + }
>>>>>>> + n = nb_rearm;
>>>>>>> + }
>>>>>>> +
>>>>>>> + /* update counters for Tx */
>>>>>>> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq-
>>> tx_rs_thresh);
>>>>>>> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq-
>>> tx_rs_thresh);
>>>>>>> + if (txq->tx_next_dd >= txq->nb_tx_desc)
>>>>>>> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
>>>>>>> +
>>>>>>> + return n;
>>>>>>> +}
>>>>>>> +
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
2023-01-04 7:30 ` [PATCH v3 0/3] " Feifei Wang
` (3 preceding siblings ...)
2023-01-31 6:13 ` Re: [PATCH v3 0/3] Direct re-arming of buffers on receive side Feifei Wang
@ 2023-03-22 12:56 ` Morten Brørup
2023-03-22 13:41 ` Honnappa Nagarahalli
4 siblings, 1 reply; 145+ messages in thread
From: Morten Brørup @ 2023-03-22 12:56 UTC (permalink / raw)
To: Feifei Wang
Cc: dev, konstantin.v.ananyev, nd, konstantin.ananyev, Yuying Zhang,
Beilei Xing, Ruifeng Wang, honnappa.nagarahalli
> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> Sent: Wednesday, 4 January 2023 08.31
>
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
>
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into its software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache.
I am starting to wonder if we have been adding unnecessary feature creep, making this feature too generic.
Could you please describe some of the most important high-volume use cases from real life? It would help to set the scope correctly.
>
> However, this solution poses several constraints:
>
> 1)The receive queue needs to know which transmit queue it should take
> the buffers from. The application logic decides which transmit port to
> use to send out the packets. In many use cases the NIC might have a
> single port ([1], [2], [3]), in which case a given transmit queue is
> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> is easy to configure.
>
> If the NIC has 2 ports (there are several references), then we will have
> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> However, if this is generalized to 'N' ports, the configuration can be
> long. More over the PMD would have to scan a list of transmit queues to
> pull the buffers from.
>
> 2)The other factor that needs to be considered is 'run-to-completion' vs
> 'pipeline' models. In the run-to-completion model, the receive side and
> the transmit side are running on the same lcore serially. In the pipeline
> model. The receive side and transmit side might be running on different
> lcores in parallel. This requires locking. This is not supported at this
> point.
>
> 3)Tx and Rx buffers must be from the same mempool. And we also must
> ensure Tx buffer free number is equal to Rx buffer free number.
> Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> is due to tx_next_dd is a variable to compute tx sw-ring free location.
> Its value will be one more round than the position where next time free
> starts.
>
> Current status in this patch:
> 1)Two APIs are added for users to enable direct-rearm mode:
> In control plane, users can call 'rte_eth_rx_queue_rearm_data_get'
> to get Rx sw_ring pointer and its rxq_info.
> (This avoid Tx load Rx data directly);
>
> In data plane, users can call 'rte_eth_dev_direct_rearm' to rearm Rx
> buffers and free Tx buffers at the same time. Specifically, in this
> API, there are two separated API for Rx and Tx.
> For Tx, 'rte_eth_tx_fill_sw_ring' can fill a given sw_ring by Tx freed
> buffers.
> For Rx, 'rte_eth_rx_flush_descriptor' can flush its descriptors based
> on the rearm buffers.
> Thus, this can separate Rx and Tx operation, and user can even re-arm
> RX queue not from the same driver's TX queue, but from different
> sources too.
> -----------------------------------------------------------------------
> control plane:
> rte_eth_rx_queue_rearm_data_get(*rxq_rearm_data);
> data plane:
> loop {
> rte_eth_dev_direct_rearm(*rxq_rearm_data){
>
> rte_eth_tx_fill_sw_ring{
> for (i = 0; i <= 32; i++) {
> sw_ring.mbuf[i] = tx.mbuf[i];
> }
> }
>
> rte_eth_rx_flush_descriptor{
> for (i = 0; i <= 32; i++) {
> flush descs[i];
> }
> }
> }
> rte_eth_rx_burst;
> rte_eth_tx_burst;
> }
> -----------------------------------------------------------------------
> 2)The i40e driver is changed to do the direct re-arm of the receive
> side.
> 3)The ixgbe driver is changed to do the direct re-arm of the receive
> side.
>
> Testing status:
> (1) dpdk l3fwd test with multiple drivers:
> port 0: 82599 NIC port 1: XL710 NIC
> -------------------------------------------------------------
> Without fast free With fast free
> Thunderx2: +9.44% +7.14%
> -------------------------------------------------------------
>
> (2) dpdk l3fwd test with same driver:
> port 0 && 1: XL710 NIC
> -------------------------------------------------------------
> *Direct rearm with exposing rx_sw_ring:
> Without fast free With fast free
> Ampere altra: +14.98% +15.77%
> n1sdp: +6.47% +0.52%
> -------------------------------------------------------------
>
> (3) VPP test with same driver:
> port 0 && 1: XL710 NIC
> -------------------------------------------------------------
> *Direct rearm with exposing rx_sw_ring:
> Ampere altra: +4.59%
> n1sdp: +5.4%
> -------------------------------------------------------------
>
> Reference:
> [1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-
> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-
> ethernet-network-adapter-e810cqda1/specifications.html
> [3] https://www.broadcom.com/products/ethernet-connectivity/network-
> adapters/100gb-nic-ocp/n1100g
>
> V2:
> 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
> 2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
> 3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
> 4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
>
> V3:
> 1. Seperate Rx and Tx operation with two APIs in direct-rearm (Konstantin)
> 2. Delete L3fwd change for direct rearm (Jerin)
> 3. enable direct rearm in ixgbe driver in Arm
>
> Feifei Wang (3):
> ethdev: enable direct rearm with separate API
> net/i40e: enable direct rearm with separate API
> net/ixgbe: enable direct rearm with separate API
>
> drivers/net/i40e/i40e_ethdev.c | 1 +
> drivers/net/i40e/i40e_ethdev.h | 2 +
> drivers/net/i40e/i40e_rxtx.c | 19 +++
> drivers/net/i40e/i40e_rxtx.h | 4 +
> drivers/net/i40e/i40e_rxtx_vec_common.h | 54 +++++++
> drivers/net/i40e/i40e_rxtx_vec_neon.c | 42 ++++++
> drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
> drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
> drivers/net/ixgbe/ixgbe_rxtx.c | 19 +++
> drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
> drivers/net/ixgbe/ixgbe_rxtx_vec_common.h | 48 ++++++
> drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 52 +++++++
> lib/ethdev/ethdev_driver.h | 10 ++
> lib/ethdev/ethdev_private.c | 2 +
> lib/ethdev/rte_ethdev.c | 52 +++++++
> lib/ethdev/rte_ethdev.h | 174 ++++++++++++++++++++++
> lib/ethdev/rte_ethdev_core.h | 11 ++
> lib/ethdev/version.map | 6 +
> 18 files changed, 504 insertions(+)
>
> --
> 2.25.1
>
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
2023-03-22 12:56 ` Morten Brørup
@ 2023-03-22 13:41 ` Honnappa Nagarahalli
2023-03-22 14:04 ` Morten Brørup
0 siblings, 1 reply; 145+ messages in thread
From: Honnappa Nagarahalli @ 2023-03-22 13:41 UTC (permalink / raw)
To: Morten Brørup, Feifei Wang
Cc: dev, konstantin.v.ananyev, nd, konstantin.ananyev, Yuying Zhang,
Beilei Xing, Ruifeng Wang, nd
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, March 22, 2023 7:57 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> konstantin.ananyev@huawei.com; Yuying Zhang <Yuying.Zhang@intel.com>;
> Beilei Xing <beilei.xing@intel.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
>
> > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > Sent: Wednesday, 4 January 2023 08.31
> >
> > Currently, the transmit side frees the buffers into the lcore cache
> > and the receive side allocates buffers from the lcore cache. The
> > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > stores to lcore cache. The receive side allocates 32 buffers and
> > stores them in the receive side software ring, resulting in 32*8=256B
> > of stores and 256B of load from the lcore cache.
> >
> > This patch proposes a mechanism to avoid freeing to/allocating from
> > the lcore cache. i.e. the receive side will free the buffers from
> > transmit side directly into its software ring. This will avoid the
> > 256B of loads and stores introduced by the lcore cache. It also frees
> > up the cache lines used by the lcore cache.
>
> I am starting to wonder if we have been adding unnecessary feature creep in
> order to make this feature too generic.
Can you please elaborate on the feature creep you are thinking of? The features have been the same since the first implementation; the implementation has just been made more generic.
>
> Could you please describe some of the most important high-volume use cases
> from real life? It would help setting the scope correctly.
The use cases have been discussed several times already.
>
> >
> > However, this solution poses several constraints:
> >
> > 1)The receive queue needs to know which transmit queue it should take
> > the buffers from. The application logic decides which transmit port to
> > use to send out the packets. In many use cases the NIC might have a
> > single port ([1], [2], [3]), in which case a given transmit queue is
> > always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> > is easy to configure.
> >
> > If the NIC has 2 ports (there are several references), then we will
> > have
> > 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> > However, if this is generalized to 'N' ports, the configuration can be
> > long. More over the PMD would have to scan a list of transmit queues
> > to pull the buffers from.
> >
> > 2)The other factor that needs to be considered is 'run-to-completion'
> > vs 'pipeline' models. In the run-to-completion model, the receive side
> > and the transmit side are running on the same lcore serially. In the
> > pipeline model. The receive side and transmit side might be running on
> > different lcores in parallel. This requires locking. This is not
> > supported at this point.
> >
> > 3)Tx and Rx buffers must be from the same mempool. And we also must
> > ensure Tx buffer free number is equal to Rx buffer free number.
> > Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> > is due to tx_next_dd is a variable to compute tx sw-ring free location.
> > Its value will be one more round than the position where next time
> > free starts.
> >
> > Current status in this patch:
> > 1)Two APIs are added for users to enable direct-rearm mode:
> > In control plane, users can call 'rte_eth_rx_queue_rearm_data_get'
> > to get Rx sw_ring pointer and its rxq_info.
> > (This avoid Tx load Rx data directly);
> >
> > In data plane, users can call 'rte_eth_dev_direct_rearm' to rearm Rx
> > buffers and free Tx buffers at the same time. Specifically, in this
> > API, there are two separated API for Rx and Tx.
> > For Tx, 'rte_eth_tx_fill_sw_ring' can fill a given sw_ring by Tx
> > freed buffers.
> > For Rx, 'rte_eth_rx_flush_descriptor' can flush its descriptors based
> > on the rearm buffers.
> > Thus, this can separate Rx and Tx operation, and user can even re-arm
> > RX queue not from the same driver's TX queue, but from different
> > sources too.
> > -----------------------------------------------------------------------
> > control plane:
> > rte_eth_rx_queue_rearm_data_get(*rxq_rearm_data);
> > data plane:
> > loop {
> > rte_eth_dev_direct_rearm(*rxq_rearm_data){
> >
> > rte_eth_tx_fill_sw_ring{
> > for (i = 0; i <= 32; i++) {
> > sw_ring.mbuf[i] = tx.mbuf[i];
> > }
> > }
> >
> > rte_eth_rx_flush_descriptor{
> > for (i = 0; i <= 32; i++) {
> > flush descs[i];
> > }
> > }
> > }
> > rte_eth_rx_burst;
> > rte_eth_tx_burst;
> > }
> > ----------------------------------------------------------------------
> > - 2)The i40e driver is changed to do the direct re-arm of the receive
> > side.
> > 3)The ixgbe driver is changed to do the direct re-arm of the receive
> > side.
> >
> > Testing status:
> > (1) dpdk l3fwd test with multiple drivers:
> > port 0: 82599 NIC port 1: XL710 NIC
> > -------------------------------------------------------------
> > Without fast free With fast free
> > Thunderx2: +9.44% +7.14%
> > -------------------------------------------------------------
> >
> > (2) dpdk l3fwd test with same driver:
> > port 0 && 1: XL710 NIC
> > -------------------------------------------------------------
> > *Direct rearm with exposing rx_sw_ring:
> > Without fast free With fast free
> > Ampere altra: +14.98% +15.77%
> > n1sdp: +6.47% +0.52%
> > -------------------------------------------------------------
> >
> > (3) VPP test with same driver:
> > port 0 && 1: XL710 NIC
> > -------------------------------------------------------------
> > *Direct rearm with exposing rx_sw_ring:
> > Ampere altra: +4.59%
> > n1sdp: +5.4%
> > -------------------------------------------------------------
> >
> > Reference:
> > [1]
> > https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-
> >
> CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECrypto
> Disabled
> > / [2]
> > https://www.intel.com/content/www/us/en/products/sku/192561/intel-
> > ethernet-network-adapter-e810cqda1/specifications.html
> > [3] https://www.broadcom.com/products/ethernet-connectivity/network-
> > adapters/100gb-nic-ocp/n1100g
> >
> > V2:
> > 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa) 2.
> > Add 'txq_data_get' API to get txq info for Rx (Konstantin) 3. Use
> > input parameter to enable direct rearm in l3fwd (Konstantin) 4. Add
> > condition detection for direct rearm API (Morten, Andrew Rybchenko)
> >
> > V3:
> > 1. Seperate Rx and Tx operation with two APIs in direct-rearm
> > (Konstantin) 2. Delete L3fwd change for direct rearm (Jerin) 3. enable
> > direct rearm in ixgbe driver in Arm
> >
> > Feifei Wang (3):
> > ethdev: enable direct rearm with separate API
> > net/i40e: enable direct rearm with separate API
> > net/ixgbe: enable direct rearm with separate API
> >
> > drivers/net/i40e/i40e_ethdev.c | 1 +
> > drivers/net/i40e/i40e_ethdev.h | 2 +
> > drivers/net/i40e/i40e_rxtx.c | 19 +++
> > drivers/net/i40e/i40e_rxtx.h | 4 +
> > drivers/net/i40e/i40e_rxtx_vec_common.h | 54 +++++++
> > drivers/net/i40e/i40e_rxtx_vec_neon.c | 42 ++++++
> > drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
> > drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
> > drivers/net/ixgbe/ixgbe_rxtx.c | 19 +++
> > drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
> > drivers/net/ixgbe/ixgbe_rxtx_vec_common.h | 48 ++++++
> > drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 52 +++++++
> > lib/ethdev/ethdev_driver.h | 10 ++
> > lib/ethdev/ethdev_private.c | 2 +
> > lib/ethdev/rte_ethdev.c | 52 +++++++
> > lib/ethdev/rte_ethdev.h | 174 ++++++++++++++++++++++
> > lib/ethdev/rte_ethdev_core.h | 11 ++
> > lib/ethdev/version.map | 6 +
> > 18 files changed, 504 insertions(+)
> >
> > --
> > 2.25.1
> >
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 0/3] Direct re-arming of buffers on receive side
2023-03-22 13:41 ` Honnappa Nagarahalli
@ 2023-03-22 14:04 ` Morten Brørup
0 siblings, 0 replies; 145+ messages in thread
From: Morten Brørup @ 2023-03-22 14:04 UTC (permalink / raw)
To: Honnappa Nagarahalli, Feifei Wang
Cc: dev, konstantin.v.ananyev, nd, konstantin.ananyev, Yuying Zhang,
Beilei Xing, Ruifeng Wang, nd
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 22 March 2023 14.42
>
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Wednesday, March 22, 2023 7:57 AM
> >
> > > From: Feifei Wang [mailto:feifei.wang2@arm.com]
> > > Sent: Wednesday, 4 January 2023 08.31
> > >
> > > Currently, the transmit side frees the buffers into the lcore cache
> > > and the receive side allocates buffers from the lcore cache. The
> > > transmit side typically frees 32 buffers resulting in 32*8=256B of
> > > stores to lcore cache. The receive side allocates 32 buffers and
> > > stores them in the receive side software ring, resulting in 32*8=256B
> > > of stores and 256B of load from the lcore cache.
> > >
> > > This patch proposes a mechanism to avoid freeing to/allocating from
> > > the lcore cache. i.e. the receive side will free the buffers from
> > > transmit side directly into its software ring. This will avoid the
> > > 256B of loads and stores introduced by the lcore cache. It also frees
> > > up the cache lines used by the lcore cache.
> >
> > I am starting to wonder if we have been adding unnecessary feature creep in
> > order to make this feature too generic.
> Can you please elaborate on the feature creep you are thinking? The features
> have been the same since the first implementation, but it is made more
> generic.
Maybe not "features" as such; but the API has evolved, and perhaps we could simplify both the API and the implementation if we narrowed the scope.
I'm not saying that what we have is bad or too complex; I'm only asking to consider if there are opportunities to take a step back and simplify some things.
>
> >
> > Could you please describe some of the most important high-volume use cases
> > from real life? It would help setting the scope correctly.
> The use cases have been discussed several times already.
Yes, but they should be mentioned in the patch cover letter - and later on in the documentation.
It will help limit the scope while developing this feature. And it will make it easier for application developers to relate to the feature and determine whether it is relevant for their application.
>
> >
> > >
> > > However, this solution poses several constraints:
> > >
> > > 1)The receive queue needs to know which transmit queue it should take
> > > the buffers from. The application logic decides which transmit port to
> > > use to send out the packets. In many use cases the NIC might have a
> > > single port ([1], [2], [3]), in which case a given transmit queue is
> > > always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> > > is easy to configure.
> > >
> > > If the NIC has 2 ports (there are several references), then we will
> > > have
> > > 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> > > However, if this is generalized to 'N' ports, the configuration can be
> > > long. More over the PMD would have to scan a list of transmit queues
> > > to pull the buffers from.
> > >
> > > 2)The other factor that needs to be considered is 'run-to-completion'
> > > vs 'pipeline' models. In the run-to-completion model, the receive side
> > > and the transmit side are running on the same lcore serially. In the
> > > pipeline model. The receive side and transmit side might be running on
> > > different lcores in parallel. This requires locking. This is not
> > > supported at this point.
> > >
> > > 3)Tx and Rx buffers must be from the same mempool. And we also must
> > > ensure Tx buffer free number is equal to Rx buffer free number.
> > > Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> > > is due to tx_next_dd is a variable to compute tx sw-ring free location.
> > > Its value will be one more round than the position where next time
> > > free starts.
> > >
> > > Current status in this patch:
> > > 1)Two APIs are added for users to enable direct-rearm mode:
> > > In control plane, users can call 'rte_eth_rx_queue_rearm_data_get'
> > > to get Rx sw_ring pointer and its rxq_info.
> > > (This avoid Tx load Rx data directly);
> > >
> > > In data plane, users can call 'rte_eth_dev_direct_rearm' to rearm Rx
> > > buffers and free Tx buffers at the same time. Specifically, in this
> > > API, there are two separated API for Rx and Tx.
> > > For Tx, 'rte_eth_tx_fill_sw_ring' can fill a given sw_ring by Tx
> > > freed buffers.
> > > For Rx, 'rte_eth_rx_flush_descriptor' can flush its descriptors based
> > > on the rearm buffers.
> > > Thus, this can separate Rx and Tx operation, and user can even re-arm
> > > RX queue not from the same driver's TX queue, but from different
> > > sources too.
> > > -----------------------------------------------------------------------
> > > control plane:
> > > rte_eth_rx_queue_rearm_data_get(*rxq_rearm_data);
> > > data plane:
> > > loop {
> > > rte_eth_dev_direct_rearm(*rxq_rearm_data){
> > >
> > > rte_eth_tx_fill_sw_ring{
> > > for (i = 0; i <= 32; i++) {
> > > sw_ring.mbuf[i] = tx.mbuf[i];
> > > }
> > > }
> > >
> > > rte_eth_rx_flush_descriptor{
> > > for (i = 0; i <= 32; i++) {
> > > flush descs[i];
> > > }
> > > }
> > > }
> > > rte_eth_rx_burst;
> > > rte_eth_tx_burst;
> > > }
> > > ----------------------------------------------------------------------
> > > - 2)The i40e driver is changed to do the direct re-arm of the receive
> > > side.
> > > 3)The ixgbe driver is changed to do the direct re-arm of the receive
> > > side.
> > >
> > > Testing status:
> > > (1) dpdk l3fwd test with multiple drivers:
> > > port 0: 82599 NIC port 1: XL710 NIC
> > > -------------------------------------------------------------
> > > Without fast free With fast free
> > > Thunderx2: +9.44% +7.14%
> > > -------------------------------------------------------------
> > >
> > > (2) dpdk l3fwd test with same driver:
> > > port 0 && 1: XL710 NIC
> > > -------------------------------------------------------------
> > > *Direct rearm with exposing rx_sw_ring:
> > > Without fast free With fast free
> > > Ampere altra: +14.98% +15.77%
> > > n1sdp: +6.47% +0.52%
> > > -------------------------------------------------------------
> > >
> > > (3) VPP test with same driver:
> > > port 0 && 1: XL710 NIC
> > > -------------------------------------------------------------
> > > *Direct rearm with exposing rx_sw_ring:
> > > Ampere altra: +4.59%
> > > n1sdp: +5.4%
> > > -------------------------------------------------------------
> > >
> > > Reference:
> > > [1]
> > > https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-
> > >
> > CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECrypto
> > Disabled
> > > / [2]
> > > https://www.intel.com/content/www/us/en/products/sku/192561/intel-
> > > ethernet-network-adapter-e810cqda1/specifications.html
> > > [3] https://www.broadcom.com/products/ethernet-connectivity/network-
> > > adapters/100gb-nic-ocp/n1100g
> > >
> > > V2:
> > > 1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa) 2.
> > > Add 'txq_data_get' API to get txq info for Rx (Konstantin) 3. Use
> > > input parameter to enable direct rearm in l3fwd (Konstantin) 4. Add
> > > condition detection for direct rearm API (Morten, Andrew Rybchenko)
> > >
> > > V3:
> > > 1. Seperate Rx and Tx operation with two APIs in direct-rearm
> > > (Konstantin) 2. Delete L3fwd change for direct rearm (Jerin) 3. enable
> > > direct rearm in ixgbe driver in Arm
> > >
> > > Feifei Wang (3):
> > > ethdev: enable direct rearm with separate API
> > > net/i40e: enable direct rearm with separate API
> > > net/ixgbe: enable direct rearm with separate API
> > >
> > > drivers/net/i40e/i40e_ethdev.c | 1 +
> > > drivers/net/i40e/i40e_ethdev.h | 2 +
> > > drivers/net/i40e/i40e_rxtx.c | 19 +++
> > > drivers/net/i40e/i40e_rxtx.h | 4 +
> > > drivers/net/i40e/i40e_rxtx_vec_common.h | 54 +++++++
> > > drivers/net/i40e/i40e_rxtx_vec_neon.c | 42 ++++++
> > > drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
> > > drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
> > > drivers/net/ixgbe/ixgbe_rxtx.c | 19 +++
> > > drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
> > > drivers/net/ixgbe/ixgbe_rxtx_vec_common.h | 48 ++++++
> > > drivers/net/ixgbe/ixgbe_rxtx_vec_neon.c | 52 +++++++
> > > lib/ethdev/ethdev_driver.h | 10 ++
> > > lib/ethdev/ethdev_private.c | 2 +
> > > lib/ethdev/rte_ethdev.c | 52 +++++++
> > > lib/ethdev/rte_ethdev.h | 174 ++++++++++++++++++++++
> > > lib/ethdev/rte_ethdev_core.h | 11 ++
> > > lib/ethdev/version.map | 6 +
> > > 18 files changed, 504 insertions(+)
> > >
> > > --
> > > 2.25.1
> > >
>
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
2023-03-07 20:41 ` Ferruh Yigit
@ 2023-03-22 14:43 ` Honnappa Nagarahalli
0 siblings, 0 replies; 145+ messages in thread
From: Honnappa Nagarahalli @ 2023-03-22 14:43 UTC (permalink / raw)
To: Ferruh Yigit, Morten Brørup, Feifei Wang, thomas,
Andrew Rybchenko, techboard
Cc: dev, konstantin.v.ananyev, nd, Ruifeng Wang, nd
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@amd.com>
> Sent: Tuesday, March 7, 2023 2:41 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Morten Brørup
> <mb@smartsharesystems.com>; Feifei Wang <Feifei.Wang2@arm.com>;
> thomas@monjalon.net; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>; techboard@dpdk.org
> Cc: dev@dpdk.org; konstantin.v.ananyev@yandex.ru; nd <nd@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>
> Subject: Re: [PATCH v3 1/3] ethdev: enable direct rearm with separate API
>
> On 3/7/2023 6:12 AM, Honnappa Nagarahalli wrote:
> > <snip>
> >
> >>
> >> On 3/6/2023 1:26 PM, Morten Brørup wrote:
> >>>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> >>>> Sent: Monday, 6 March 2023 13.49
> >>>>
> >>>> On 1/4/2023 8:21 AM, Morten Brørup wrote:
> >>>>>> From: Feifei Wang [mailto:feifei.wang2@arm.com]
> >>>>>> Sent: Wednesday, 4 January 2023 08.31
> >>>>>>
> >>>>>> Add 'tx_fill_sw_ring' and 'rx_flush_descriptor' API into direct
> >>>>>> rearm mode for separate Rx and Tx Operation. And this can support
> >>>>>> different multiple sources in direct rearm mode. For examples, Rx
> >>>>>> driver is ixgbe, and Tx driver is i40e.
> >>>>>>
> >>>>>> Suggested-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> >>>>>> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>>>> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> >>>>>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>>>>> Reviewed-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> >>>>>> ---
> >>>>>
> >>>>> This feature looks very promising for performance. I am pleased to
> >>>>> see
> >>>> progress on it.
> >>>>>
> >>>>
> >>>> Hi Morten,
> >>>>
> >>>> Yes it brings some performance, but not to generic use case, only
> >>>> to specific and constraint use case.
> >>>
> >>> I got the impression that the supported use case is a prominent and
> >>> important
> >> use case.
> >>>
> >>
> >> Can you please give real life samples for this use case, other than
> >> just showing better performance number in the test bench? This helps
> >> to understand the reasoning better.
> > The very first patch started off with a constrained but prominent use case.
> Though, DPU based PCIe cards running DPDK applications with 1 or max 2 ports
> being used in tons of data centers is not a secret anymore and not a small use
> case that can be ignored.
> > However, the design of the patch has changed significantly from then. Now
> the solution can be applied to any generic use case that uses run-to-completion
> model of DPDK. i.e. the mapping of the RX and TX ports can be done
> dynamically in the data plane threads. There is no need of static configuration
> from control plane.
> >
> > On the test bench, we need to make up our mind. When we see
> improvements, we say it is just a test bench. On other occasions when the test
> bench does not show any improvements (but improvements are shown by
> other metrics), we say the test bench does not show any improvements.
> >
> >>
> >>> This is the primary argument for considering such a complex
> >>> non-generic
> >> feature.
> > I am not sure what is the complexity here, can you please elaborate?
>
> I am considering from user perspective.
Thanks for clarifying Ferruh.
>
> OK, DPDK is already low level, but ethdev has only a handful of datapath APIs (6
> of them), and main ones are easy to comprehend:
> rte_eth_rx_burst(port_id, queue_id, rx_pkts, nb_pkts);
> rte_eth_tx_burst(port_id, queue_id, tx_pkts, nb_pkts);
>
> They (magically) Rx/Tx buffers, easy to grasp.
I think the pktmbuf pool part is missing here. The user needs to create a pktmbuf pool by calling rte_pktmbuf_pool_create and has to pass the cache_size parameter.
This requires the user to understand what a cache is, why it is required and how it affects the performance.
There are further complexities associated with pktmbuf pool - creating a pool with external pinned memory, creating a pool with ops name etc.
So, practically, the user needs to be aware of more details than just the RX and TX functions.
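As a minimal sketch of that setup (the pool name and the sizes below are illustrative assumptions, not values from any patch in this thread):
-----------------------------------------------------------------------
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define NB_MBUFS 8191        /* total mbufs in the pool */
#define MBUF_CACHE_SIZE 250  /* the per-lcore cache the user has to reason about */

static struct rte_mempool *
create_pktmbuf_pool(int socket_id)
{
        /* cache_size is the parameter that ties mempool performance to the
         * per-lcore cache; direct rearm / mbufs recycle bypasses that cache.
         */
        return rte_pktmbuf_pool_create("mbuf_pool", NB_MBUFS, MBUF_CACHE_SIZE,
                        0 /* priv_size */, RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
}
-----------------------------------------------------------------------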
>
> Maybe rte_eth_tx_prepare() is a little less obvious (why/when to use it), but still
> I believe simple.
>
> Whoever looks to these APIs can figure out how to use in the application.
>
> The other three is related to the descriptors and I am not sure about their use-
> case, I assume they are mostly good for debugging.
>
>
> But now we are adding new datapath APIs:
> rte_eth_tx_fill_sw_ring(port_id, queue_id, rxq_rearm_data);
> rte_eth_rx_flush_descriptor(port_id, queue_id, nb_rearm);
>
> When you talk about SW ring and re-arming descriptors I believe you will loose
> most of the users already, driver developers will know what it is, you will know
> what that is, but people who are not close to the Ethernet HW won't.
Agree, the names could be better. I personally do not want to separate out these two APIs, as I do not think a use case (receiving and transmitting pkts across NICs of different types) exists that requires keeping them separate. But we did this based on feedback and to maintain a cleaner separation between the RX and TX paths.
We will try to propose new names for these.
>
> And these APIs will be very visible, not like one of many control plane dev_ops.
> So this can confuse users who are not familiar with details.
>
> Usage of these APIs comes with restrictions, it is possible that at some
> percentage of users will miss these restrictions or miss-understand them and will
> have issues.
Agreed, there are several features already with restrictions.
>
> Or many may be intimidated by them and stay away from using these APIs,
> leaving them as a burden to maintain, to test, to fix. That is why I think a real life
> usecase is needed, in that case at least we will know some consumers will fix or
> let us know when they get broken.
>
> It may be possible to hide details under driver and user only set an offload flag,
> similar to FAST_FREE, but in that case feature will loose flexibility and it will be
> even more specific, perhaps making it less useful.
Agree.
>
>
> > I see other patches/designs (ex: proactive error recovery) which are way more
> complex to understand and comprehend.
> >
> >>>
> >>>>
> >>>> And changes are relatively invasive comparing the usecase it
> >>>> supports, like it adds new two inline datapath functions and a new
> dev_ops.
> >>>>
> >>>> I am worried the unnecessary complexity and possible regressions in
> >>>> the fundamental and simple parts of the project, with a good
> >>>> intention to gain a few percentage performance in a specific
> >>>> usecase, can hurt the project.
> > I agree that we are touching some fundamental parts of the project. But, we
> also need to realize that those fundamental parts were not developed on
> architectures that have joined the project way later. Similarly, the use cases
> have evolved significantly from the original intended use cases. We cannot hold
> on to those fundamental designs if they affect the performance on other
> architectures while addressing prominent new use cases.
> > Please note that this patch does not break any existing features or affect their
> performance in any negative way. The generic and originally intended use cases
> can benefit from this feature.
> >
> >>>>
> >>>>
> >>>> I can see this is compared to MBUF_FAST_FREE feature, but
> >>>> MBUF_FAST_FREE is just an offload benefiting from existing offload
> >>>> infrastructure, which requires very small update and logically
> >>>> change in application and simple to implement in the drivers. So,
> >>>> they are not same from complexity perspective.
> >>>>
> >>>> Briefly, I am not comfortable with this change, I would like to see
> >>>> an explicit approval and code review from techboard to proceed.
> >>>
> >>> I agree that the complexity is very high, and thus requires extra
> consideration.
> >> Your suggested techboard review and approval process seems like a
> >> good solution.
> > We can add to the agenda for the next Techboard meeting.
> >
> >>>
> >>> And the performance benefit of direct rearm should be compared to
> >>> the
> >> performance using the new zero-copy mempool API.
> >>>
> >>> -Morten
> >>>
^ permalink raw reply [flat|nested] 145+ messages in thread
* RE: 回复: [PATCH v3 2/3] net/i40e: enable direct rearm with separate API
2023-03-19 16:11 ` Konstantin Ananyev
@ 2023-03-23 10:49 ` Feifei Wang
0 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-03-23 10:49 UTC (permalink / raw)
To: Konstantin Ananyev, Konstantin Ananyev, Yuying Zhang,
Beilei Xing, Ruifeng Wang
Cc: dev, nd, Honnappa Nagarahalli, nd
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Monday, March 20, 2023 12:11 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>; Konstantin Ananyev
> <konstantin.ananyev@huawei.com>; Yuying Zhang
> <Yuying.Zhang@intel.com>; Beilei Xing <beilei.xing@intel.com>; Ruifeng
> Wang <Ruifeng.Wang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: Re: 回复: [PATCH v3 2/3] net/i40e: enable direct rearm with
> separate API
>
>
> >>>>>>> +int
> >>>>>>> +i40e_tx_fill_sw_ring(void *tx_queue,
> >>>>>>> + struct rte_eth_rxq_rearm_data *rxq_rearm_data) {
> >>>>>>> + struct i40e_tx_queue *txq = tx_queue;
> >>>>>>> + struct i40e_tx_entry *txep;
> >>>>>>> + void **rxep;
> >>>>>>> + struct rte_mbuf *m;
> >>>>>>> + int i, n;
> >>>>>>> + int nb_rearm = 0;
> >>>>>>> +
> >>>>>>> + if (*rxq_rearm_data->rearm_nb < txq->tx_rs_thresh ||
> >>>>>>> + txq->nb_tx_free > txq->tx_free_thresh)
> >>>>>>> + return 0;
> >>>>>>> +
> >>>>>>> + /* check DD bits on threshold descriptor */
> >>>>>>> + if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> >>>>>>> +
> >> rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
> >>>>>>> +
> >>>>>> rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
> >>>>>>> + return 0;
> >>>>>>> +
> >>>>>>> + n = txq->tx_rs_thresh;
> >>>>>>> +
> >>>>>>> + /* first buffer to free from S/W ring is at index
> >>>>>>> + * tx_next_dd - (tx_rs_thresh-1)
> >>>>>>> + */
> >>>>>>> + txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
> >>>>>>> + rxep = rxq_rearm_data->rx_sw_ring;
> >>>>>>> + rxep += *rxq_rearm_data->rearm_start;
> >>>>>>> +
> >>>>>>> + if (txq->offloads &
> >> RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> >>>>>>> + /* directly put mbufs from Tx to Rx */
> >>>>>>> + for (i = 0; i < n; i++, rxep++, txep++)
> >>>>>>> + *rxep = txep[0].mbuf;
> >>>>>>> + } else {
> >>>>>>> + for (i = 0; i < n; i++, rxep++) {
> >>>>>>> + m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> >>>>
> >>>> One thing I forgot to ask:
> >>>> What would happen if this mbuf belongs to different mempool (not
> >>>> one that we specify at rx_queue_setup())?
> >>>> Do we need to check it here?
> >>>> Or would it be upper layer constraint?
> >>>> Or...?
> >>>>
> >>>
> >>> First, 'different mempool' is valid for no FAST_FREE path in
> tx_free_buffers.
> >>>
> >>> If buffers belong to different mempool, we can have an example here:
> >>> Buffer 1 from mempool 1, its recycle path is:
> >>> --------------------------------------------------------------------
> >>> --
> >>> ------------------- 1. queue_setup: rearm from mempool 1 into Rx
> >>> sw-ring 2. rte_eth_Rx_burst: used by user app (Rx) 3.
> >>> rte_eth_Tx_burst: mount on Tx sw-ring 4. rte_eth_direct_rearm: free
> >>> into Rx sw-ring:
> >>> or
> >>> tx_free_buffers: free into mempool 1 (no fast_free path)
> >>> --------------------------------------------------------------------
> >>> --
> >>> -------------------
> >>>
> >>> Buffer 2 from mempool 2, its recycle path is:
> >>> --------------------------------------------------------------------
> >>> --
> >>> ------------------- 1. queue_setup: rearm from mempool 2 into Rx
> >>> sw-ring 2. rte_eth_Rx_burst: used by user app (Rx) 3.
> >>> rte_eth_Tx_burst: mount on Tx sw-ring 4. rte_eth_direct_rearm: free
> >>> into Rx sw-ring
> >>> or
> >>> tx_free_buffers: free into mempool 2 (no fast_free_path)
> >>> --------------------------------------------------------------------
> >>> --
> >>> -------------------
> >>>
> >>> Thus, buffers from Tx different mempools are the same for Rx. The
> >>> difference point is that they will be freed into different mempool
> >>> if the
> >> thread uses generic free buffers.
> >>> I think this cannot affect direct-rearm mode, and we do not need to
> >>> check
> >> this.
> >>
> >> I understand that it should work even with multiple mempools.
> >> What I am trying to say - user may not want to use mbufs from
> >> particular mempool for RX (while it is still ok to use it for TX).
> >> Let say user can have a separate mempool with small data-buffers
> >> (less then normal MTU) to send some 'special' paclets, or even use
> >> this memppol with small buffers for zero-copy updating of packet L2/L3
> headers, etc.
> >> Or it could be some 'special' user provided mempool.
> >> That's why I wonder should we allow only mbufs from mempool that is
> >> assigned to that RX queue.
> >
> > Sorry for my misleading. If I understand correctly this time, you
> > means a special mempool. Maybe its buffer size is very small and this Tx
> buffer is generated from control plane.
> >
> > However, if we recycle this Tx buffer into Rx buffer ring, there maybe
> > some error due to its size is so small.
> >
> > Thus we can only allow general buffers which is valid for Rx buffer
> > ring. Furthermore, this should be user's responsibility to ensure the
> > Tx recycling buffers should be valid. If we check this in the data plane, it will
> cost a lot of CPU cycles. At last, what we can do is to add constraint in the
> notes to remind users.
>
> As I thought: in theory we can add 'struct rte_mempool *mp'
> into rte_eth_rxq_rearm_data.
> And then:
> if (mbuf->pool == rxq_rearm_data->mp)
> /* put mbuf into rearm buffer */
> else
> /* free mbuf */
> For the 'proper' config (when txq contains mbufs from expected mempool)
> the overhead will be minimal.
> In other case it might be higher, but still would work and no need for extra
> limitations.
It's a good idea. I tried to test performance with this change, and there is currently
no performance degradation. Thus, I added this check in the latest version.
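For clarity, a minimal self-contained sketch of the check being discussed (the helper name and its shape are illustrative; the final driver code differs in detail):
-----------------------------------------------------------------------
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Recycle the Tx mbuf into the Rx slot only if it belongs to the Rx queue's
 * mempool; otherwise fall back to a normal free. Returns 1 if recycled.
 */
static inline int
recycle_or_free(struct rte_mbuf *m, struct rte_mempool *rx_mp,
                struct rte_mbuf **rx_slot)
{
        m = rte_pktmbuf_prefree_seg(m);
        if (m == NULL)
                return 0; /* mbuf still referenced elsewhere */
        if (m->pool == rx_mp) {
                *rx_slot = m; /* put mbuf into the Rx rearm ring */
                return 1;
        }
        rte_mempool_put(m->pool, m); /* unexpected mempool: normal free */
        return 0;
}
-----------------------------------------------------------------------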
>
>
> >>
> >>>
> >>>>>>> + if (m != NULL) {
> >>>>>>> + *rxep = m;
> >>>>>>> + nb_rearm++;
> >>>>>>> + }
> >>>>>>> + }
> >>>>>>> + n = nb_rearm;
> >>>>>>> + }
> >>>>>>> +
> >>>>>>> + /* update counters for Tx */
> >>>>>>> + txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq-
> >>> tx_rs_thresh);
> >>>>>>> + txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq-
> >>> tx_rs_thresh);
> >>>>>>> + if (txq->tx_next_dd >= txq->nb_tx_desc)
> >>>>>>> + txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
> >>>>>>> +
> >>>>>>> + return n;
> >>>>>>> +}
> >>>>>>> +
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v8 0/4] Recycle mbufs from Tx queue into Rx queue
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
` (7 preceding siblings ...)
2023-01-04 7:30 ` [PATCH v3 0/3] " Feifei Wang
@ 2023-08-02 7:38 ` Feifei Wang
2023-08-02 7:38 ` [PATCH v8 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
` (3 more replies)
2023-08-02 8:08 ` [PATCH v9 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
` (4 subsequent siblings)
13 siblings, 4 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 7:38 UTC (permalink / raw)
Cc: dev, nd, Feifei Wang
Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.
This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache. i.e. the receive side will free the buffers from
transmit side directly into its software ring. This will avoid the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache. We call this mode mbufs recycle mode.
In the latest version, mbufs recycle mode is packaged as a separate API.
This allows the user to change the rxq/txq pairing in real time in the data plane,
according to the application's analysis of the packet flow, for example:
-----------------------------------------------------------------------
Step 1: upper application analyse the flow direction
Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid)
Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info);
Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------
The above allows the user to change the rxq/txq pairing at run time without knowing
the direction of the flow in advance, which effectively expands the use scenarios of
mbufs recycle mode.
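A minimal run-to-completion sketch of the above steps (the port/queue ids, burst size and error handling are illustrative; real applications will differ):
-----------------------------------------------------------------------
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void
forward_loop(uint16_t rx_port, uint16_t rx_queue,
                uint16_t tx_port, uint16_t tx_queue)
{
        struct rte_eth_recycle_rxq_info rxq_info;
        struct rte_mbuf *pkts[BURST_SIZE];
        uint16_t nb_rx, nb_tx;

        /* Step 2 (control plane): fetch the Rx queue's mbuf ring info once,
         * or whenever the application re-pairs the Rx/Tx queues.
         */
        if (rte_eth_recycle_rx_queue_info_get(rx_port, rx_queue, &rxq_info) != 0)
                return;

        for (;;) {
                /* Step 3 (data plane): move used mbufs from the Tx sw-ring
                 * straight into the Rx mbuf ring, bypassing the mempool.
                 */
                rte_eth_recycle_mbufs(rx_port, rx_queue, tx_port, tx_queue,
                                &rxq_info);

                /* Steps 4 and 5: the usual Rx/Tx bursts. */
                nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, BURST_SIZE);
                if (nb_rx == 0)
                        continue;

                nb_tx = rte_eth_tx_burst(tx_port, tx_queue, pkts, nb_rx);
                while (nb_tx < nb_rx) /* drop what could not be sent */
                        rte_pktmbuf_free(pkts[nb_tx++]);
        }
}
-----------------------------------------------------------------------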
Furthermore, mbufs recycle mode is no longer limited to a single PMD: it can move
mbufs between PMDs from different vendors, and can even place the mbufs anywhere
into your Rx mbuf ring as long as the address of the mbuf ring can be provided.
In the latest version, we enable mbufs recycle mode in the i40e and ixgbe PMDs. We also
tried using the i40e driver on the Rx side and the ixgbe driver on the Tx side, and
achieved a 7-9% performance improvement with mbufs recycle mode.
Difference between mbufs recycle mode, the ZC API used in mempool, and the general path:
For general path:
Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For mbufs recycle:
Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus, in one loop, compared to the general path, mbufs recycle mode saves 32+32=64 pkts of memcpy;
compared to the ZC API used in mempool, mbufs recycle mode saves 32 pkts of memcpy in each loop.
So, mbufs recycle has its own benefits.
Testing status:
(1) dpdk l3fwd test with multiple drivers:
port 0: 82599 NIC port 1: XL710 NIC
-------------------------------------------------------------
Without fast free With fast free
Thunderx2: +7.53% +13.54%
-------------------------------------------------------------
(2) dpdk l3fwd test with same driver:
port 0 && 1: XL710 NIC
-------------------------------------------------------------
Without fast free With fast free
Ampere altra: +12.61% +11.42%
n1sdp: +8.30% +3.85%
x86-sse: +8.43% +3.72%
-------------------------------------------------------------
(3) Performance comparison with ZC_mempool used
port 0 && 1: XL710 NIC
with fast free
-------------------------------------------------------------
With recycle buffer With zc_mempool
Ampere altra: 11.42% 3.54%
-------------------------------------------------------------
Furthermore, we add a recycle_mbufs engine in testpmd. Because the XL710 NIC is the
I/O bottleneck for testpmd on Ampere Altra, we cannot see a throughput change
compared with the I/O fwd engine. However, using the record command in testpmd:
'$set record-burst-stats on'
we can see that the ratio of 'Rx/Tx burst size of 32' is reduced. This
indicates that mbufs recycle mode can save CPU cycles.
V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
V3:
1. Separate Rx and Tx operations with two APIs in direct-rearm (Konstantin)
2. Delete L3fwd change for direct rearm (Jerin)
3. Enable direct rearm in the ixgbe driver on Arm
v4:
1. Rename direct-rearm as buffer recycle. Based on this, function names
and variable names are changed to make this mode more general for all
drivers. (Konstantin, Morten)
2. Add ring wrapping check (Konstantin)
v5:
1. some change for ethdev API (Morten)
2. add support for avx2, sse, altivec path
v6:
1. fix ixgbe build issue in ppc
2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
API wrapper (Tech Board meeting)
3. add recycle_mbufs engine in testpmd (Tech Board meeting)
4. add namespace in the functions related to mbufs recycle (Ferruh)
v7:
1. move 'rxq/txq data' pointers to the beginning of eth_dev structure,
in order to keep them in the same cache line as rx/tx_burst function
pointers (Morten)
2. add the extra description for 'rte_eth_recycle_mbufs' to show it can
support feeding 1 Rx queue from 2 Tx queues in the same thread
(Konstantin)
3. For i40e/ixgbe driver, mark the previously copied buffers as invalid if
there are Tx buffers with refcnt > 1 or from an unexpected mempool (Konstantin)
4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
in testpmd fwd engine (Morten)
v8:
1. add arm/x86 build option to fix ixgbe build issue in ppc
Feifei Wang (4):
ethdev: add API for mbufs recycle mode
net/i40e: implement mbufs recycle mode
net/ixgbe: implement mbufs recycle mode
app/testpmd: add recycle mbufs engine
app/test-pmd/meson.build | 1 +
app/test-pmd/recycle_mbufs.c | 58 ++++++
app/test-pmd/testpmd.c | 1 +
app/test-pmd/testpmd.h | 3 +
doc/guides/rel_notes/release_23_11.rst | 15 ++
doc/guides/testpmd_app_ug/run_app.rst | 1 +
doc/guides/testpmd_app_ug/testpmd_funcs.rst | 5 +-
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
.../net/i40e/i40e_recycle_mbufs_vec_common.c | 147 ++++++++++++++
drivers/net/i40e/i40e_rxtx.c | 32 ++++
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/meson.build | 1 +
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
.../ixgbe/ixgbe_recycle_mbufs_vec_common.c | 143 ++++++++++++++
drivers/net/ixgbe/ixgbe_rxtx.c | 37 +++-
drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
drivers/net/ixgbe/meson.build | 3 +
lib/ethdev/ethdev_driver.h | 10 +
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 31 +++
lib/ethdev/rte_ethdev.h | 181 ++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 23 ++-
lib/ethdev/version.map | 4 +
25 files changed, 704 insertions(+), 9 deletions(-)
create mode 100644 app/test-pmd/recycle_mbufs.c
create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v8 1/4] ethdev: add API for mbufs recycle mode
2023-08-02 7:38 ` [PATCH v8 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
@ 2023-08-02 7:38 ` Feifei Wang
2023-08-02 7:38 ` [PATCH v8 2/4] net/i40e: implement " Feifei Wang
` (2 subsequent siblings)
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 7:38 UTC (permalink / raw)
To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang,
Morten Brørup
Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
APIs to recycle used mbufs from a transmit queue of an Ethernet device,
and move these mbufs into a mbuf ring for a receive queue of an Ethernet
device. This can bypass mempool 'put/get' operations hence saving CPU
cycles.
For each recycled mbuf, the rte_eth_recycle_mbufs() function performs
the following operations:
- Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf
ring.
- Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
from the Tx mbuf ring.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/rel_notes/release_23_11.rst | 15 ++
lib/ethdev/ethdev_driver.h | 10 ++
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 31 +++++
lib/ethdev/rte_ethdev.h | 181 +++++++++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 23 +++-
lib/ethdev/version.map | 4 +
7 files changed, 260 insertions(+), 6 deletions(-)
diff --git a/doc/guides/rel_notes/release_23_11.rst b/doc/guides/rel_notes/release_23_11.rst
index 6b4dd21fd0..fd16d267ae 100644
--- a/doc/guides/rel_notes/release_23_11.rst
+++ b/doc/guides/rel_notes/release_23_11.rst
@@ -55,6 +55,13 @@ New Features
Also, make sure to start the actual text at the margin.
=======================================================
+* **Add mbufs recycling support.**
+
+ Added ``rte_eth_recycle_rx_queue_info_get`` and ``rte_eth_recycle_mbufs``
+ APIs which allow the user to copy used mbufs from the Tx mbuf ring
+ into the Rx mbuf ring. This feature supports the case that the Rx Ethernet
+ device is different from the Tx Ethernet device with respective driver
+ callback functions in ``rte_eth_recycle_mbufs``.
Removed Items
-------------
@@ -100,6 +107,14 @@ ABI Changes
Also, make sure to start the actual text at the margin.
=======================================================
+* ethdev: Added ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+ fields to ``rte_eth_dev`` structure.
+
+* ethdev: Structure ``rte_eth_fp_ops`` was affected to add
+ ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+ fields, to move ``rxq`` and ``txq`` fields, to change the size of
+ ``reserved1`` and ``reserved2`` fields.
+
Known Issues
------------
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 980f837ab6..b0c55a8523 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -58,6 +58,10 @@ struct rte_eth_dev {
eth_rx_descriptor_status_t rx_descriptor_status;
/** Check the status of a Tx descriptor */
eth_tx_descriptor_status_t tx_descriptor_status;
+ /** Pointer to PMD transmit mbufs reuse function */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+ /** Pointer to PMD receive descriptors refill function */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
/**
* Device data that is shared between primary and secondary processes
@@ -507,6 +511,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
+typedef void (*eth_recycle_rxq_info_get_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
uint16_t queue_id, struct rte_eth_burst_mode *mode);
@@ -1250,6 +1258,8 @@ struct eth_dev_ops {
eth_rxq_info_get_t rxq_info_get;
/** Retrieve Tx queue information */
eth_txq_info_get_t txq_info_get;
+ /** Retrieve mbufs recycle Rx queue information */
+ eth_recycle_rxq_info_get_t recycle_rxq_info_get;
eth_burst_mode_get_t rx_burst_mode_get; /**< Get Rx burst mode */
eth_burst_mode_get_t tx_burst_mode_get; /**< Get Tx burst mode */
eth_fw_version_get_t fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 14ec8c6ccf..f8ab64f195 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
fpo->rx_queue_count = dev->rx_queue_count;
fpo->rx_descriptor_status = dev->rx_descriptor_status;
fpo->tx_descriptor_status = dev->tx_descriptor_status;
+ fpo->recycle_tx_mbufs_reuse = dev->recycle_tx_mbufs_reuse;
+ fpo->recycle_rx_descriptors_refill = dev->recycle_rx_descriptors_refill;
fpo->rxq.data = dev->data->rx_queues;
fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 0840d2b594..ea89a101a1 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -5876,6 +5876,37 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
return 0;
}
+int
+rte_eth_recycle_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_dev *dev;
+
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+ dev = &rte_eth_devices[port_id];
+
+ if (queue_id >= dev->data->nb_rx_queues) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+ return -EINVAL;
+ }
+
+ if (dev->data->rx_queues == NULL ||
+ dev->data->rx_queues[queue_id] == NULL) {
+ RTE_ETHDEV_LOG(ERR,
+ "Rx queue %"PRIu16" of device with port_id=%"
+ PRIu16" has not been setup\n",
+ queue_id, port_id);
+ return -EINVAL;
+ }
+
+ if (*dev->dev_ops->recycle_rxq_info_get == NULL)
+ return -ENOTSUP;
+
+ dev->dev_ops->recycle_rxq_info_get(dev, queue_id, recycle_rxq_info);
+
+ return 0;
+}
+
int
rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04a2564f22..9dc5749d83 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1820,6 +1820,30 @@ struct rte_eth_txq_info {
uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
} __rte_cache_min_aligned;
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice.
+ *
+ * Ethernet device Rx queue information structure for recycling mbufs.
+ * Used to retrieve Rx queue information when the Tx queue is reusing mbufs and moving
+ * them into the Rx mbuf ring.
+ */
+struct rte_eth_recycle_rxq_info {
+ struct rte_mbuf **mbuf_ring; /**< mbuf ring of Rx queue. */
+ struct rte_mempool *mp; /**< mempool of Rx queue. */
+ uint16_t *refill_head; /**< head of Rx queue refilling mbufs. */
+ uint16_t *receive_tail; /**< tail of Rx queue receiving pkts. */
+ uint16_t mbuf_ring_size; /**< configured number of mbuf ring size. */
+ /**
+ * Requirement on mbuf refilling batch size of Rx mbuf ring.
+ * For some PMD drivers, the number of Rx mbuf ring refilling mbufs
+ * should be aligned with mbuf ring size, in order to simplify
+ * ring wrapping around.
+ * Value 0 means that PMD drivers have no requirement for this.
+ */
+ uint16_t refill_requirement;
+} __rte_cache_min_aligned;
+
/* Generic Burst mode flag definition, values can be ORed. */
/**
@@ -4853,6 +4877,31 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Retrieve information about a given port's Rx queue for recycling mbufs.
+ *
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The Rx queue on the Ethernet device for which information
+ * will be retrieved.
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* to be filled.
+ *
+ * @return
+ * - 0: Success
+ * - -ENODEV: If *port_id* is invalid.
+ * - -ENOTSUP: routine is not supported by the device PMD.
+ * - -EINVAL: The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_recycle_rx_queue_info_get(uint16_t port_id,
+ uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
/**
* Retrieve information about the Rx packet burst mode.
*
@@ -6527,6 +6576,138 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
}
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Recycle used mbufs from a transmit queue of an Ethernet device, and move
+ * these mbufs into a mbuf ring for a receive queue of an Ethernet device.
+ * This can bypass mempool path to save CPU cycles.
+ *
+ * The rte_eth_recycle_mbufs() function loops, with rte_eth_rx_burst() and
+ * rte_eth_tx_burst() functions, freeing Tx used mbufs and replenishing Rx
+ * descriptors. The number of recycling mbufs depends on the request of Rx mbuf
+ * ring, with the constraint of enough used mbufs from Tx mbuf ring.
+ *
+ * For each recycled mbuf, the rte_eth_recycle_mbufs() function performs the
+ * following operations:
+ *
+ * - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf ring.
+ *
+ * - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
+ * from the Tx mbuf ring.
+ *
+ * This function splits the Rx and Tx paths with different callback functions. The
+ * callback function recycle_tx_mbufs_reuse is for Tx driver. The callback
+ * function recycle_rx_descriptors_refill is for Rx driver. rte_eth_recycle_mbufs()
+ * can support the case that Rx Ethernet device is different from Tx Ethernet device.
+ *
+ * It is the responsibility of users to select the Rx/Tx queue pair to recycle
+ * mbufs. Before calling this function, users must call the rte_eth_recycle_rx_queue_info_get
+ * function to retrieve selected Rx queue information.
+ * @see rte_eth_recycle_rx_queue_info_get, struct rte_eth_recycle_rxq_info
+ *
+ * Currently, the rte_eth_recycle_mbufs() function can support feeding 1 Rx queue from
+ * 2 Tx queues in the same thread. Do not pair the Rx queue and Tx queue in different
+ * threads, in order to avoid memory error rewriting.
+ *
+ * @param rx_port_id
+ * Port identifying the receive side.
+ * @param rx_queue_id
+ * The index of the receive queue identifying the receive side.
+ * The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param tx_port_id
+ * Port identifying the transmit side.
+ * @param tx_queue_id
+ * The index of the transmit queue identifying the transmit side.
+ * The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* which contains
+ * the information of the Rx queue mbuf ring.
+ * @return
+ * The number of recycling mbufs.
+ */
+__rte_experimental
+static inline uint16_t
+rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
+ uint16_t tx_port_id, uint16_t tx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_fp_ops *p;
+ void *qd;
+ uint16_t nb_mbufs;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ if (tx_port_id >= RTE_MAX_ETHPORTS ||
+ tx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR,
+ "Invalid tx_port_id=%u or tx_queue_id=%u\n",
+ tx_port_id, tx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[tx_port_id];
+ qd = p->txq.data[tx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+ tx_queue_id, tx_port_id);
+ return 0;
+ }
+#endif
+ if (p->recycle_tx_mbufs_reuse == NULL)
+ return 0;
+
+ /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
+ * into Rx mbuf ring.
+ */
+ nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
+
+ /* If no recycling mbufs, return 0. */
+ if (nb_mbufs == 0)
+ return 0;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ if (rx_port_id >= RTE_MAX_ETHPORTS ||
+ rx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR, "Invalid rx_port_id=%u or rx_queue_id=%u\n",
+ rx_port_id, rx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[rx_port_id];
+ qd = p->rxq.data[rx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+ rx_queue_id, rx_port_id);
+ return 0;
+ }
+#endif
+
+ if (p->recycle_rx_descriptors_refill == NULL)
+ return 0;
+
+ /* Replenish the Rx descriptors with the recycling mbufs that were
+ * copied into the Rx mbuf ring.
+ */
+ p->recycle_rx_descriptors_refill(qd, nb_mbufs);
+
+ return nb_mbufs;
+}
+
/**
* @warning
* @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index 46e9721e07..a24ad7a6b2 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -55,6 +55,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
/** @internal Check the status of a Tx descriptor */
typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
+/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
+typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
+/** @internal Refill Rx descriptors with the recycling mbufs */
+typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
+
/**
* @internal
* Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -83,15 +90,17 @@ struct rte_eth_fp_ops {
* Rx fast-path functions and related data.
* 64-bit systems: occupies first 64B line
*/
+ /** Rx queues data. */
+ struct rte_ethdev_qdata rxq;
/** PMD receive function. */
eth_rx_burst_t rx_pkt_burst;
/** Get the number of used Rx descriptors. */
eth_rx_queue_count_t rx_queue_count;
/** Check the status of a Rx descriptor. */
eth_rx_descriptor_status_t rx_descriptor_status;
- /** Rx queues data. */
- struct rte_ethdev_qdata rxq;
- uintptr_t reserved1[3];
+ /** Refill Rx descriptors with the recycling mbufs. */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
+ uintptr_t reserved1[2];
/**@}*/
/**@{*/
@@ -99,15 +108,17 @@ struct rte_eth_fp_ops {
* Tx fast-path functions and related data.
* 64-bit systems: occupies second 64B line
*/
+ /** Tx queues data. */
+ struct rte_ethdev_qdata txq;
/** PMD transmit function. */
eth_tx_burst_t tx_pkt_burst;
/** PMD transmit prepare function. */
eth_tx_prep_t tx_pkt_prepare;
/** Check the status of a Tx descriptor. */
eth_tx_descriptor_status_t tx_descriptor_status;
- /** Tx queues data. */
- struct rte_ethdev_qdata txq;
- uintptr_t reserved2[3];
+ /** Copy used mbufs from Tx mbuf ring into Rx. */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+ uintptr_t reserved2[2];
/**@}*/
} __rte_cache_aligned;
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index b965d6aa52..e52c1563b4 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -312,6 +312,10 @@ EXPERIMENTAL {
rte_flow_async_action_list_handle_query_update;
rte_flow_async_actions_update;
rte_flow_restore_info_dynflag;
+
+ # added in 23.11
+ rte_eth_recycle_mbufs;
+ rte_eth_recycle_rx_queue_info_get;
};
INTERNAL {
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v8 2/4] net/i40e: implement mbufs recycle mode
2023-08-02 7:38 ` [PATCH v8 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
2023-08-02 7:38 ` [PATCH v8 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
@ 2023-08-02 7:38 ` Feifei Wang
2023-08-02 7:38 ` [PATCH v8 3/4] net/ixgbe: " Feifei Wang
2023-08-02 7:38 ` [PATCH v8 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 7:38 UTC (permalink / raw)
To: Yuying Zhang, Beilei Xing
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Define the specific function implementations for the i40e driver.
Currently, mbufs recycle mode can support the 128-bit
vector path and the avx2 path, and can be enabled both in
fast-free and no-fast-free mode.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
.../net/i40e/i40e_recycle_mbufs_vec_common.c | 147 ++++++++++++++++++
drivers/net/i40e/i40e_rxtx.c | 32 ++++
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/meson.build | 1 +
6 files changed, 187 insertions(+)
create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 8271bbb394..50ba9aac94 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
.flow_ops_get = i40e_dev_flow_ops_get,
.rxq_info_get = i40e_rxq_info_get,
.txq_info_get = i40e_txq_info_get,
+ .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
.rx_burst_mode_get = i40e_rx_burst_mode_get,
.tx_burst_mode_get = i40e_tx_burst_mode_get,
.timesync_enable = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 6f65d5e0ac..af758798e1 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_rxq_info *qinfo);
void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_burst_mode *mode);
int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..5663ecccde
--- /dev/null
+++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "base/i40e_prototype.h"
+#include "base/i40e_type.h"
+#include "i40e_ethdev.h"
+#include "i40e_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+ struct i40e_rx_queue *rxq = rx_queue;
+ struct i40e_rx_entry *rxep;
+ volatile union i40e_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64_t paddr;
+ uint64_t dma_addr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_mbufs; i++) {
+ /* Initialize rxdp descs. */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = rte_cpu_to_le_64(paddr);
+ /* flush desc with pa dma_addr */
+ rxdp[i].read.hdr_addr = 0;
+ rxdp[i].read.pkt_addr = dma_addr;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_mbufs;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = 0;
+ rx_id = rxq->nb_rx_desc - 1;
+ }
+
+ rxq->rxrearm_nb -= nb_mbufs;
+
+ rte_io_wmb();
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
+uint16_t
+i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_tx_queue *txq = tx_queue;
+ struct i40e_tx_entry *txep;
+ struct rte_mbuf **rxep;
+ int i, n;
+ uint16_t nb_recycle_mbufs;
+ uint16_t avail = 0;
+ uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+ uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+ uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+ uint16_t refill_head = *recycle_rxq_info->refill_head;
+ uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+ /* Get available recycling Rx buffers. */
+ avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+ /* Check Tx free thresh and Rx available space. */
+ if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+ nb_recycle_mbufs = n;
+
+ /* Mbufs recycle mode does not support wrapping around the ring buffer.
+ * There are two cases for this:
+ *
+ * case 1: The refill head of the Rx buffer ring needs to be aligned with
+ * the mbuf ring size. In this case, the number of Tx buffers being freed
+ * should be equal to refill_requirement.
+ *
+ * case 2: The refill head of the Rx buffer ring does not need to be aligned
+ * with the mbuf ring size. In this case, the update of the refill head
+ * cannot exceed the Rx mbuf ring size.
+ */
+ if (refill_requirement != n ||
+ (!refill_requirement && (refill_head + n > mbuf_ring_size)))
+ return 0;
+
+ /* First buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1).
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = recycle_rxq_info->mbuf_ring;
+ rxep += refill_head;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* Avoid the case where txq contains buffers from an unexpected mempool. */
+ if (unlikely(recycle_rxq_info->mp
+ != txep[0].mbuf->pool))
+ return 0;
+
+ /* Directly put mbufs from Tx to Rx. */
+ for (i = 0; i < n; i++)
+ rxep[i] = txep[i].mbuf;
+ } else {
+ for (i = 0; i < n; i++) {
+ rxep[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+ /* If a Tx buffer is not the last reference or comes from an
+ * unexpected mempool, the previously copied buffers are
+ * considered invalid.
+ */
+ if (unlikely((rxep[i] == NULL && refill_requirement) ||
+ recycle_rxq_info->mp != txep[i].mbuf->pool))
+ nb_recycle_mbufs = 0;
+ }
+ /* If Tx buffers are not the last reference or
+ * from unexpected mempool, all recycled buffers
+ * are put into mempool.
+ */
+ if (nb_recycle_mbufs == 0)
+ for (i = 0; i < n; i++) {
+ if (rxep[i] != NULL)
+ rte_mempool_put(rxep[i]->pool, rxep[i]);
+ }
+ }
+
+ /* Update counters for Tx. */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return nb_recycle_mbufs;
+}
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index b4f65b58fa..a9c9eb331c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3199,6 +3199,30 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.offloads = txq->offloads;
}
+void
+i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_rx_queue *rxq;
+ struct i40e_adapter *ad =
+ I40E_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+ recycle_rxq_info->mp = rxq->mp;
+ recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+ recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+ if (ad->rx_vec_allowed) {
+ recycle_rxq_info->refill_requirement = RTE_I40E_RXQ_REARM_THRESH;
+ recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+ } else {
+ recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+ recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+ }
+}
+
#ifdef RTE_ARCH_X86
static inline bool
get_avx_supported(bool request_avx512)
@@ -3293,6 +3317,8 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
dev->rx_pkt_burst = ad->rx_use_avx2 ?
i40e_recv_scattered_pkts_vec_avx2 :
i40e_recv_scattered_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
}
} else {
if (ad->rx_use_avx512) {
@@ -3311,9 +3337,12 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
dev->rx_pkt_burst = ad->rx_use_avx2 ?
i40e_recv_pkts_vec_avx2 :
i40e_recv_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
}
}
#else /* RTE_ARCH_X86 */
+ dev->recycle_rx_descriptors_refill = i40e_recycle_rx_descriptors_refill_vec;
if (dev->data->scattered_rx) {
PMD_INIT_LOG(DEBUG,
"Using Vector Scattered Rx (port %d).",
@@ -3481,15 +3510,18 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
dev->tx_pkt_burst = ad->tx_use_avx2 ?
i40e_xmit_pkts_vec_avx2 :
i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
}
#else /* RTE_ARCH_X86 */
PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
dev->data->port_id);
dev->tx_pkt_burst = i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
#endif /* RTE_ARCH_X86 */
} else {
PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
dev->tx_pkt_burst = i40e_xmit_pkts_simple;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
}
dev->tx_pkt_prepare = i40e_simple_prep_pkts;
} else {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index a8686224e5..b191f23e1f 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -236,6 +236,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
+uint16_t i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);
uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
diff --git a/drivers/net/i40e/meson.build b/drivers/net/i40e/meson.build
index 8e53b87a65..3b1a233c84 100644
--- a/drivers/net/i40e/meson.build
+++ b/drivers/net/i40e/meson.build
@@ -34,6 +34,7 @@ sources = files(
'i40e_tm.c',
'i40e_hash.c',
'i40e_vf_representor.c',
+ 'i40e_recycle_mbufs_vec_common.c',
'rte_pmd_i40e.c',
)
--
2.25.1
* [PATCH v8 3/4] net/ixgbe: implement mbufs recycle mode
2023-08-02 7:38 ` [PATCH v8 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
2023-08-02 7:38 ` [PATCH v8 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
2023-08-02 7:38 ` [PATCH v8 2/4] net/i40e: implement " Feifei Wang
@ 2023-08-02 7:38 ` Feifei Wang
2023-08-02 7:38 ` [PATCH v8 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 7:38 UTC (permalink / raw)
To: Qiming Yang, Wenjun Wu
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Define the specific function implementations for the ixgbe driver.
Currently, recycle buffer mode supports the 128-bit vector path,
and it can be enabled in both fast-free and non-fast-free mode.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
.../ixgbe/ixgbe_recycle_mbufs_vec_common.c | 143 ++++++++++++++++++
drivers/net/ixgbe/ixgbe_rxtx.c | 37 ++++-
drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
drivers/net/ixgbe/meson.build | 3 +
6 files changed, 189 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 14a7d571e0..ea4c9dd561 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -543,6 +543,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
.set_mc_addr_list = ixgbe_dev_set_mc_addr_list,
.rxq_info_get = ixgbe_rxq_info_get,
.txq_info_get = ixgbe_txq_info_get,
+ .recycle_rxq_info_get = ixgbe_recycle_rxq_info_get,
.timesync_enable = ixgbe_timesync_enable,
.timesync_disable = ixgbe_timesync_disable,
.timesync_read_rx_timestamp = ixgbe_timesync_read_rx_timestamp,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 1291e9099c..22fc3be3d8 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -626,6 +626,9 @@ void ixgbe_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
void ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
int ixgbevf_dev_rx_init(struct rte_eth_dev *dev);
void ixgbevf_dev_tx_init(struct rte_eth_dev *dev);
diff --git a/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..9a8cc86954
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "ixgbe_ethdev.h"
+#include "ixgbe_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+ struct ixgbe_rx_queue *rxq = rx_queue;
+ struct ixgbe_rx_entry *rxep;
+ volatile union ixgbe_adv_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64_t paddr;
+ uint64_t dma_addr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_mbufs; i++) {
+ /* Initialize rxdp descs. */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = rte_cpu_to_le_64(paddr);
+ /* Flush descriptors with pa dma_addr */
+ rxdp[i].read.hdr_addr = 0;
+ rxdp[i].read.pkt_addr = dma_addr;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_mbufs;
+ if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+ rxq->rxrearm_start = 0;
+
+ rxq->rxrearm_nb -= nb_mbufs;
+
+ rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+ (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+ /* Update the tail pointer on the NIC */
+ IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
+}
+
+uint16_t
+ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct ixgbe_tx_queue *txq = tx_queue;
+ struct ixgbe_tx_entry *txep;
+ struct rte_mbuf **rxep;
+ int i, n;
+ uint32_t status;
+ uint16_t nb_recycle_mbufs;
+ uint16_t avail = 0;
+ uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+ uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+ uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+ uint16_t refill_head = *recycle_rxq_info->refill_head;
+ uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+ /* Get available recycling Rx buffers. */
+ avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+ /* Check Tx free thresh and Rx available space. */
+ if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ status = txq->tx_ring[txq->tx_next_dd].wb.status;
+ if (!(status & IXGBE_ADVTXD_STAT_DD))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+ nb_recycle_mbufs = n;
+
+ /* Mbufs recycle does not support wrapping around the ring buffer.
+ * There are two cases for this:
+ *
+ * case 1: The refill head of the Rx buffer ring needs to be aligned with
+ * the buffer ring size. In this case, the number of Tx buffers being freed
+ * should be equal to refill_requirement.
+ *
+ * case 2: The refill head of the Rx buffer ring does not need to be aligned
+ * with the buffer ring size. In this case, the update of the refill head
+ * cannot exceed the Rx buffer ring size.
+ */
+ if (refill_requirement != n ||
+ (!refill_requirement && (refill_head + n > mbuf_ring_size)))
+ return 0;
+
+ /* First buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1).
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = recycle_rxq_info->mbuf_ring;
+ rxep += refill_head;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* Avoid the case where txq contains buffers from an unexpected mempool. */
+ if (unlikely(recycle_rxq_info->mp
+ != txep[0].mbuf->pool))
+ return 0;
+
+ /* Directly put mbufs from Tx to Rx. */
+ for (i = 0; i < n; i++)
+ rxep[i] = txep[i].mbuf;
+ } else {
+ for (i = 0; i < n; i++) {
+ rxep[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+ /* If a Tx buffer is not the last reference or comes from an
+ * unexpected mempool, the previously copied buffers are
+ * considered invalid.
+ */
+ if (unlikely((rxep[i] == NULL && refill_requirement) ||
+ recycle_rxq_info->mp != txep[i].mbuf->pool))
+ nb_recycle_mbufs = 0;
+ }
+ /* If Tx buffers are not the last reference or
+ * from unexpected mempool, all recycled buffers
+ * are put into mempool.
+ */
+ if (nb_recycle_mbufs == 0)
+ for (i = 0; i < n; i++) {
+ if (rxep[i] != NULL)
+ rte_mempool_put(rxep[i]->pool, rxep[i]);
+ }
+ }
+
+ /* Update counters for Tx. */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return nb_recycle_mbufs;
+}
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 954ef241a0..90b0a7004f 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2552,6 +2552,9 @@ ixgbe_set_tx_function(struct rte_eth_dev *dev, struct ixgbe_tx_queue *txq)
(rte_eal_process_type() != RTE_PROC_PRIMARY ||
ixgbe_txq_vec_setup(txq) == 0)) {
PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_tx_mbufs_reuse = ixgbe_recycle_tx_mbufs_reuse_vec;
+#endif
dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
} else
dev->tx_pkt_burst = ixgbe_xmit_pkts_simple;
@@ -4890,7 +4893,10 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
PMD_INIT_LOG(DEBUG, "Using Vector Scattered Rx "
"callback (port=%d).",
dev->data->port_id);
-
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_rx_descriptors_refill =
+ ixgbe_recycle_rx_descriptors_refill_vec;
+#endif
dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
} else if (adapter->rx_bulk_alloc_allowed) {
PMD_INIT_LOG(DEBUG, "Using a Scattered with bulk "
@@ -4919,7 +4925,9 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
"burst size no less than %d (port=%d).",
RTE_IXGBE_DESCS_PER_LOOP,
dev->data->port_id);
-
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_rx_descriptors_refill = ixgbe_recycle_rx_descriptors_refill_vec;
+#endif
dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
} else if (adapter->rx_bulk_alloc_allowed) {
PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
@@ -5691,6 +5699,31 @@ ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.tx_deferred_start = txq->tx_deferred_start;
}
+void
+ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct ixgbe_rx_queue *rxq;
+ struct ixgbe_adapter *adapter = dev->data->dev_private;
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+ recycle_rxq_info->mp = rxq->mb_pool;
+ recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+ recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+ if (adapter->rx_vec_allowed) {
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ recycle_rxq_info->refill_requirement = RTE_IXGBE_RXQ_REARM_THRESH;
+ recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+#endif
+ } else {
+ recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+ recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+ }
+}
+
/*
* [VF] Initializes Receive Unit.
*/
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 668a5b9814..ee89c89929 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -295,6 +295,10 @@ int ixgbe_dev_tx_done_cleanup(void *tx_queue, uint32_t free_cnt);
extern const uint32_t ptype_table[IXGBE_PACKET_TYPE_MAX];
extern const uint32_t ptype_table_tn[IXGBE_PACKET_TYPE_TN_MAX];
+uint16_t ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
uint16_t ixgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);
int ixgbe_txq_vec_setup(struct ixgbe_tx_queue *txq);
diff --git a/drivers/net/ixgbe/meson.build b/drivers/net/ixgbe/meson.build
index a18908ef7c..52e77190f7 100644
--- a/drivers/net/ixgbe/meson.build
+++ b/drivers/net/ixgbe/meson.build
@@ -17,6 +17,7 @@ sources = files(
'ixgbe_rxtx.c',
'ixgbe_tm.c',
'ixgbe_vf_representor.c',
+ 'ixgbe_recycle_mbufs_vec_common.c',
'rte_pmd_ixgbe.c',
)
@@ -26,11 +27,13 @@ deps += ['hash', 'security']
if arch_subdir == 'x86'
sources += files('ixgbe_rxtx_vec_sse.c')
+ sources += files('ixgbe_recycle_mbufs_vec_common.c')
if is_windows and cc.get_id() != 'clang'
cflags += ['-fno-asynchronous-unwind-tables']
endif
elif arch_subdir == 'arm'
sources += files('ixgbe_rxtx_vec_neon.c')
+ sources += files('ixgbe_recycle_mbufs_vec_common.c')
endif
includes += include_directories('base')
--
2.25.1
* [PATCH v8 4/4] app/testpmd: add recycle mbufs engine
2023-08-02 7:38 ` [PATCH v8 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
` (2 preceding siblings ...)
2023-08-02 7:38 ` [PATCH v8 3/4] net/ixgbe: " Feifei Wang
@ 2023-08-02 7:38 ` Feifei Wang
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 7:38 UTC (permalink / raw)
To: Aman Singh, Yuying Zhang; +Cc: dev, nd, Feifei Wang, Jerin Jacob, Ruifeng Wang
Add a recycle mbufs engine for testpmd. This engine forwards packets
in I/O forward mode, but enables the mbufs recycle feature to recycle
used txq mbufs into the rxq mbuf ring, which can bypass the mempool
path and save CPU cycles.
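As a usage illustration only (not part of the patch), once this engine is built
into testpmd it is selected like any other forwarding mode, following the
documentation updated below; 'set record-burst-stats on' is the existing
testpmd command referenced in the cover letter:

testpmd> set fwd recycle_mbufs
testpmd> start
testpmd> set record-burst-stats on

With burst stats recorded, a reduced ratio of 'Rx/Tx burst size of 32' is the
indicator the cover letter uses to show that recycling is saving CPU cycles.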
Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
app/test-pmd/meson.build | 1 +
app/test-pmd/recycle_mbufs.c | 58 +++++++++++++++++++++
app/test-pmd/testpmd.c | 1 +
app/test-pmd/testpmd.h | 3 ++
doc/guides/testpmd_app_ug/run_app.rst | 1 +
doc/guides/testpmd_app_ug/testpmd_funcs.rst | 5 +-
6 files changed, 68 insertions(+), 1 deletion(-)
create mode 100644 app/test-pmd/recycle_mbufs.c
diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index d2e3f60892..6e5f067274 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -22,6 +22,7 @@ sources = files(
'macswap.c',
'noisy_vnf.c',
'parameters.c',
+ 'recycle_mbufs.c',
'rxonly.c',
'shared_rxq_fwd.c',
'testpmd.c',
diff --git a/app/test-pmd/recycle_mbufs.c b/app/test-pmd/recycle_mbufs.c
new file mode 100644
index 0000000000..6e9e1c5eb6
--- /dev/null
+++ b/app/test-pmd/recycle_mbufs.c
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include "testpmd.h"
+
+/*
+ * Forwarding of packets in I/O mode.
+ * Enable mbufs recycle mode to recycle txq used mbufs
+ * for rxq mbuf ring. This can bypass mempool path and
+ * save CPU cycles.
+ */
+static bool
+pkt_burst_recycle_mbufs(struct fwd_stream *fs)
+{
+ struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+ uint16_t nb_rx;
+
+ /* Recycle used mbufs from the txq, and move these mbufs into
+ * the rxq mbuf ring.
+ */
+ rte_eth_recycle_mbufs(fs->rx_port, fs->rx_queue,
+ fs->tx_port, fs->tx_queue, &(fs->recycle_rxq_info));
+
+ /*
+ * Receive a burst of packets and forward them.
+ */
+ nb_rx = common_fwd_stream_receive(fs, pkts_burst, nb_pkt_per_burst);
+ if (unlikely(nb_rx == 0))
+ return false;
+
+ common_fwd_stream_transmit(fs, pkts_burst, nb_rx);
+
+ return true;
+}
+
+static void
+recycle_mbufs_stream_init(struct fwd_stream *fs)
+{
+ int rc;
+
+ /* Retrieve information about the given port's Rx queue
+ * for recycling mbufs.
+ */
+ rc = rte_eth_recycle_rx_queue_info_get(fs->rx_port,
+ fs->rx_queue, &(fs->recycle_rxq_info));
+ if (rc != 0)
+ TESTPMD_LOG(WARNING,
+ "Failed to get rx queue mbufs recycle info\n");
+
+ common_fwd_stream_init(fs);
+}
+
+struct fwd_engine recycle_mbufs_engine = {
+ .fwd_mode_name = "recycle_mbufs",
+ .stream_init = recycle_mbufs_stream_init,
+ .packet_fwd = pkt_burst_recycle_mbufs,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 938ca035d4..5d0f9ca119 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -199,6 +199,7 @@ struct fwd_engine * fwd_engines[] = {
&icmp_echo_engine,
&noisy_vnf_engine,
&five_tuple_swap_fwd_engine,
+ &recycle_mbufs_engine,
#ifdef RTE_LIBRTE_IEEE1588
&ieee1588_fwd_engine,
#endif
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f1df6a8faf..0eb8d7883a 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -188,6 +188,8 @@ struct fwd_stream {
struct pkt_burst_stats rx_burst_stats;
struct pkt_burst_stats tx_burst_stats;
struct fwd_lcore *lcore; /**< Lcore being scheduled. */
+ /**< Rx queue information for recycling mbufs */
+ struct rte_eth_recycle_rxq_info recycle_rxq_info;
};
/**
@@ -449,6 +451,7 @@ extern struct fwd_engine csum_fwd_engine;
extern struct fwd_engine icmp_echo_engine;
extern struct fwd_engine noisy_vnf_engine;
extern struct fwd_engine five_tuple_swap_fwd_engine;
+extern struct fwd_engine recycle_mbufs_engine;
#ifdef RTE_LIBRTE_IEEE1588
extern struct fwd_engine ieee1588_fwd_engine;
#endif
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 6e9c552e76..24a086401e 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -232,6 +232,7 @@ The command line options are:
noisy
5tswap
shared-rxq
+ recycle_mbufs
* ``--rss-ip``
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index a182479ab2..aef4de3e0e 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -318,7 +318,7 @@ set fwd
Set the packet forwarding mode::
testpmd> set fwd (io|mac|macswap|flowgen| \
- rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
+ rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq|recycle_mbufs) (""|retry)
``retry`` can be specified for forwarding engines except ``rx_only``.
@@ -364,6 +364,9 @@ The available information categories are:
* ``shared-rxq``: Receive only for shared Rx queue.
Resolve packet source port from mbuf and update stream statistics accordingly.
+* ``recycle_mbufs``: Recycle Tx queue used mbufs for Rx queue mbuf ring.
+ This mode uses fast path mbuf recycle feature and forwards packets in I/O mode.
+
Example::
testpmd> set fwd rxonly
--
2.25.1
* [PATCH v9 0/4] Recycle mbufs from Tx queue into Rx queue
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
` (8 preceding siblings ...)
2023-08-02 7:38 ` [PATCH v8 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
@ 2023-08-02 8:08 ` Feifei Wang
2023-08-02 8:08 ` [PATCH v9 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
` (3 more replies)
2023-08-04 9:24 ` [PATCH v10 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
` (3 subsequent siblings)
13 siblings, 4 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 8:08 UTC (permalink / raw)
Cc: dev, nd, Feifei Wang
Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.
This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache. i.e. the receive side will free the buffers from
transmit side directly into its software ring. This will avoid the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache. We call this mode 'mbufs recycle mode'.
In the latest version, mbufs recycle mode is packaged as a separate API.
This allows users to change the rxq/txq pairing in real time in the data plane,
according to the application's analysis of the packet flow, for example:
-----------------------------------------------------------------------
Step 1: the upper application analyses the flow direction
Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid)
Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info);
Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------
The above allows the user to change the rxq/txq pairing at run time without needing to
know the direction of the flow in advance, which effectively expands the use scenarios
of mbufs recycle mode. A minimal sketch of this loop is shown below.
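The following sketch assumes a single polling thread and two already-configured
and started ports; rx_port/rx_queue/tx_port/tx_queue and the burst size are
placeholders, and error handling is omitted:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

static void
recycle_fwd_loop(uint16_t rx_port, uint16_t rx_queue,
                 uint16_t tx_port, uint16_t tx_queue)
{
        struct rte_eth_recycle_rxq_info rxq_info;
        struct rte_mbuf *pkts[BURST];
        uint16_t nb_rx, nb_tx;

        /* Step 2: retrieve the Rx queue info once the pairing is decided. */
        if (rte_eth_recycle_rx_queue_info_get(rx_port, rx_queue, &rxq_info) != 0)
                return;

        for (;;) {
                /* Step 3: move used mbufs from the Tx sw-ring into the Rx mbuf ring. */
                rte_eth_recycle_mbufs(rx_port, rx_queue, tx_port, tx_queue,
                                &rxq_info);

                /* Steps 4 and 5: the normal receive and transmit bursts. */
                nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, BURST);
                if (nb_rx == 0)
                        continue;

                nb_tx = rte_eth_tx_burst(tx_port, tx_queue, pkts, nb_rx);
                while (nb_tx < nb_rx)
                        rte_pktmbuf_free(pkts[nb_tx++]);
        }
}

Because the whole sequence runs in one thread, no locking is needed, which matches
the single-thread constraint documented for rte_eth_recycle_mbufs().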
Furthermore, mbufs recycle mode is no longer limited to a single PMD: it can move
mbufs between PMDs from different vendors, and can even place the mbufs anywhere
into your Rx mbuf ring as long as the address of the mbuf ring can be provided.
In the latest version, we enable mbufs recycle mode in the i40e and ixgbe PMDs. We also
tried using the i40e driver on Rx and the ixgbe driver on Tx, and achieved a 7-9%
performance improvement with mbufs recycle mode.
Difference between mbuf recycle, ZC API used in mempool and general path
For general path:
Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For mbufs recycle:
Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus, in one loop, mbufs recycle mode saves 32+32=64 packet-pointer copies compared to the general path,
and saves 32 packet-pointer copies per loop compared to the ZC API used in the mempool.
So, mbufs recycle has its own benefits.
Testing status:
(1) dpdk l3fwd test with multiple drivers:
port 0: 82599 NIC port 1: XL710 NIC
-------------------------------------------------------------
Without fast free With fast free
Thunderx2: +7.53% +13.54%
-------------------------------------------------------------
(2) dpdk l3fwd test with same driver:
port 0 && 1: XL710 NIC
-------------------------------------------------------------
Without fast free With fast free
Ampere altra: +12.61% +11.42%
n1sdp: +8.30% +3.85%
x86-sse: +8.43% +3.72%
-------------------------------------------------------------
(3) Performance comparison with ZC_mempool used
port 0 && 1: XL710 NIC
with fast free
-------------------------------------------------------------
With recycle buffer With zc_mempool
Ampere altra: 11.42% 3.54%
-------------------------------------------------------------
Furthermore, we add a recycle_mbufs engine in testpmd. Because the XL710 NIC is the
I/O bottleneck for testpmd on Ampere Altra, we cannot see a throughput change
compared with the I/O fwd engine. However, using the record command in testpmd:
'set record-burst-stats on'
we can see that the ratio of 'Rx/Tx burst size of 32' is reduced, which
indicates that mbufs recycle can save CPU cycles.
V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
V3:
1. Separate Rx and Tx operations with two APIs in direct-rearm (Konstantin)
2. Delete L3fwd change for direct rearm (Jerin)
3. enable direct rearm in ixgbe driver in Arm
v4:
1. Rename direct-rearm as buffer recycle. Based on this, function name
and variable name are changed to let this mode more general for all
drivers. (Konstantin, Morten)
2. Add ring wrapping check (Konstantin)
v5:
1. some change for ethdev API (Morten)
2. add support for avx2, sse, altivec path
v6:
1. fix ixgbe build issue in ppc
2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
API wrapper (Tech Board meeting)
3. add recycle_mbufs engine in testpmd (Tech Board meeting)
4. add namespace in the functions related to mbufs recycle(Ferruh)
v7:
1. move 'rxq/txq data' pointers to the beginning of eth_dev structure,
in order to keep them in the same cache line as rx/tx_burst function
pointers (Morten)
2. add the extra description for 'rte_eth_recycle_mbufs' to show it can
support feeding 1 Rx queue from 2 Tx queues in the same thread
(Konstantin)
3. For i40e/ixgbe driver, make the previous copied buffers as invalid if
there are Tx buffers refcnt > 1 or from unexpected mempool (Konstantin)
4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
in testpmd fwd engine (Morten)
v8:
1. add arm/x86 build option to fix ixgbe build issue in ppc
v9:
1. delete duplicate file name for ixgbe
Feifei Wang (4):
ethdev: add API for mbufs recycle mode
net/i40e: implement mbufs recycle mode
net/ixgbe: implement mbufs recycle mode
app/testpmd: add recycle mbufs engine
app/test-pmd/meson.build | 1 +
app/test-pmd/recycle_mbufs.c | 58 ++++++
app/test-pmd/testpmd.c | 1 +
app/test-pmd/testpmd.h | 3 +
doc/guides/rel_notes/release_23_11.rst | 15 ++
doc/guides/testpmd_app_ug/run_app.rst | 1 +
doc/guides/testpmd_app_ug/testpmd_funcs.rst | 5 +-
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
.../net/i40e/i40e_recycle_mbufs_vec_common.c | 147 ++++++++++++++
drivers/net/i40e/i40e_rxtx.c | 32 ++++
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/meson.build | 1 +
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
.../ixgbe/ixgbe_recycle_mbufs_vec_common.c | 143 ++++++++++++++
drivers/net/ixgbe/ixgbe_rxtx.c | 37 +++-
drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
drivers/net/ixgbe/meson.build | 2 +
lib/ethdev/ethdev_driver.h | 10 +
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 31 +++
lib/ethdev/rte_ethdev.h | 181 ++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 23 ++-
lib/ethdev/version.map | 4 +
25 files changed, 703 insertions(+), 9 deletions(-)
create mode 100644 app/test-pmd/recycle_mbufs.c
create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
--
2.25.1
* [PATCH v9 1/4] ethdev: add API for mbufs recycle mode
2023-08-02 8:08 ` [PATCH v9 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
@ 2023-08-02 8:08 ` Feifei Wang
2023-08-02 8:08 ` [PATCH v9 2/4] net/i40e: implement " Feifei Wang
` (2 subsequent siblings)
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 8:08 UTC (permalink / raw)
To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang,
Morten Brørup
Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
APIs to recycle used mbufs from a transmit queue of an Ethernet device,
and move these mbufs into a mbuf ring for a receive queue of an Ethernet
device. This can bypass mempool 'put/get' operations hence saving CPU
cycles.
For the mbufs being recycled, the rte_eth_recycle_mbufs() function performs
the following operations:
- Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf
ring.
- Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
from the Tx mbuf ring.
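Internally, the patch adds one slow-path dev op (recycle_rxq_info_get) and two
fast-path hooks (recycle_tx_mbufs_reuse and recycle_rx_descriptors_refill) that
eth_dev_fp_ops_setup() copies into rte_eth_fp_ops. The fragment below is only a
schematic of how a driver wires these up; 'foo' and its empty stub callbacks are
hypothetical and stand in for a real PMD implementation:

#include <stdint.h>
#include <ethdev_driver.h>

/* Hypothetical stubs standing in for a real driver's implementations. */
static uint16_t
foo_recycle_tx_mbufs_reuse(void *txq,
                struct rte_eth_recycle_rxq_info *recycle_rxq_info)
{
        (void)txq;
        (void)recycle_rxq_info;
        return 0;
}

static void
foo_recycle_rx_descriptors_refill(void *rxq, uint16_t nb_mbufs)
{
        (void)rxq;
        (void)nb_mbufs;
}

static void
foo_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
                struct rte_eth_recycle_rxq_info *recycle_rxq_info)
{
        (void)dev;
        (void)queue_id;
        (void)recycle_rxq_info;
}

/* Slow path: reached through rte_eth_recycle_rx_queue_info_get(). */
static const struct eth_dev_ops foo_eth_dev_ops = {
        .recycle_rxq_info_get = foo_recycle_rxq_info_get,
};

/* Fast path: copied into rte_eth_fp_ops by eth_dev_fp_ops_setup(). */
static void
foo_dev_init_callbacks(struct rte_eth_dev *dev)
{
        dev->dev_ops = &foo_eth_dev_ops;
        dev->recycle_tx_mbufs_reuse = foo_recycle_tx_mbufs_reuse;
        dev->recycle_rx_descriptors_refill = foo_recycle_rx_descriptors_refill;
}

The i40e and ixgbe patches later in the series provide the real implementations
of these three callbacks.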
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/rel_notes/release_23_11.rst | 15 ++
lib/ethdev/ethdev_driver.h | 10 ++
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 31 +++++
lib/ethdev/rte_ethdev.h | 181 +++++++++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 23 +++-
lib/ethdev/version.map | 4 +
7 files changed, 260 insertions(+), 6 deletions(-)
diff --git a/doc/guides/rel_notes/release_23_11.rst b/doc/guides/rel_notes/release_23_11.rst
index 6b4dd21fd0..fd16d267ae 100644
--- a/doc/guides/rel_notes/release_23_11.rst
+++ b/doc/guides/rel_notes/release_23_11.rst
@@ -55,6 +55,13 @@ New Features
Also, make sure to start the actual text at the margin.
=======================================================
+* **Add mbufs recycling support.**
+
+ Added ``rte_eth_recycle_rx_queue_info_get`` and ``rte_eth_recycle_mbufs``
+ APIs which allow the user to copy used mbufs from the Tx mbuf ring
+ into the Rx mbuf ring. This feature supports the case that the Rx Ethernet
+ device is different from the Tx Ethernet device with respective driver
+ callback functions in ``rte_eth_recycle_mbufs``.
Removed Items
-------------
@@ -100,6 +107,14 @@ ABI Changes
Also, make sure to start the actual text at the margin.
=======================================================
+* ethdev: Added ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+ fields to ``rte_eth_dev`` structure.
+
+* ethdev: Structure ``rte_eth_fp_ops`` was affected to add
+ ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+ fields, to move ``rxq`` and ``txq`` fields, to change the size of
+ ``reserved1`` and ``reserved2`` fields.
+
Known Issues
------------
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 980f837ab6..b0c55a8523 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -58,6 +58,10 @@ struct rte_eth_dev {
eth_rx_descriptor_status_t rx_descriptor_status;
/** Check the status of a Tx descriptor */
eth_tx_descriptor_status_t tx_descriptor_status;
+ /** Pointer to PMD transmit mbufs reuse function */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+ /** Pointer to PMD receive descriptors refill function */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
/**
* Device data that is shared between primary and secondary processes
@@ -507,6 +511,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
+typedef void (*eth_recycle_rxq_info_get_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
uint16_t queue_id, struct rte_eth_burst_mode *mode);
@@ -1250,6 +1258,8 @@ struct eth_dev_ops {
eth_rxq_info_get_t rxq_info_get;
/** Retrieve Tx queue information */
eth_txq_info_get_t txq_info_get;
+ /** Retrieve mbufs recycle Rx queue information */
+ eth_recycle_rxq_info_get_t recycle_rxq_info_get;
eth_burst_mode_get_t rx_burst_mode_get; /**< Get Rx burst mode */
eth_burst_mode_get_t tx_burst_mode_get; /**< Get Tx burst mode */
eth_fw_version_get_t fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 14ec8c6ccf..f8ab64f195 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
fpo->rx_queue_count = dev->rx_queue_count;
fpo->rx_descriptor_status = dev->rx_descriptor_status;
fpo->tx_descriptor_status = dev->tx_descriptor_status;
+ fpo->recycle_tx_mbufs_reuse = dev->recycle_tx_mbufs_reuse;
+ fpo->recycle_rx_descriptors_refill = dev->recycle_rx_descriptors_refill;
fpo->rxq.data = dev->data->rx_queues;
fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 0840d2b594..ea89a101a1 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -5876,6 +5876,37 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
return 0;
}
+int
+rte_eth_recycle_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_dev *dev;
+
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+ dev = &rte_eth_devices[port_id];
+
+ if (queue_id >= dev->data->nb_rx_queues) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+ return -EINVAL;
+ }
+
+ if (dev->data->rx_queues == NULL ||
+ dev->data->rx_queues[queue_id] == NULL) {
+ RTE_ETHDEV_LOG(ERR,
+ "Rx queue %"PRIu16" of device with port_id=%"
+ PRIu16" has not been setup\n",
+ queue_id, port_id);
+ return -EINVAL;
+ }
+
+ if (*dev->dev_ops->recycle_rxq_info_get == NULL)
+ return -ENOTSUP;
+
+ dev->dev_ops->recycle_rxq_info_get(dev, queue_id, recycle_rxq_info);
+
+ return 0;
+}
+
int
rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04a2564f22..9dc5749d83 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1820,6 +1820,30 @@ struct rte_eth_txq_info {
uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
} __rte_cache_min_aligned;
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice.
+ *
+ * Ethernet device Rx queue information structure for recycling mbufs.
+ * Used to retrieve Rx queue information when the Tx queue is reusing mbufs and
+ * moving them into the Rx mbuf ring.
+ */
+struct rte_eth_recycle_rxq_info {
+ struct rte_mbuf **mbuf_ring; /**< mbuf ring of Rx queue. */
+ struct rte_mempool *mp; /**< mempool of Rx queue. */
+ uint16_t *refill_head; /**< head of Rx queue refilling mbufs. */
+ uint16_t *receive_tail; /**< tail of Rx queue receiving pkts. */
+ uint16_t mbuf_ring_size; /**< configured number of mbuf ring size. */
+ /**
+ * Requirement on mbuf refilling batch size of Rx mbuf ring.
+ * For some PMD drivers, the number of Rx mbuf ring refilling mbufs
+ * should be aligned with mbuf ring size, in order to simplify
+ * ring wrapping around.
+ * Value 0 means that PMD drivers have no requirement for this.
+ */
+ uint16_t refill_requirement;
+} __rte_cache_min_aligned;
+
/* Generic Burst mode flag definition, values can be ORed. */
/**
@@ -4853,6 +4877,31 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Retrieve information about a given port's Rx queue for recycling mbufs.
+ *
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The Rx queue on the Ethernet device for which information
+ * will be retrieved.
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* to be filled.
+ *
+ * @return
+ * - 0: Success
+ * - -ENODEV: If *port_id* is invalid.
+ * - -ENOTSUP: routine is not supported by the device PMD.
+ * - -EINVAL: The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_recycle_rx_queue_info_get(uint16_t port_id,
+ uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
/**
* Retrieve information about the Rx packet burst mode.
*
@@ -6527,6 +6576,138 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
}
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Recycle used mbufs from a transmit queue of an Ethernet device, and move
+ * these mbufs into a mbuf ring for a receive queue of an Ethernet device.
+ * This can bypass mempool path to save CPU cycles.
+ *
+ * The rte_eth_recycle_mbufs() function is called in a loop, together with the
+ * rte_eth_rx_burst() and rte_eth_tx_burst() functions, freeing used Tx mbufs and
+ * replenishing Rx descriptors. The number of recycled mbufs depends on how many
+ * the Rx mbuf ring requests, constrained by the number of used mbufs available
+ * in the Tx mbuf ring.
+ *
+ * For the mbufs being recycled, the rte_eth_recycle_mbufs() function performs the
+ * following operations:
+ *
+ * - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf ring.
+ *
+ * - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
+ * from the Tx mbuf ring.
+ *
+ * This function splits the Rx and Tx paths with different callback functions. The
+ * callback function recycle_tx_mbufs_reuse is for the Tx driver. The callback
+ * function recycle_rx_descriptors_refill is for the Rx driver. rte_eth_recycle_mbufs()
+ * supports the case where the Rx Ethernet device is different from the Tx Ethernet device.
+ *
+ * It is the responsibility of users to select the Rx/Tx queue pair to recycle
+ * mbufs. Before calling this function, users must call the rte_eth_recycle_rxq_info_get
+ * function to retrieve the selected Rx queue information.
+ * @see rte_eth_recycle_rxq_info_get, struct rte_eth_recycle_rxq_info
+ *
+ * Currently, the rte_eth_recycle_mbufs() function can feed 1 Rx queue from
+ * 2 Tx queues in the same thread. Do not pair an Rx queue and a Tx queue across
+ * different threads, in order to avoid erroneous memory rewrites.
+ *
+ * @param rx_port_id
+ * Port identifying the receive side.
+ * @param rx_queue_id
+ * The index of the receive queue identifying the receive side.
+ * The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param tx_port_id
+ * Port identifying the transmit side.
+ * @param tx_queue_id
+ * The index of the transmit queue identifying the transmit side.
+ * The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* which contains
+ * the information of the Rx queue mbuf ring.
+ * @return
+ * The number of recycling mbufs.
+ */
+__rte_experimental
+static inline uint16_t
+rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
+ uint16_t tx_port_id, uint16_t tx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_fp_ops *p;
+ void *qd;
+ uint16_t nb_mbufs;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ if (tx_port_id >= RTE_MAX_ETHPORTS ||
+ tx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR,
+ "Invalid tx_port_id=%u or tx_queue_id=%u\n",
+ tx_port_id, tx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[tx_port_id];
+ qd = p->txq.data[tx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+ tx_queue_id, tx_port_id);
+ return 0;
+ }
+#endif
+ if (p->recycle_tx_mbufs_reuse == NULL)
+ return 0;
+
+ /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
+ * into Rx mbuf ring.
+ */
+ nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
+
+ /* If no recycling mbufs, return 0. */
+ if (nb_mbufs == 0)
+ return 0;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ if (rx_port_id >= RTE_MAX_ETHPORTS ||
+ rx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR, "Invalid rx_port_id=%u or rx_queue_id=%u\n",
+ rx_port_id, rx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[rx_port_id];
+ qd = p->rxq.data[rx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+ rx_queue_id, rx_port_id);
+ return 0;
+ }
+#endif
+
+ if (p->recycle_rx_descriptors_refill == NULL)
+ return 0;
+
+ /* Replenish the Rx descriptors with the recycled mbufs
+ * in the Rx mbuf ring.
+ */
+ p->recycle_rx_descriptors_refill(qd, nb_mbufs);
+
+ return nb_mbufs;
+}
+
/**
* @warning
* @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index 46e9721e07..a24ad7a6b2 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -55,6 +55,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
/** @internal Check the status of a Tx descriptor */
typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
+/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
+typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
+/** @internal Refill Rx descriptors with the recycling mbufs */
+typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
+
/**
* @internal
* Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -83,15 +90,17 @@ struct rte_eth_fp_ops {
* Rx fast-path functions and related data.
* 64-bit systems: occupies first 64B line
*/
+ /** Rx queues data. */
+ struct rte_ethdev_qdata rxq;
/** PMD receive function. */
eth_rx_burst_t rx_pkt_burst;
/** Get the number of used Rx descriptors. */
eth_rx_queue_count_t rx_queue_count;
/** Check the status of a Rx descriptor. */
eth_rx_descriptor_status_t rx_descriptor_status;
- /** Rx queues data. */
- struct rte_ethdev_qdata rxq;
- uintptr_t reserved1[3];
+ /** Refill Rx descriptors with the recycling mbufs. */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
+ uintptr_t reserved1[2];
/**@}*/
/**@{*/
@@ -99,15 +108,17 @@ struct rte_eth_fp_ops {
* Tx fast-path functions and related data.
* 64-bit systems: occupies second 64B line
*/
+ /** Tx queues data. */
+ struct rte_ethdev_qdata txq;
/** PMD transmit function. */
eth_tx_burst_t tx_pkt_burst;
/** PMD transmit prepare function. */
eth_tx_prep_t tx_pkt_prepare;
/** Check the status of a Tx descriptor. */
eth_tx_descriptor_status_t tx_descriptor_status;
- /** Tx queues data. */
- struct rte_ethdev_qdata txq;
- uintptr_t reserved2[3];
+ /** Copy used mbufs from Tx mbuf ring into Rx. */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+ uintptr_t reserved2[2];
/**@}*/
} __rte_cache_aligned;
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index b965d6aa52..e52c1563b4 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -312,6 +312,10 @@ EXPERIMENTAL {
rte_flow_async_action_list_handle_query_update;
rte_flow_async_actions_update;
rte_flow_restore_info_dynflag;
+
+ # added in 23.11
+ rte_eth_recycle_mbufs;
+ rte_eth_recycle_rx_queue_info_get;
};
INTERNAL {
--
2.25.1
* [PATCH v9 2/4] net/i40e: implement mbufs recycle mode
2023-08-02 8:08 ` [PATCH v9 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
2023-08-02 8:08 ` [PATCH v9 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
@ 2023-08-02 8:08 ` Feifei Wang
2023-08-02 8:08 ` [PATCH v9 3/4] net/ixgbe: " Feifei Wang
2023-08-02 8:08 ` [PATCH v9 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 8:08 UTC (permalink / raw)
To: Yuying Zhang, Beilei Xing
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Define the specific function implementations for the i40e driver.
Currently, mbufs recycle mode supports the 128-bit vector path and
the avx2 path, and it can be enabled in both fast-free and
non-fast-free mode.
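One detail worth noting in the code below: i40e_recycle_tx_mbufs_reuse_vec()
computes the number of Rx ring slots available for refilling as
(mbuf_ring_size - (refill_head - receive_tail)) & mask, which relies on the
mbuf ring size being a power of two. A standalone illustration of that
arithmetic, with made-up index values (a sketch, not part of the patch):

#include <stdint.h>
#include <stdio.h>

static uint16_t
ring_avail(uint16_t ring_size, uint16_t refill_head, uint16_t receive_tail)
{
        uint16_t mask = ring_size - 1;

        /* Same formula as the driver: count of consumed slots awaiting refill. */
        return (ring_size - (refill_head - receive_tail)) & mask;
}

int main(void)
{
        /* Slots [10, 900) have been received and wait for refill: 890 slots. */
        printf("%u\n", ring_avail(1024, 10, 900));
        /* receive_tail wrapped back to 0 while refill_head is at 720: 304 slots. */
        printf("%u\n", ring_avail(1024, 720, 0));
        return 0;
}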
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
.../net/i40e/i40e_recycle_mbufs_vec_common.c | 147 ++++++++++++++++++
drivers/net/i40e/i40e_rxtx.c | 32 ++++
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/meson.build | 1 +
6 files changed, 187 insertions(+)
create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 8271bbb394..50ba9aac94 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
.flow_ops_get = i40e_dev_flow_ops_get,
.rxq_info_get = i40e_rxq_info_get,
.txq_info_get = i40e_txq_info_get,
+ .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
.rx_burst_mode_get = i40e_rx_burst_mode_get,
.tx_burst_mode_get = i40e_tx_burst_mode_get,
.timesync_enable = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 6f65d5e0ac..af758798e1 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_rxq_info *qinfo);
void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_burst_mode *mode);
int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..5663ecccde
--- /dev/null
+++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "base/i40e_prototype.h"
+#include "base/i40e_type.h"
+#include "i40e_ethdev.h"
+#include "i40e_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+ struct i40e_rx_queue *rxq = rx_queue;
+ struct i40e_rx_entry *rxep;
+ volatile union i40e_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64_t paddr;
+ uint64_t dma_addr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_mbufs; i++) {
+ /* Initialize rxdp descs. */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = rte_cpu_to_le_64(paddr);
+ /* flush desc with pa dma_addr */
+ rxdp[i].read.hdr_addr = 0;
+ rxdp[i].read.pkt_addr = dma_addr;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_mbufs;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = 0;
+ rx_id = rxq->nb_rx_desc - 1;
+ }
+
+ rxq->rxrearm_nb -= nb_mbufs;
+
+ rte_io_wmb();
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
+uint16_t
+i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_tx_queue *txq = tx_queue;
+ struct i40e_tx_entry *txep;
+ struct rte_mbuf **rxep;
+ int i, n;
+ uint16_t nb_recycle_mbufs;
+ uint16_t avail = 0;
+ uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+ uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+ uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+ uint16_t refill_head = *recycle_rxq_info->refill_head;
+ uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+ /* Get available recycling Rx buffers. */
+ avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+ /* Check Tx free thresh and Rx available space. */
+ if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+ nb_recycle_mbufs = n;
+
+ /* Mbufs recycle mode does not support wrapping around the ring buffer.
+ * There are two cases for this:
+ *
+ * case 1: The refill head of the Rx buffer ring needs to be aligned with
+ * the mbuf ring size. In this case, the number of Tx buffers being freed
+ * should be equal to refill_requirement.
+ *
+ * case 2: The refill head of the Rx buffer ring does not need to be aligned
+ * with the mbuf ring size. In this case, the update of the refill head
+ * cannot exceed the Rx mbuf ring size.
+ */
+ if (refill_requirement != n ||
+ (!refill_requirement && (refill_head + n > mbuf_ring_size)))
+ return 0;
+
+ /* First buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1).
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = recycle_rxq_info->mbuf_ring;
+ rxep += refill_head;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* Avoid the case where txq contains buffers from an unexpected mempool. */
+ if (unlikely(recycle_rxq_info->mp
+ != txep[0].mbuf->pool))
+ return 0;
+
+ /* Directly put mbufs from Tx to Rx. */
+ for (i = 0; i < n; i++)
+ rxep[i] = txep[i].mbuf;
+ } else {
+ for (i = 0; i < n; i++) {
+ rxep[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+ /* If a Tx buffer is not the last reference or comes from an
+ * unexpected mempool, the previously copied buffers are
+ * considered invalid.
+ */
+ if (unlikely((rxep[i] == NULL && refill_requirement) ||
+ recycle_rxq_info->mp != txep[i].mbuf->pool))
+ nb_recycle_mbufs = 0;
+ }
+ /* If Tx buffers are not the last reference or
+ * from unexpected mempool, all recycled buffers
+ * are put into mempool.
+ */
+ if (nb_recycle_mbufs == 0)
+ for (i = 0; i < n; i++) {
+ if (rxep[i] != NULL)
+ rte_mempool_put(rxep[i]->pool, rxep[i]);
+ }
+ }
+
+ /* Update counters for Tx. */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return nb_recycle_mbufs;
+}
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index b4f65b58fa..a9c9eb331c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3199,6 +3199,30 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.offloads = txq->offloads;
}
+void
+i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_rx_queue *rxq;
+ struct i40e_adapter *ad =
+ I40E_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+ recycle_rxq_info->mp = rxq->mp;
+ recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+ recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+ if (ad->rx_vec_allowed) {
+ recycle_rxq_info->refill_requirement = RTE_I40E_RXQ_REARM_THRESH;
+ recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+ } else {
+ recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+ recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+ }
+}
+
#ifdef RTE_ARCH_X86
static inline bool
get_avx_supported(bool request_avx512)
@@ -3293,6 +3317,8 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
dev->rx_pkt_burst = ad->rx_use_avx2 ?
i40e_recv_scattered_pkts_vec_avx2 :
i40e_recv_scattered_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
}
} else {
if (ad->rx_use_avx512) {
@@ -3311,9 +3337,12 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
dev->rx_pkt_burst = ad->rx_use_avx2 ?
i40e_recv_pkts_vec_avx2 :
i40e_recv_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
}
}
#else /* RTE_ARCH_X86 */
+ dev->recycle_rx_descriptors_refill = i40e_recycle_rx_descriptors_refill_vec;
if (dev->data->scattered_rx) {
PMD_INIT_LOG(DEBUG,
"Using Vector Scattered Rx (port %d).",
@@ -3481,15 +3510,18 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
dev->tx_pkt_burst = ad->tx_use_avx2 ?
i40e_xmit_pkts_vec_avx2 :
i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
}
#else /* RTE_ARCH_X86 */
PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
dev->data->port_id);
dev->tx_pkt_burst = i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
#endif /* RTE_ARCH_X86 */
} else {
PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
dev->tx_pkt_burst = i40e_xmit_pkts_simple;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
}
dev->tx_pkt_prepare = i40e_simple_prep_pkts;
} else {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index a8686224e5..b191f23e1f 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -236,6 +236,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
+uint16_t i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);
uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
diff --git a/drivers/net/i40e/meson.build b/drivers/net/i40e/meson.build
index 8e53b87a65..3b1a233c84 100644
--- a/drivers/net/i40e/meson.build
+++ b/drivers/net/i40e/meson.build
@@ -34,6 +34,7 @@ sources = files(
'i40e_tm.c',
'i40e_hash.c',
'i40e_vf_representor.c',
+ 'i40e_recycle_mbufs_vec_common.c',
'rte_pmd_i40e.c',
)
--
2.25.1
* [PATCH v9 3/4] net/ixgbe: implement mbufs recycle mode
2023-08-02 8:08 ` [PATCH v9 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
2023-08-02 8:08 ` [PATCH v9 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
2023-08-02 8:08 ` [PATCH v9 2/4] net/i40e: implement " Feifei Wang
@ 2023-08-02 8:08 ` Feifei Wang
2023-08-02 8:08 ` [PATCH v9 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 8:08 UTC (permalink / raw)
To: Qiming Yang, Wenjun Wu
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Define the specific function implementations for the ixgbe driver.
Currently, recycle buffer mode supports the 128-bit vector path
and can be enabled in both fast-free and non-fast-free modes.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
.../ixgbe/ixgbe_recycle_mbufs_vec_common.c | 143 ++++++++++++++++++
drivers/net/ixgbe/ixgbe_rxtx.c | 37 ++++-
drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
drivers/net/ixgbe/meson.build | 2 +
6 files changed, 188 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 14a7d571e0..ea4c9dd561 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -543,6 +543,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
.set_mc_addr_list = ixgbe_dev_set_mc_addr_list,
.rxq_info_get = ixgbe_rxq_info_get,
.txq_info_get = ixgbe_txq_info_get,
+ .recycle_rxq_info_get = ixgbe_recycle_rxq_info_get,
.timesync_enable = ixgbe_timesync_enable,
.timesync_disable = ixgbe_timesync_disable,
.timesync_read_rx_timestamp = ixgbe_timesync_read_rx_timestamp,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 1291e9099c..22fc3be3d8 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -626,6 +626,9 @@ void ixgbe_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
void ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
int ixgbevf_dev_rx_init(struct rte_eth_dev *dev);
void ixgbevf_dev_tx_init(struct rte_eth_dev *dev);
diff --git a/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..9a8cc86954
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "ixgbe_ethdev.h"
+#include "ixgbe_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+ struct ixgbe_rx_queue *rxq = rx_queue;
+ struct ixgbe_rx_entry *rxep;
+ volatile union ixgbe_adv_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64_t paddr;
+ uint64_t dma_addr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_mbufs; i++) {
+ /* Initialize rxdp descs. */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = rte_cpu_to_le_64(paddr);
+ /* Flush descriptors with pa dma_addr */
+ rxdp[i].read.hdr_addr = 0;
+ rxdp[i].read.pkt_addr = dma_addr;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_mbufs;
+ if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+ rxq->rxrearm_start = 0;
+
+ rxq->rxrearm_nb -= nb_mbufs;
+
+ rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+ (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+ /* Update the tail pointer on the NIC */
+ IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
+}
+
+uint16_t
+ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct ixgbe_tx_queue *txq = tx_queue;
+ struct ixgbe_tx_entry *txep;
+ struct rte_mbuf **rxep;
+ int i, n;
+ uint32_t status;
+ uint16_t nb_recycle_mbufs;
+ uint16_t avail = 0;
+ uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+ uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+ uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+ uint16_t refill_head = *recycle_rxq_info->refill_head;
+ uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+ /* Get available recycling Rx buffers. */
+ avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+ /* Check Tx free thresh and Rx available space. */
+ if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ status = txq->tx_ring[txq->tx_next_dd].wb.status;
+ if (!(status & IXGBE_ADVTXD_STAT_DD))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+ nb_recycle_mbufs = n;
+
+ /* Mbufs recycle can only support no ring buffer wrapping around.
+ * Two case for this:
+ *
+ * case 1: The refill head of Rx buffer ring needs to be aligned with
+ * buffer ring size. In this case, the number of Tx freeing buffers
+ * should be equal to refill_requirement.
+ *
+ * case 2: The refill head of Rx ring buffer does not need to be aligned
+ * with buffer ring size. In this case, the update of refill head can not
+ * exceed the Rx buffer ring size.
+ */
+ if (refill_requirement != n ||
+ (!refill_requirement && (refill_head + n > mbuf_ring_size)))
+ return 0;
+
+ /* First buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1).
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = recycle_rxq_info->mbuf_ring;
+ rxep += refill_head;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* Avoid txq contains buffers from unexpected mempool. */
+ if (unlikely(recycle_rxq_info->mp
+ != txep[0].mbuf->pool))
+ return 0;
+
+ /* Directly put mbufs from Tx to Rx. */
+ for (i = 0; i < n; i++)
+ rxep[i] = txep[i].mbuf;
+ } else {
+ for (i = 0; i < n; i++) {
+ rxep[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+ /* If Tx buffers are not the last reference or from
+ * unexpected mempool, previous copied buffers are
+ * considered as invalid.
+ */
+ if (unlikely((rxep[i] == NULL && refill_requirement) ||
+ recycle_rxq_info->mp != txep[i].mbuf->pool))
+ nb_recycle_mbufs = 0;
+ }
+ /* If Tx buffers are not the last reference or
+ * from unexpected mempool, all recycled buffers
+ * are put into mempool.
+ */
+ if (nb_recycle_mbufs == 0)
+ for (i = 0; i < n; i++) {
+ if (rxep[i] != NULL)
+ rte_mempool_put(rxep[i]->pool, rxep[i]);
+ }
+ }
+
+ /* Update counters for Tx. */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return nb_recycle_mbufs;
+}
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 954ef241a0..90b0a7004f 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2552,6 +2552,9 @@ ixgbe_set_tx_function(struct rte_eth_dev *dev, struct ixgbe_tx_queue *txq)
(rte_eal_process_type() != RTE_PROC_PRIMARY ||
ixgbe_txq_vec_setup(txq) == 0)) {
PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_tx_mbufs_reuse = ixgbe_recycle_tx_mbufs_reuse_vec;
+#endif
dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
} else
dev->tx_pkt_burst = ixgbe_xmit_pkts_simple;
@@ -4890,7 +4893,10 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
PMD_INIT_LOG(DEBUG, "Using Vector Scattered Rx "
"callback (port=%d).",
dev->data->port_id);
-
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_rx_descriptors_refill =
+ ixgbe_recycle_rx_descriptors_refill_vec;
+#endif
dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
} else if (adapter->rx_bulk_alloc_allowed) {
PMD_INIT_LOG(DEBUG, "Using a Scattered with bulk "
@@ -4919,7 +4925,9 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
"burst size no less than %d (port=%d).",
RTE_IXGBE_DESCS_PER_LOOP,
dev->data->port_id);
-
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_rx_descriptors_refill = ixgbe_recycle_rx_descriptors_refill_vec;
+#endif
dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
} else if (adapter->rx_bulk_alloc_allowed) {
PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
@@ -5691,6 +5699,31 @@ ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.tx_deferred_start = txq->tx_deferred_start;
}
+void
+ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct ixgbe_rx_queue *rxq;
+ struct ixgbe_adapter *adapter = dev->data->dev_private;
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+ recycle_rxq_info->mp = rxq->mb_pool;
+ recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+ recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+ if (adapter->rx_vec_allowed) {
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ recycle_rxq_info->refill_requirement = RTE_IXGBE_RXQ_REARM_THRESH;
+ recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+#endif
+ } else {
+ recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+ recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+ }
+}
+
/*
* [VF] Initializes Receive Unit.
*/
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 668a5b9814..ee89c89929 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -295,6 +295,10 @@ int ixgbe_dev_tx_done_cleanup(void *tx_queue, uint32_t free_cnt);
extern const uint32_t ptype_table[IXGBE_PACKET_TYPE_MAX];
extern const uint32_t ptype_table_tn[IXGBE_PACKET_TYPE_TN_MAX];
+uint16_t ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
uint16_t ixgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);
int ixgbe_txq_vec_setup(struct ixgbe_tx_queue *txq);
diff --git a/drivers/net/ixgbe/meson.build b/drivers/net/ixgbe/meson.build
index a18908ef7c..0ae12dd5ff 100644
--- a/drivers/net/ixgbe/meson.build
+++ b/drivers/net/ixgbe/meson.build
@@ -26,11 +26,13 @@ deps += ['hash', 'security']
if arch_subdir == 'x86'
sources += files('ixgbe_rxtx_vec_sse.c')
+ sources += files('ixgbe_recycle_mbufs_vec_common.c')
if is_windows and cc.get_id() != 'clang'
cflags += ['-fno-asynchronous-unwind-tables']
endif
elif arch_subdir == 'arm'
sources += files('ixgbe_rxtx_vec_neon.c')
+ sources += files('ixgbe_recycle_mbufs_vec_common.c')
endif
includes += include_directories('base')
--
2.25.1
* [PATCH v9 4/4] app/testpmd: add recycle mbufs engine
2023-08-02 8:08 ` [PATCH v9 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
` (2 preceding siblings ...)
2023-08-02 8:08 ` [PATCH v9 3/4] net/ixgbe: " Feifei Wang
@ 2023-08-02 8:08 ` Feifei Wang
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-02 8:08 UTC (permalink / raw)
To: Aman Singh, Yuying Zhang; +Cc: dev, nd, Feifei Wang, Jerin Jacob, Ruifeng Wang
Add a recycle mbufs engine for testpmd. This engine forwards packets in
I/O forward mode, but enables the mbufs recycle feature to recycle used
txq mbufs into the rxq mbuf ring, which can bypass the mempool path and
save CPU cycles.
Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
app/test-pmd/meson.build | 1 +
app/test-pmd/recycle_mbufs.c | 58 +++++++++++++++++++++
app/test-pmd/testpmd.c | 1 +
app/test-pmd/testpmd.h | 3 ++
doc/guides/testpmd_app_ug/run_app.rst | 1 +
doc/guides/testpmd_app_ug/testpmd_funcs.rst | 5 +-
6 files changed, 68 insertions(+), 1 deletion(-)
create mode 100644 app/test-pmd/recycle_mbufs.c
diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index d2e3f60892..6e5f067274 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -22,6 +22,7 @@ sources = files(
'macswap.c',
'noisy_vnf.c',
'parameters.c',
+ 'recycle_mbufs.c',
'rxonly.c',
'shared_rxq_fwd.c',
'testpmd.c',
diff --git a/app/test-pmd/recycle_mbufs.c b/app/test-pmd/recycle_mbufs.c
new file mode 100644
index 0000000000..6e9e1c5eb6
--- /dev/null
+++ b/app/test-pmd/recycle_mbufs.c
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include "testpmd.h"
+
+/*
+ * Forwarding of packets in I/O mode.
+ * Enable mbufs recycle mode to recycle txq used mbufs
+ * for rxq mbuf ring. This can bypass mempool path and
+ * save CPU cycles.
+ */
+static bool
+pkt_burst_recycle_mbufs(struct fwd_stream *fs)
+{
+ struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+ uint16_t nb_rx;
+
+ /* Recycle used mbufs from the txq, and move these mbufs into
+ * the rxq mbuf ring.
+ */
+ rte_eth_recycle_mbufs(fs->rx_port, fs->rx_queue,
+ fs->tx_port, fs->tx_queue, &(fs->recycle_rxq_info));
+
+ /*
+ * Receive a burst of packets and forward them.
+ */
+ nb_rx = common_fwd_stream_receive(fs, pkts_burst, nb_pkt_per_burst);
+ if (unlikely(nb_rx == 0))
+ return false;
+
+ common_fwd_stream_transmit(fs, pkts_burst, nb_rx);
+
+ return true;
+}
+
+static void
+recycle_mbufs_stream_init(struct fwd_stream *fs)
+{
+ int rc;
+
+ /* Retrieve information about given port's Rx queue
+ * for recycling mbufs.
+ */
+ rc = rte_eth_recycle_rx_queue_info_get(fs->rx_port,
+ fs->rx_queue, &(fs->recycle_rxq_info));
+ if (rc != 0)
+ TESTPMD_LOG(WARNING,
+ "Failed to get rx queue mbufs recycle info\n");
+
+ common_fwd_stream_init(fs);
+}
+
+struct fwd_engine recycle_mbufs_engine = {
+ .fwd_mode_name = "recycle_mbufs",
+ .stream_init = recycle_mbufs_stream_init,
+ .packet_fwd = pkt_burst_recycle_mbufs,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 938ca035d4..5d0f9ca119 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -199,6 +199,7 @@ struct fwd_engine * fwd_engines[] = {
&icmp_echo_engine,
&noisy_vnf_engine,
&five_tuple_swap_fwd_engine,
+ &recycle_mbufs_engine,
#ifdef RTE_LIBRTE_IEEE1588
&ieee1588_fwd_engine,
#endif
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f1df6a8faf..0eb8d7883a 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -188,6 +188,8 @@ struct fwd_stream {
struct pkt_burst_stats rx_burst_stats;
struct pkt_burst_stats tx_burst_stats;
struct fwd_lcore *lcore; /**< Lcore being scheduled. */
+ /**< Rx queue information for recycling mbufs */
+ struct rte_eth_recycle_rxq_info recycle_rxq_info;
};
/**
@@ -449,6 +451,7 @@ extern struct fwd_engine csum_fwd_engine;
extern struct fwd_engine icmp_echo_engine;
extern struct fwd_engine noisy_vnf_engine;
extern struct fwd_engine five_tuple_swap_fwd_engine;
+extern struct fwd_engine recycle_mbufs_engine;
#ifdef RTE_LIBRTE_IEEE1588
extern struct fwd_engine ieee1588_fwd_engine;
#endif
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 6e9c552e76..24a086401e 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -232,6 +232,7 @@ The command line options are:
noisy
5tswap
shared-rxq
+ recycle_mbufs
* ``--rss-ip``
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index a182479ab2..aef4de3e0e 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -318,7 +318,7 @@ set fwd
Set the packet forwarding mode::
testpmd> set fwd (io|mac|macswap|flowgen| \
- rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
+ rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq|recycle_mbufs) (""|retry)
``retry`` can be specified for forwarding engines except ``rx_only``.
@@ -364,6 +364,9 @@ The available information categories are:
* ``shared-rxq``: Receive only for shared Rx queue.
Resolve packet source port from mbuf and update stream statistics accordingly.
+* ``recycle_mbufs``: Recycle Tx queue used mbufs for Rx queue mbuf ring.
+ This mode uses fast path mbuf recycle feature and forwards packets in I/O mode.
+
Example::
testpmd> set fwd rxonly
--
2.25.1
* [PATCH v10 0/4] Recycle mbufs from Tx queue into Rx queue
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
` (9 preceding siblings ...)
2023-08-02 8:08 ` [PATCH v9 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
@ 2023-08-04 9:24 ` Feifei Wang
2023-08-04 9:24 ` [PATCH v10 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
` (3 more replies)
2023-08-22 7:27 ` [PATCH v11 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
` (2 subsequent siblings)
13 siblings, 4 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-04 9:24 UTC (permalink / raw)
Cc: dev, nd, Feifei Wang
Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.
This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache. i.e. the receive side will free the buffers from
transmit side directly into its software ring. This will avoid the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache. We call this mode mbufs recycle mode.
In the latest version, mbufs recycle mode is packaged as a separate API.
This allows users to change the rxq/txq pairing at run time in the data plane,
based on the application's analysis of the packet flow, for example:
-----------------------------------------------------------------------
Step 1: upper application analyse the flow direction
Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid)
Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info);
Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------
The above lets the user change the rxq/txq pairing at run time without needing to
know the direction of the flow in advance, which effectively widens the use scenarios
of mbufs recycle mode (see the sketch below).
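As an illustration only, below is a minimal sketch in C of such a run-to-completion
pairing loop on one lcore. It follows steps 2-5 above; the BURST_SIZE value, the
function name and the drop-on-full handling are assumptions made for the example,
and port setup/teardown and error handling are omitted.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Pair one Rx queue with one Tx queue on the same lcore and recycle used
 * Tx mbufs straight into the Rx mbuf ring on every iteration.
 */
static void
recycle_fwd_loop(uint16_t rx_port, uint16_t rx_queue,
        uint16_t tx_port, uint16_t tx_queue)
{
    struct rte_eth_recycle_rxq_info rxq_info;
    struct rte_mbuf *pkts[BURST_SIZE];
    uint16_t nb_rx, nb_tx;

    /* Step 2: retrieve the Rx mbuf ring information for this pairing. */
    if (rte_eth_recycle_rx_queue_info_get(rx_port, rx_queue, &rxq_info) != 0)
        return;

    for (;;) {
        /* Step 3: move used Tx mbufs into the Rx mbuf ring. */
        rte_eth_recycle_mbufs(rx_port, rx_queue,
                tx_port, tx_queue, &rxq_info);

        /* Steps 4-5: ordinary receive and transmit bursts. */
        nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        nb_tx = rte_eth_tx_burst(tx_port, tx_queue, pkts, nb_rx);
        /* Free anything the Tx queue could not accept. */
        while (nb_tx < nb_rx)
            rte_pktmbuf_free(pkts[nb_tx++]);
    }
}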
Furthermore, mbufs recycle mode is no longer limited to a single PMD: it can move
mbufs between PMDs from different vendors, and can even place the mbufs anywhere
into your Rx mbuf ring as long as the address of the mbuf ring can be provided.
In the latest version, we enable mbufs recycle mode in the i40e and ixgbe PMDs, and
by using the i40e driver on Rx and the ixgbe driver on Tx we achieve a 7-9%
performance improvement with mbufs recycle mode.
Difference between mbuf recycle, ZC API used in mempool and general path
For general path:
Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For mbufs recycle:
Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus, in one loop, mbufs recycle mode saves 32+32=64 pkts of memcpy compared to the general path,
and 32 pkts of memcpy per loop compared to the ZC API used in the mempool.
So mbufs recycle has its own benefits.
Testing status:
(1) dpdk l3fwd test with multiple drivers:
port 0: 82599 NIC port 1: XL710 NIC
-------------------------------------------------------------
Without fast free With fast free
Thunderx2: +7.53% +13.54%
-------------------------------------------------------------
(2) dpdk l3fwd test with same driver:
port 0 && 1: XL710 NIC
-------------------------------------------------------------
Without fast free With fast free
Ampere altra: +12.61% +11.42%
n1sdp: +8.30% +3.85%
x86-sse: +8.43% +3.72%
-------------------------------------------------------------
(3) Performance comparison with ZC_mempool used
port 0 && 1: XL710 NIC
with fast free
-------------------------------------------------------------
With recycle buffer With zc_mempool
Ampere altra: 11.42% 3.54%
-------------------------------------------------------------
Furthermore, we add a recycle_mbufs engine in testpmd. Because the XL710 NIC is
I/O-bound in testpmd on Ampere Altra, we cannot see a throughput change compared
with the I/O fwd engine. However, using the record command in testpmd:
'set record-burst-stats on'
we can see that the ratio of 'Rx/Tx burst size of 32' is reduced, which
indicates mbufs recycle can save CPU cycles (example session below).
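For example, one possible testpmd session to reproduce this observation (command
names are taken from the testpmd documentation changes above; output omitted):

    testpmd> set fwd recycle_mbufs
    testpmd> set record-burst-stats on
    testpmd> start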
V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
V3:
1. Separate Rx and Tx operations with two APIs in direct-rearm (Konstantin)
2. Delete L3fwd change for direct rearm (Jerin)
3. enable direct rearm in ixgbe driver in Arm
v4:
1. Rename direct-rearm as buffer recycle. Based on this, function names
and variable names are changed to make this mode more general for all
drivers. (Konstantin, Morten)
2. Add ring wrapping check (Konstantin)
v5:
1. some change for ethdev API (Morten)
2. add support for avx2, sse, altivec path
v6:
1. fix ixgbe build issue in ppc
2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
API wrapper (Tech Board meeting)
3. add recycle_mbufs engine in testpmd (Tech Board meeting)
4. add namespace in the functions related to mbufs recycle (Ferruh)
v7:
1. move 'rxq/txq data' pointers to the beginning of eth_dev structure,
in order to keep them in the same cache line as rx/tx_burst function
pointers (Morten)
2. add the extra description for 'rte_eth_recycle_mbufs' to show it can
support feeding 1 Rx queue from 2 Tx queues in the same thread
(Konstantin)
3. For i40e/ixgbe driver, make the previous copied buffers as invalid if
there are Tx buffers refcnt > 1 or from unexpected mempool (Konstantin)
4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
in testpmd fwd engine (Morten)
v8:
1. add arm/x86 build option to fix ixgbe build issue in ppc
v9:
1. delete duplicate file name for ixgbe
v10:
1. fix compile issue on windows
Feifei Wang (4):
ethdev: add API for mbufs recycle mode
net/i40e: implement mbufs recycle mode
net/ixgbe: implement mbufs recycle mode
app/testpmd: add recycle mbufs engine
app/test-pmd/meson.build | 1 +
app/test-pmd/recycle_mbufs.c | 58 ++++++
app/test-pmd/testpmd.c | 1 +
app/test-pmd/testpmd.h | 3 +
doc/guides/rel_notes/release_23_11.rst | 15 ++
doc/guides/testpmd_app_ug/run_app.rst | 1 +
doc/guides/testpmd_app_ug/testpmd_funcs.rst | 5 +-
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
.../net/i40e/i40e_recycle_mbufs_vec_common.c | 147 ++++++++++++++
drivers/net/i40e/i40e_rxtx.c | 32 ++++
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/meson.build | 1 +
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
.../ixgbe/ixgbe_recycle_mbufs_vec_common.c | 143 ++++++++++++++
drivers/net/ixgbe/ixgbe_rxtx.c | 37 +++-
drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
drivers/net/ixgbe/meson.build | 2 +
lib/ethdev/ethdev_driver.h | 10 +
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 31 +++
lib/ethdev/rte_ethdev.h | 181 ++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 23 ++-
lib/ethdev/version.map | 3 +
25 files changed, 702 insertions(+), 9 deletions(-)
create mode 100644 app/test-pmd/recycle_mbufs.c
create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
--
2.25.1
* [PATCH v10 1/4] ethdev: add API for mbufs recycle mode
2023-08-04 9:24 ` [PATCH v10 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
@ 2023-08-04 9:24 ` Feifei Wang
2023-08-04 9:24 ` [PATCH v10 2/4] net/i40e: implement " Feifei Wang
` (2 subsequent siblings)
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-04 9:24 UTC (permalink / raw)
To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang,
Morten Brørup
Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
APIs to recycle used mbufs from a transmit queue of an Ethernet device,
and move these mbufs into a mbuf ring for a receive queue of an Ethernet
device. This can bypass mempool 'put/get' operations, hence saving CPU
cycles.
For each batch of recycled mbufs, the rte_eth_recycle_mbufs() function performs
the following operations:
- Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf
ring.
- Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
from the Tx mbuf ring.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/rel_notes/release_23_11.rst | 15 ++
lib/ethdev/ethdev_driver.h | 10 ++
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 31 +++++
lib/ethdev/rte_ethdev.h | 181 +++++++++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 23 +++-
lib/ethdev/version.map | 3 +
7 files changed, 259 insertions(+), 6 deletions(-)
diff --git a/doc/guides/rel_notes/release_23_11.rst b/doc/guides/rel_notes/release_23_11.rst
index 6b4dd21fd0..fd16d267ae 100644
--- a/doc/guides/rel_notes/release_23_11.rst
+++ b/doc/guides/rel_notes/release_23_11.rst
@@ -55,6 +55,13 @@ New Features
Also, make sure to start the actual text at the margin.
=======================================================
+* **Add mbufs recycling support.**
+
+ Added ``rte_eth_recycle_rx_queue_info_get`` and ``rte_eth_recycle_mbufs``
+ APIs which allow the user to copy used mbufs from the Tx mbuf ring
+ into the Rx mbuf ring. This feature supports the case where the Rx Ethernet
+ device is different from the Tx Ethernet device, using the respective driver
+ callback functions in ``rte_eth_recycle_mbufs``.
Removed Items
-------------
@@ -100,6 +107,14 @@ ABI Changes
Also, make sure to start the actual text at the margin.
=======================================================
+* ethdev: Added ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+ fields to ``rte_eth_dev`` structure.
+
+* ethdev: Structure ``rte_eth_fp_ops`` was affected to add
+ ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+ fields, to move ``rxq`` and ``txq`` fields, to change the size of
+ ``reserved1`` and ``reserved2`` fields.
+
Known Issues
------------
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 980f837ab6..b0c55a8523 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -58,6 +58,10 @@ struct rte_eth_dev {
eth_rx_descriptor_status_t rx_descriptor_status;
/** Check the status of a Tx descriptor */
eth_tx_descriptor_status_t tx_descriptor_status;
+ /** Pointer to PMD transmit mbufs reuse function */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+ /** Pointer to PMD receive descriptors refill function */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
/**
* Device data that is shared between primary and secondary processes
@@ -507,6 +511,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
+typedef void (*eth_recycle_rxq_info_get_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
uint16_t queue_id, struct rte_eth_burst_mode *mode);
@@ -1250,6 +1258,8 @@ struct eth_dev_ops {
eth_rxq_info_get_t rxq_info_get;
/** Retrieve Tx queue information */
eth_txq_info_get_t txq_info_get;
+ /** Retrieve mbufs recycle Rx queue information */
+ eth_recycle_rxq_info_get_t recycle_rxq_info_get;
eth_burst_mode_get_t rx_burst_mode_get; /**< Get Rx burst mode */
eth_burst_mode_get_t tx_burst_mode_get; /**< Get Tx burst mode */
eth_fw_version_get_t fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 14ec8c6ccf..f8ab64f195 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
fpo->rx_queue_count = dev->rx_queue_count;
fpo->rx_descriptor_status = dev->rx_descriptor_status;
fpo->tx_descriptor_status = dev->tx_descriptor_status;
+ fpo->recycle_tx_mbufs_reuse = dev->recycle_tx_mbufs_reuse;
+ fpo->recycle_rx_descriptors_refill = dev->recycle_rx_descriptors_refill;
fpo->rxq.data = dev->data->rx_queues;
fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 0840d2b594..ea89a101a1 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -5876,6 +5876,37 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
return 0;
}
+int
+rte_eth_recycle_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_dev *dev;
+
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+ dev = &rte_eth_devices[port_id];
+
+ if (queue_id >= dev->data->nb_rx_queues) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+ return -EINVAL;
+ }
+
+ if (dev->data->rx_queues == NULL ||
+ dev->data->rx_queues[queue_id] == NULL) {
+ RTE_ETHDEV_LOG(ERR,
+ "Rx queue %"PRIu16" of device with port_id=%"
+ PRIu16" has not been setup\n",
+ queue_id, port_id);
+ return -EINVAL;
+ }
+
+ if (*dev->dev_ops->recycle_rxq_info_get == NULL)
+ return -ENOTSUP;
+
+ dev->dev_ops->recycle_rxq_info_get(dev, queue_id, recycle_rxq_info);
+
+ return 0;
+}
+
int
rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04a2564f22..9dc5749d83 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1820,6 +1820,30 @@ struct rte_eth_txq_info {
uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
} __rte_cache_min_aligned;
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice.
+ *
+ * Ethernet device Rx queue information structure for recycling mbufs.
+ * Used to retrieve Rx queue information when the Tx queue is reusing mbufs and
+ * moving them into the Rx mbuf ring.
+ */
+struct rte_eth_recycle_rxq_info {
+ struct rte_mbuf **mbuf_ring; /**< mbuf ring of Rx queue. */
+ struct rte_mempool *mp; /**< mempool of Rx queue. */
+ uint16_t *refill_head; /**< head of Rx queue refilling mbufs. */
+ uint16_t *receive_tail; /**< tail of Rx queue receiving pkts. */
+ uint16_t mbuf_ring_size; /**< configured number of mbuf ring size. */
+ /**
+ * Requirement on mbuf refilling batch size of Rx mbuf ring.
+ * For some PMD drivers, the number of Rx mbuf ring refilling mbufs
+ * should be aligned with mbuf ring size, in order to simplify
+ * ring wrapping around.
+ * Value 0 means that PMD drivers have no requirement for this.
+ */
+ uint16_t refill_requirement;
+} __rte_cache_min_aligned;
+
/* Generic Burst mode flag definition, values can be ORed. */
/**
@@ -4853,6 +4877,31 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Retrieve information about given port's Rx queue for recycling mbufs.
+ *
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The Rx queue on the Ethernet device for which information
+ * will be retrieved.
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* to be filled.
+ *
+ * @return
+ * - 0: Success
+ * - -ENODEV: If *port_id* is invalid.
+ * - -ENOTSUP: routine is not supported by the device PMD.
+ * - -EINVAL: The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_recycle_rx_queue_info_get(uint16_t port_id,
+ uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
/**
* Retrieve information about the Rx packet burst mode.
*
@@ -6527,6 +6576,138 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
}
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Recycle used mbufs from a transmit queue of an Ethernet device, and move
+ * these mbufs into a mbuf ring for a receive queue of an Ethernet device.
+ * This can bypass mempool path to save CPU cycles.
+ *
+ * The rte_eth_recycle_mbufs() function loops, with rte_eth_rx_burst() and
+ * rte_eth_tx_burst() functions, freeing Tx used mbufs and replenishing Rx
+ * descriptors. The number of recycled mbufs depends on the demand of the Rx mbuf
+ * ring, constrained by the number of used mbufs available from the Tx mbuf ring.
+ *
+ * For each batch of recycled mbufs, the rte_eth_recycle_mbufs() function performs the
+ * following operations:
+ *
+ * - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf ring.
+ *
+ * - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
+ * from the Tx mbuf ring.
+ *
+ * This function splits the Rx and Tx paths with different callback functions. The
+ * callback function recycle_tx_mbufs_reuse is for the Tx driver. The callback
+ * function recycle_rx_descriptors_refill is for the Rx driver. rte_eth_recycle_mbufs()
+ * can support the case where the Rx Ethernet device is different from the Tx Ethernet device.
+ *
+ * It is the responsibility of users to select the Rx/Tx queue pair to recycle
+ * mbufs. Before calling this function, users must call rte_eth_recycle_rx_queue_info_get
+ * to retrieve the selected Rx queue information.
+ * @see rte_eth_recycle_rx_queue_info_get, struct rte_eth_recycle_rxq_info
+ *
+ * Currently, the rte_eth_recycle_mbufs() function can feed 1 Rx queue from
+ * 2 Tx queues in the same thread. Do not pair the Rx queue and Tx queue from different
+ * threads, in order to avoid erroneous memory rewriting.
+ *
+ * @param rx_port_id
+ * Port identifying the receive side.
+ * @param rx_queue_id
+ * The index of the receive queue identifying the receive side.
+ * The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param tx_port_id
+ * Port identifying the transmit side.
+ * @param tx_queue_id
+ * The index of the transmit queue identifying the transmit side.
+ * The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* which contains
+ * the information of the Rx queue mbuf ring.
+ * @return
+ * The number of recycling mbufs.
+ */
+__rte_experimental
+static inline uint16_t
+rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
+ uint16_t tx_port_id, uint16_t tx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_fp_ops *p;
+ void *qd;
+ uint16_t nb_mbufs;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ if (tx_port_id >= RTE_MAX_ETHPORTS ||
+ tx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR,
+ "Invalid tx_port_id=%u or tx_queue_id=%u\n",
+ tx_port_id, tx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[tx_port_id];
+ qd = p->txq.data[tx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+ tx_queue_id, tx_port_id);
+ return 0;
+ }
+#endif
+ if (p->recycle_tx_mbufs_reuse == NULL)
+ return 0;
+
+ /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
+ * into Rx mbuf ring.
+ */
+ nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
+
+ /* If no recycling mbufs, return 0. */
+ if (nb_mbufs == 0)
+ return 0;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ if (rx_port_id >= RTE_MAX_ETHPORTS ||
+ rx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR, "Invalid rx_port_id=%u or rx_queue_id=%u\n",
+ rx_port_id, rx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[rx_port_id];
+ qd = p->rxq.data[rx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+ rx_queue_id, rx_port_id);
+ return 0;
+ }
+#endif
+
+ if (p->recycle_rx_descriptors_refill == NULL)
+ return 0;
+
+ /* Replenish the Rx descriptors with the recycling mbufs
+ * in the Rx mbuf ring.
+ */
+ p->recycle_rx_descriptors_refill(qd, nb_mbufs);
+
+ return nb_mbufs;
+}
+
/**
* @warning
* @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index 46e9721e07..a24ad7a6b2 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -55,6 +55,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
/** @internal Check the status of a Tx descriptor */
typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
+/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
+typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
+/** @internal Refill Rx descriptors with the recycling mbufs */
+typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
+
/**
* @internal
* Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -83,15 +90,17 @@ struct rte_eth_fp_ops {
* Rx fast-path functions and related data.
* 64-bit systems: occupies first 64B line
*/
+ /** Rx queues data. */
+ struct rte_ethdev_qdata rxq;
/** PMD receive function. */
eth_rx_burst_t rx_pkt_burst;
/** Get the number of used Rx descriptors. */
eth_rx_queue_count_t rx_queue_count;
/** Check the status of a Rx descriptor. */
eth_rx_descriptor_status_t rx_descriptor_status;
- /** Rx queues data. */
- struct rte_ethdev_qdata rxq;
- uintptr_t reserved1[3];
+ /** Refill Rx descriptors with the recycling mbufs. */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
+ uintptr_t reserved1[2];
/**@}*/
/**@{*/
@@ -99,15 +108,17 @@ struct rte_eth_fp_ops {
* Tx fast-path functions and related data.
* 64-bit systems: occupies second 64B line
*/
+ /** Tx queues data. */
+ struct rte_ethdev_qdata txq;
/** PMD transmit function. */
eth_tx_burst_t tx_pkt_burst;
/** PMD transmit prepare function. */
eth_tx_prep_t tx_pkt_prepare;
/** Check the status of a Tx descriptor. */
eth_tx_descriptor_status_t tx_descriptor_status;
- /** Tx queues data. */
- struct rte_ethdev_qdata txq;
- uintptr_t reserved2[3];
+ /** Copy used mbufs from Tx mbuf ring into Rx. */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+ uintptr_t reserved2[2];
/**@}*/
} __rte_cache_aligned;
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index b965d6aa52..eec159dfdd 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -312,6 +312,9 @@ EXPERIMENTAL {
rte_flow_async_action_list_handle_query_update;
rte_flow_async_actions_update;
rte_flow_restore_info_dynflag;
+
+ # added in 23.11
+ rte_eth_recycle_rx_queue_info_get;
};
INTERNAL {
--
2.25.1
* [PATCH v10 2/4] net/i40e: implement mbufs recycle mode
2023-08-04 9:24 ` [PATCH v10 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
2023-08-04 9:24 ` [PATCH v10 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
@ 2023-08-04 9:24 ` Feifei Wang
2023-08-04 9:24 ` [PATCH v10 3/4] net/ixgbe: " Feifei Wang
2023-08-04 9:24 ` [PATCH v10 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-04 9:24 UTC (permalink / raw)
To: Yuying Zhang, Beilei Xing
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Define the specific function implementations for the i40e driver.
Currently, mbufs recycle mode supports the 128-bit vector path and
the avx2 path, and can be enabled in both fast-free and
non-fast-free modes.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
.../net/i40e/i40e_recycle_mbufs_vec_common.c | 147 ++++++++++++++++++
drivers/net/i40e/i40e_rxtx.c | 32 ++++
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/meson.build | 1 +
6 files changed, 187 insertions(+)
create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 8271bbb394..50ba9aac94 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
.flow_ops_get = i40e_dev_flow_ops_get,
.rxq_info_get = i40e_rxq_info_get,
.txq_info_get = i40e_txq_info_get,
+ .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
.rx_burst_mode_get = i40e_rx_burst_mode_get,
.tx_burst_mode_get = i40e_tx_burst_mode_get,
.timesync_enable = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 6f65d5e0ac..af758798e1 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_rxq_info *qinfo);
void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_burst_mode *mode);
int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..5663ecccde
--- /dev/null
+++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "base/i40e_prototype.h"
+#include "base/i40e_type.h"
+#include "i40e_ethdev.h"
+#include "i40e_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+ struct i40e_rx_queue *rxq = rx_queue;
+ struct i40e_rx_entry *rxep;
+ volatile union i40e_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64_t paddr;
+ uint64_t dma_addr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_mbufs; i++) {
+ /* Initialize rxdp descs. */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = rte_cpu_to_le_64(paddr);
+ /* flush desc with pa dma_addr */
+ rxdp[i].read.hdr_addr = 0;
+ rxdp[i].read.pkt_addr = dma_addr;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_mbufs;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = 0;
+ rx_id = rxq->nb_rx_desc - 1;
+ }
+
+ rxq->rxrearm_nb -= nb_mbufs;
+
+ rte_io_wmb();
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
+uint16_t
+i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_tx_queue *txq = tx_queue;
+ struct i40e_tx_entry *txep;
+ struct rte_mbuf **rxep;
+ int i, n;
+ uint16_t nb_recycle_mbufs;
+ uint16_t avail = 0;
+ uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+ uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+ uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+ uint16_t refill_head = *recycle_rxq_info->refill_head;
+ uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+ /* Get available recycling Rx buffers. */
+ avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+ /* Check Tx free thresh and Rx available space. */
+ if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+ nb_recycle_mbufs = n;
+
+ /* Mbufs recycle mode can only support no ring buffer wrapping around.
+ * Two case for this:
+ *
+ * case 1: The refill head of Rx buffer ring needs to be aligned with
+ * mbuf ring size. In this case, the number of Tx freeing buffers
+ * should be equal to refill_requirement.
+ *
+ * case 2: The refill head of Rx ring buffer does not need to be aligned
+ * with mbuf ring size. In this case, the update of refill head can not
+ * exceed the Rx mbuf ring size.
+ */
+ if (refill_requirement != n ||
+ (!refill_requirement && (refill_head + n > mbuf_ring_size)))
+ return 0;
+
+ /* First buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1).
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = recycle_rxq_info->mbuf_ring;
+ rxep += refill_head;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* Avoid txq contains buffers from unexpected mempool. */
+ if (unlikely(recycle_rxq_info->mp
+ != txep[0].mbuf->pool))
+ return 0;
+
+ /* Directly put mbufs from Tx to Rx. */
+ for (i = 0; i < n; i++)
+ rxep[i] = txep[i].mbuf;
+ } else {
+ for (i = 0; i < n; i++) {
+ rxep[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+ /* If Tx buffers are not the last reference or from
+ * unexpected mempool, previous copied buffers are
+ * considered as invalid.
+ */
+ if (unlikely((rxep[i] == NULL && refill_requirement) ||
+ recycle_rxq_info->mp != txep[i].mbuf->pool))
+ nb_recycle_mbufs = 0;
+ }
+ /* If Tx buffers are not the last reference or
+ * from unexpected mempool, all recycled buffers
+ * are put into mempool.
+ */
+ if (nb_recycle_mbufs == 0)
+ for (i = 0; i < n; i++) {
+ if (rxep[i] != NULL)
+ rte_mempool_put(rxep[i]->pool, rxep[i]);
+ }
+ }
+
+ /* Update counters for Tx. */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return nb_recycle_mbufs;
+}
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index b4f65b58fa..a9c9eb331c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3199,6 +3199,30 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.offloads = txq->offloads;
}
+void
+i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_rx_queue *rxq;
+ struct i40e_adapter *ad =
+ I40E_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+ recycle_rxq_info->mp = rxq->mp;
+ recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+ recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+ if (ad->rx_vec_allowed) {
+ recycle_rxq_info->refill_requirement = RTE_I40E_RXQ_REARM_THRESH;
+ recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+ } else {
+ recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+ recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+ }
+}
+
#ifdef RTE_ARCH_X86
static inline bool
get_avx_supported(bool request_avx512)
@@ -3293,6 +3317,8 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
dev->rx_pkt_burst = ad->rx_use_avx2 ?
i40e_recv_scattered_pkts_vec_avx2 :
i40e_recv_scattered_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
}
} else {
if (ad->rx_use_avx512) {
@@ -3311,9 +3337,12 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
dev->rx_pkt_burst = ad->rx_use_avx2 ?
i40e_recv_pkts_vec_avx2 :
i40e_recv_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
}
}
#else /* RTE_ARCH_X86 */
+ dev->recycle_rx_descriptors_refill = i40e_recycle_rx_descriptors_refill_vec;
if (dev->data->scattered_rx) {
PMD_INIT_LOG(DEBUG,
"Using Vector Scattered Rx (port %d).",
@@ -3481,15 +3510,18 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
dev->tx_pkt_burst = ad->tx_use_avx2 ?
i40e_xmit_pkts_vec_avx2 :
i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
}
#else /* RTE_ARCH_X86 */
PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
dev->data->port_id);
dev->tx_pkt_burst = i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
#endif /* RTE_ARCH_X86 */
} else {
PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
dev->tx_pkt_burst = i40e_xmit_pkts_simple;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
}
dev->tx_pkt_prepare = i40e_simple_prep_pkts;
} else {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index a8686224e5..b191f23e1f 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -236,6 +236,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
+uint16_t i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);
uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
diff --git a/drivers/net/i40e/meson.build b/drivers/net/i40e/meson.build
index 8e53b87a65..3b1a233c84 100644
--- a/drivers/net/i40e/meson.build
+++ b/drivers/net/i40e/meson.build
@@ -34,6 +34,7 @@ sources = files(
'i40e_tm.c',
'i40e_hash.c',
'i40e_vf_representor.c',
+ 'i40e_recycle_mbufs_vec_common.c',
'rte_pmd_i40e.c',
)
--
2.25.1
* [PATCH v10 3/4] net/ixgbe: implement mbufs recycle mode
2023-08-04 9:24 ` [PATCH v10 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
2023-08-04 9:24 ` [PATCH v10 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
2023-08-04 9:24 ` [PATCH v10 2/4] net/i40e: implement " Feifei Wang
@ 2023-08-04 9:24 ` Feifei Wang
2023-08-04 9:24 ` [PATCH v10 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-04 9:24 UTC (permalink / raw)
To: Qiming Yang, Wenjun Wu
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Define the specific function implementations for the ixgbe driver.
Currently, recycle buffer mode supports the 128-bit vector path
and can be enabled in both fast-free and non-fast-free modes.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
.../ixgbe/ixgbe_recycle_mbufs_vec_common.c | 143 ++++++++++++++++++
drivers/net/ixgbe/ixgbe_rxtx.c | 37 ++++-
drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
drivers/net/ixgbe/meson.build | 2 +
6 files changed, 188 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 14a7d571e0..ea4c9dd561 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -543,6 +543,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
.set_mc_addr_list = ixgbe_dev_set_mc_addr_list,
.rxq_info_get = ixgbe_rxq_info_get,
.txq_info_get = ixgbe_txq_info_get,
+ .recycle_rxq_info_get = ixgbe_recycle_rxq_info_get,
.timesync_enable = ixgbe_timesync_enable,
.timesync_disable = ixgbe_timesync_disable,
.timesync_read_rx_timestamp = ixgbe_timesync_read_rx_timestamp,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 1291e9099c..22fc3be3d8 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -626,6 +626,9 @@ void ixgbe_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
void ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
int ixgbevf_dev_rx_init(struct rte_eth_dev *dev);
void ixgbevf_dev_tx_init(struct rte_eth_dev *dev);
diff --git a/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..9a8cc86954
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "ixgbe_ethdev.h"
+#include "ixgbe_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+ struct ixgbe_rx_queue *rxq = rx_queue;
+ struct ixgbe_rx_entry *rxep;
+ volatile union ixgbe_adv_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64_t paddr;
+ uint64_t dma_addr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_mbufs; i++) {
+ /* Initialize rxdp descs. */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = rte_cpu_to_le_64(paddr);
+ /* Flush descriptors with pa dma_addr */
+ rxdp[i].read.hdr_addr = 0;
+ rxdp[i].read.pkt_addr = dma_addr;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_mbufs;
+ if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+ rxq->rxrearm_start = 0;
+
+ rxq->rxrearm_nb -= nb_mbufs;
+
+ rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+ (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+ /* Update the tail pointer on the NIC */
+ IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
+}
+
+uint16_t
+ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct ixgbe_tx_queue *txq = tx_queue;
+ struct ixgbe_tx_entry *txep;
+ struct rte_mbuf **rxep;
+ int i, n;
+ uint32_t status;
+ uint16_t nb_recycle_mbufs;
+ uint16_t avail = 0;
+ uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+ uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+ uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+ uint16_t refill_head = *recycle_rxq_info->refill_head;
+ uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+ /* Get available recycling Rx buffers. */
+ avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+ /* Check Tx free thresh and Rx available space. */
+ if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ status = txq->tx_ring[txq->tx_next_dd].wb.status;
+ if (!(status & IXGBE_ADVTXD_STAT_DD))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+ nb_recycle_mbufs = n;
+
+ /* Mbufs recycle can only support no ring buffer wrapping around.
+ * Two case for this:
+ *
+ * case 1: The refill head of Rx buffer ring needs to be aligned with
+ * buffer ring size. In this case, the number of Tx freeing buffers
+ * should be equal to refill_requirement.
+ *
+ * case 2: The refill head of Rx ring buffer does not need to be aligned
+ * with buffer ring size. In this case, the update of refill head can not
+ * exceed the Rx buffer ring size.
+ */
+ if (refill_requirement != n ||
+ (!refill_requirement && (refill_head + n > mbuf_ring_size)))
+ return 0;
+
+ /* First buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1).
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = recycle_rxq_info->mbuf_ring;
+ rxep += refill_head;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* Avoid txq contains buffers from unexpected mempool. */
+ if (unlikely(recycle_rxq_info->mp
+ != txep[0].mbuf->pool))
+ return 0;
+
+ /* Directly put mbufs from Tx to Rx. */
+ for (i = 0; i < n; i++)
+ rxep[i] = txep[i].mbuf;
+ } else {
+ for (i = 0; i < n; i++) {
+ rxep[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+ /* If Tx buffers are not the last reference or from
+ * unexpected mempool, previous copied buffers are
+ * considered as invalid.
+ */
+ if (unlikely((rxep[i] == NULL && refill_requirement) ||
+ recycle_rxq_info->mp != txep[i].mbuf->pool))
+ nb_recycle_mbufs = 0;
+ }
+ /* If Tx buffers are not the last reference or
+ * from unexpected mempool, all recycled buffers
+ * are put into mempool.
+ */
+ if (nb_recycle_mbufs == 0)
+ for (i = 0; i < n; i++) {
+ if (rxep[i] != NULL)
+ rte_mempool_put(rxep[i]->pool, rxep[i]);
+ }
+ }
+
+ /* Update counters for Tx. */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return nb_recycle_mbufs;
+}
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 954ef241a0..90b0a7004f 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2552,6 +2552,9 @@ ixgbe_set_tx_function(struct rte_eth_dev *dev, struct ixgbe_tx_queue *txq)
(rte_eal_process_type() != RTE_PROC_PRIMARY ||
ixgbe_txq_vec_setup(txq) == 0)) {
PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_tx_mbufs_reuse = ixgbe_recycle_tx_mbufs_reuse_vec;
+#endif
dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
} else
dev->tx_pkt_burst = ixgbe_xmit_pkts_simple;
@@ -4890,7 +4893,10 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
PMD_INIT_LOG(DEBUG, "Using Vector Scattered Rx "
"callback (port=%d).",
dev->data->port_id);
-
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_rx_descriptors_refill =
+ ixgbe_recycle_rx_descriptors_refill_vec;
+#endif
dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
} else if (adapter->rx_bulk_alloc_allowed) {
PMD_INIT_LOG(DEBUG, "Using a Scattered with bulk "
@@ -4919,7 +4925,9 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
"burst size no less than %d (port=%d).",
RTE_IXGBE_DESCS_PER_LOOP,
dev->data->port_id);
-
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_rx_descriptors_refill = ixgbe_recycle_rx_descriptors_refill_vec;
+#endif
dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
} else if (adapter->rx_bulk_alloc_allowed) {
PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
@@ -5691,6 +5699,31 @@ ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.tx_deferred_start = txq->tx_deferred_start;
}
+void
+ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct ixgbe_rx_queue *rxq;
+ struct ixgbe_adapter *adapter = dev->data->dev_private;
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+ recycle_rxq_info->mp = rxq->mb_pool;
+ recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+ recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+ if (adapter->rx_vec_allowed) {
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ recycle_rxq_info->refill_requirement = RTE_IXGBE_RXQ_REARM_THRESH;
+ recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+#endif
+ } else {
+ recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+ recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+ }
+}
+
/*
* [VF] Initializes Receive Unit.
*/
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.h b/drivers/net/ixgbe/ixgbe_rxtx.h
index 668a5b9814..ee89c89929 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.h
+++ b/drivers/net/ixgbe/ixgbe_rxtx.h
@@ -295,6 +295,10 @@ int ixgbe_dev_tx_done_cleanup(void *tx_queue, uint32_t free_cnt);
extern const uint32_t ptype_table[IXGBE_PACKET_TYPE_MAX];
extern const uint32_t ptype_table_tn[IXGBE_PACKET_TYPE_TN_MAX];
+uint16_t ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
uint16_t ixgbe_xmit_fixed_burst_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
uint16_t nb_pkts);
int ixgbe_txq_vec_setup(struct ixgbe_tx_queue *txq);
diff --git a/drivers/net/ixgbe/meson.build b/drivers/net/ixgbe/meson.build
index a18908ef7c..0ae12dd5ff 100644
--- a/drivers/net/ixgbe/meson.build
+++ b/drivers/net/ixgbe/meson.build
@@ -26,11 +26,13 @@ deps += ['hash', 'security']
if arch_subdir == 'x86'
sources += files('ixgbe_rxtx_vec_sse.c')
+ sources += files('ixgbe_recycle_mbufs_vec_common.c')
if is_windows and cc.get_id() != 'clang'
cflags += ['-fno-asynchronous-unwind-tables']
endif
elif arch_subdir == 'arm'
sources += files('ixgbe_rxtx_vec_neon.c')
+ sources += files('ixgbe_recycle_mbufs_vec_common.c')
endif
includes += include_directories('base')
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v10 4/4] app/testpmd: add recycle mbufs engine
2023-08-04 9:24 ` [PATCH v10 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
` (2 preceding siblings ...)
2023-08-04 9:24 ` [PATCH v10 3/4] net/ixgbe: " Feifei Wang
@ 2023-08-04 9:24 ` Feifei Wang
3 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-04 9:24 UTC (permalink / raw)
To: Aman Singh, Yuying Zhang; +Cc: dev, nd, Feifei Wang, Jerin Jacob, Ruifeng Wang
Add a recycle mbufs engine for testpmd. This engine forwards packets in
I/O forward mode, but enables the mbufs recycle feature to recycle used
txq mbufs into the rxq mbuf ring, which can bypass the mempool path and
save CPU cycles.
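A minimal usage sketch, assuming the ports are already configured; the
forward-mode name is the one added by this patch:
	testpmd> set fwd recycle_mbufs
	testpmd> start
The mode can also be selected at launch time with the documented
--forward-mode=recycle_mbufs option.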
Suggested-by: Jerin Jacob <jerinjacobk@gmail.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
app/test-pmd/meson.build | 1 +
app/test-pmd/recycle_mbufs.c | 58 +++++++++++++++++++++
app/test-pmd/testpmd.c | 1 +
app/test-pmd/testpmd.h | 3 ++
doc/guides/testpmd_app_ug/run_app.rst | 1 +
doc/guides/testpmd_app_ug/testpmd_funcs.rst | 5 +-
6 files changed, 68 insertions(+), 1 deletion(-)
create mode 100644 app/test-pmd/recycle_mbufs.c
diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index d2e3f60892..6e5f067274 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -22,6 +22,7 @@ sources = files(
'macswap.c',
'noisy_vnf.c',
'parameters.c',
+ 'recycle_mbufs.c',
'rxonly.c',
'shared_rxq_fwd.c',
'testpmd.c',
diff --git a/app/test-pmd/recycle_mbufs.c b/app/test-pmd/recycle_mbufs.c
new file mode 100644
index 0000000000..6e9e1c5eb6
--- /dev/null
+++ b/app/test-pmd/recycle_mbufs.c
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include "testpmd.h"
+
+/*
+ * Forwarding of packets in I/O mode.
+ * Enable mbufs recycle mode to recycle txq used mbufs
+ * for rxq mbuf ring. This can bypass mempool path and
+ * save CPU cycles.
+ */
+static bool
+pkt_burst_recycle_mbufs(struct fwd_stream *fs)
+{
+ struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+ uint16_t nb_rx;
+
+ /* Recycle used mbufs from the txq, and move these mbufs into
+ * the rxq mbuf ring.
+ */
+ rte_eth_recycle_mbufs(fs->rx_port, fs->rx_queue,
+ fs->tx_port, fs->tx_queue, &(fs->recycle_rxq_info));
+
+ /*
+ * Receive a burst of packets and forward them.
+ */
+ nb_rx = common_fwd_stream_receive(fs, pkts_burst, nb_pkt_per_burst);
+ if (unlikely(nb_rx == 0))
+ return false;
+
+ common_fwd_stream_transmit(fs, pkts_burst, nb_rx);
+
+ return true;
+}
+
+static void
+recycle_mbufs_stream_init(struct fwd_stream *fs)
+{
+ int rc;
+
+ /* Retrieve information about given port's Rx queue
+ * for recycling mbufs.
+ */
+ rc = rte_eth_recycle_rx_queue_info_get(fs->rx_port,
+ fs->rx_queue, &(fs->recycle_rxq_info));
+ if (rc != 0)
+ TESTPMD_LOG(WARNING,
+ "Failed to get rx queue mbufs recycle info\n");
+
+ common_fwd_stream_init(fs);
+}
+
+struct fwd_engine recycle_mbufs_engine = {
+ .fwd_mode_name = "recycle_mbufs",
+ .stream_init = recycle_mbufs_stream_init,
+ .packet_fwd = pkt_burst_recycle_mbufs,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 938ca035d4..5d0f9ca119 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -199,6 +199,7 @@ struct fwd_engine * fwd_engines[] = {
&icmp_echo_engine,
&noisy_vnf_engine,
&five_tuple_swap_fwd_engine,
+ &recycle_mbufs_engine,
#ifdef RTE_LIBRTE_IEEE1588
&ieee1588_fwd_engine,
#endif
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f1df6a8faf..0eb8d7883a 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -188,6 +188,8 @@ struct fwd_stream {
struct pkt_burst_stats rx_burst_stats;
struct pkt_burst_stats tx_burst_stats;
struct fwd_lcore *lcore; /**< Lcore being scheduled. */
+ /**< Rx queue information for recycling mbufs */
+ struct rte_eth_recycle_rxq_info recycle_rxq_info;
};
/**
@@ -449,6 +451,7 @@ extern struct fwd_engine csum_fwd_engine;
extern struct fwd_engine icmp_echo_engine;
extern struct fwd_engine noisy_vnf_engine;
extern struct fwd_engine five_tuple_swap_fwd_engine;
+extern struct fwd_engine recycle_mbufs_engine;
#ifdef RTE_LIBRTE_IEEE1588
extern struct fwd_engine ieee1588_fwd_engine;
#endif
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 6e9c552e76..24a086401e 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -232,6 +232,7 @@ The command line options are:
noisy
5tswap
shared-rxq
+ recycle_mbufs
* ``--rss-ip``
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index a182479ab2..aef4de3e0e 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -318,7 +318,7 @@ set fwd
Set the packet forwarding mode::
testpmd> set fwd (io|mac|macswap|flowgen| \
- rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq) (""|retry)
+ rxonly|txonly|csum|icmpecho|noisy|5tswap|shared-rxq|recycle_mbufs) (""|retry)
``retry`` can be specified for forwarding engines except ``rx_only``.
@@ -364,6 +364,9 @@ The available information categories are:
* ``shared-rxq``: Receive only for shared Rx queue.
Resolve packet source port from mbuf and update stream statistics accordingly.
+* ``recycle_mbufs``: Recycle Tx queue used mbufs for Rx queue mbuf ring.
+ This mode uses fast path mbuf recycle feature and forwards packets in I/O mode.
+
Example::
testpmd> set fwd rxonly
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v11 0/4] Recycle mbufs from Tx queue into Rx queue
2022-04-20 8:16 [PATCH v1 0/5] Direct re-arming of buffers on receive side Feifei Wang
` (10 preceding siblings ...)
2023-08-04 9:24 ` [PATCH v10 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
@ 2023-08-22 7:27 ` Feifei Wang
2023-08-22 7:27 ` [PATCH v11 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
` (5 more replies)
2023-08-24 7:36 ` [PATCH v12 " Feifei Wang
2023-09-25 3:19 ` [PATCH v13 " Feifei Wang
13 siblings, 6 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-22 7:27 UTC (permalink / raw)
Cc: dev, nd, Feifei Wang
Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers resulting in 32*8=256B of stores to
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of load from the lcore cache.
This patch proposes a mechanism to avoid freeing to/allocating from
the lcore cache. i.e. the receive side will free the buffers from
transmit side directly into its software ring. This will avoid the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache. We call this mode 'mbufs recycle mode'.
In the latest version, mbufs recycle mode is packaged as a separate API.
This allows users to change the rxq/txq pairing at run time in the data plane,
according to the application's analysis of the packet flow, for example:
-----------------------------------------------------------------------
Step 1: upper application analyse the flow direction
Step 2: recycle_rxq_info = rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid)
Step 3: rte_eth_recycle_mbufs(rx_portid, rx_queueid, tx_portid, tx_queueid, recycle_rxq_info);
Step 4: rte_eth_rx_burst(rx_portid,rx_queueid);
Step 5: rte_eth_tx_burst(tx_portid,tx_queueid);
-----------------------------------------------------------------------
The flow above lets the user change the rxq/txq pairing at run time, without needing
to know the direction of a flow in advance. This effectively expands the use scenarios
of mbufs recycle mode; a minimal C sketch of such a loop is shown below.
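A minimal C sketch of the loop above, assuming ports and queues are already
configured and started. The rxq/txq pairing decision (Step 1) is left to the
application, and the port/queue ids and burst size below are illustrative
assumptions, not part of the patches:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

static void
recycle_io_loop(uint16_t rx_portid, uint16_t rx_queueid,
		uint16_t tx_portid, uint16_t tx_queueid)
{
	struct rte_eth_recycle_rxq_info recycle_rxq_info;
	struct rte_mbuf *pkts[BURST];
	uint16_t nb_rx, nb_tx;

	/* Step 2: fetch the Rx queue's mbuf ring information once the
	 * application has decided on the rxq/txq pairing.
	 */
	if (rte_eth_recycle_rx_queue_info_get(rx_portid, rx_queueid,
			&recycle_rxq_info) != 0)
		return;

	for (;;) {
		/* Step 3: move used Tx mbufs straight into the Rx mbuf ring. */
		rte_eth_recycle_mbufs(rx_portid, rx_queueid,
				tx_portid, tx_queueid, &recycle_rxq_info);

		/* Steps 4-5: the usual receive and transmit bursts. */
		nb_rx = rte_eth_rx_burst(rx_portid, rx_queueid, pkts, BURST);
		if (nb_rx == 0)
			continue;

		nb_tx = rte_eth_tx_burst(tx_portid, tx_queueid, pkts, nb_rx);

		/* Free anything the Tx queue could not accept. */
		while (nb_tx < nb_rx)
			rte_pktmbuf_free(pkts[nb_tx++]);
	}
}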
Furthermore, mbufs recycle mode is no longer limited to a single PMD: it can move
mbufs between PMDs from different vendors, and can even place the mbufs anywhere
into the Rx mbuf ring as long as the address of the mbuf ring is provided.
In the latest version, we enable mbufs recycle mode in the i40e and ixgbe PMDs. We also
tried using the i40e driver on Rx and the ixgbe driver on Tx, and achieved a 7-9%
performance improvement with mbufs recycle mode.
Differences between mbufs recycle mode, the ZC API used in mempool, and the general path:
For general path:
Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
Tx: 32 pkts memcpy from tx_sw_ring to temporary variable + 32 pkts memcpy from temporary variable to mempool cache
For ZC API used in mempool:
Rx: 32 pkts memcpy from mempool cache to rx_sw_ring
Tx: 32 pkts memcpy from tx_sw_ring to zero-copy mempool cache
Refer link: http://patches.dpdk.org/project/dpdk/patch/20230221055205.22984-2-kamalakshitha.aligeri@arm.com/
For mbufs recycle:
Rx/Tx: 32 pkts memcpy from tx_sw_ring to rx_sw_ring
Thus, in one loop, compared to the general path, mbufs recycle mode saves 32+32=64 pkts memcpy;
compared to the ZC API used in mempool, it saves 32 pkts memcpy per loop.
So mbufs recycle mode has its own benefits.
Testing status:
(1) dpdk l3fwd test with multiple drivers:
port 0: 82599 NIC port 1: XL710 NIC
-------------------------------------------------------------
Without fast free With fast free
Thunderx2: +7.53% +13.54%
-------------------------------------------------------------
(2) dpdk l3fwd test with same driver:
port 0 && 1: XL710 NIC
-------------------------------------------------------------
Without fast free With fast free
Ampere altra: +12.61% +11.42%
n1sdp: +8.30% +3.85%
x86-sse: +8.43% +3.72%
-------------------------------------------------------------
(3) Performance comparison with ZC_mempool used
port 0 && 1: XL710 NIC
with fast free
-------------------------------------------------------------
With recycle buffer With zc_mempool
Ampere altra: 11.42% 3.54%
-------------------------------------------------------------
Furthermore, we add a recycle_mbufs engine in testpmd. Because the XL710 NIC is
the I/O bottleneck for testpmd on Ampere Altra, we cannot see a throughput change
compared with the I/O fwd engine. However, after enabling
'set record-burst-stats on'
in testpmd, we can see that the ratio of 'Rx/Tx burst size of 32' is reduced,
which indicates that mbufs recycle mode saves CPU cycles.
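For reference, a minimal testpmd command sequence for this check, assuming the
ports are already set up; once recording is enabled, the burst-size histogram
appears in the forwarding statistics (for example via 'show fwd stats all' or
in the summary printed at 'stop'):
	testpmd> set fwd recycle_mbufs
	testpmd> set record-burst-stats on
	testpmd> start
	testpmd> stop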
V2:
1. Use data-plane API to enable direct-rearm (Konstantin, Honnappa)
2. Add 'txq_data_get' API to get txq info for Rx (Konstantin)
3. Use input parameter to enable direct rearm in l3fwd (Konstantin)
4. Add condition detection for direct rearm API (Morten, Andrew Rybchenko)
V3:
1. Separate Rx and Tx operations with two APIs in direct-rearm (Konstantin)
2. Delete L3fwd change for direct rearm (Jerin)
3. Enable direct rearm in the ixgbe driver on Arm
v4:
1. Rename direct-rearm as buffer recycle. Based on this, function and
variable names are changed to make this mode more general for all
drivers. (Konstantin, Morten)
2. Add ring wrapping check (Konstantin)
v5:
1. some change for ethdev API (Morten)
2. add support for avx2, sse, altivec path
v6:
1. fix ixgbe build issue in ppc
2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
API wrapper (Tech Board meeting)
3. add recycle_mbufs engine in testpmd (Tech Board meeting)
4. Add namespace in the functions related to mbufs recycle (Ferruh)
v7:
1. move 'rxq/txq data' pointers to the beginning of eth_dev structure,
in order to keep them in the same cache line as rx/tx_burst function
pointers (Morten)
2. add the extra description for 'rte_eth_recycle_mbufs' to show it can
support feeding 1 Rx queue from 2 Tx queues in the same thread
(Konstantin)
3. For i40e/ixgbe driver, make the previous copied buffers as invalid if
there are Tx buffers refcnt > 1 or from unexpected mempool (Konstantin)
4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
in testpmd fwd engine (Morten)
v8:
1. add arm/x86 build option to fix ixgbe build issue in ppc
v9:
1. delete duplicate file name for ixgbe
v10:
1. fix compile issue on windows
v11:
1. fix doc warning
Feifei Wang (4):
ethdev: add API for mbufs recycle mode
net/i40e: implement mbufs recycle mode
net/ixgbe: implement mbufs recycle mode
app/testpmd: add recycle mbufs engine
app/test-pmd/meson.build | 1 +
app/test-pmd/recycle_mbufs.c | 58 ++++++
app/test-pmd/testpmd.c | 1 +
app/test-pmd/testpmd.h | 3 +
doc/guides/rel_notes/release_23_11.rst | 15 ++
doc/guides/testpmd_app_ug/run_app.rst | 1 +
doc/guides/testpmd_app_ug/testpmd_funcs.rst | 5 +-
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
.../net/i40e/i40e_recycle_mbufs_vec_common.c | 147 ++++++++++++++
drivers/net/i40e/i40e_rxtx.c | 32 ++++
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/meson.build | 1 +
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
.../ixgbe/ixgbe_recycle_mbufs_vec_common.c | 143 ++++++++++++++
drivers/net/ixgbe/ixgbe_rxtx.c | 37 +++-
drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
drivers/net/ixgbe/meson.build | 2 +
lib/ethdev/ethdev_driver.h | 10 +
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 31 +++
lib/ethdev/rte_ethdev.h | 181 ++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 23 ++-
lib/ethdev/version.map | 3 +
25 files changed, 702 insertions(+), 9 deletions(-)
create mode 100644 app/test-pmd/recycle_mbufs.c
create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v11 1/4] ethdev: add API for mbufs recycle mode
2023-08-22 7:27 ` [PATCH v11 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
@ 2023-08-22 7:27 ` Feifei Wang
2023-08-22 14:02 ` Stephen Hemminger
` (2 more replies)
2023-08-22 7:27 ` [PATCH v11 2/4] net/i40e: implement " Feifei Wang
` (4 subsequent siblings)
5 siblings, 3 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-22 7:27 UTC (permalink / raw)
To: Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang,
Morten Brørup
Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
APIs to recycle used mbufs from a transmit queue of an Ethernet device,
and move these mbufs into a mbuf ring for a receive queue of an Ethernet
device. This can bypass the mempool 'put/get' operations and hence save CPU
cycles.
For each burst of recycled mbufs, the rte_eth_recycle_mbufs() function performs
the following operations (a driver-side wiring sketch follows the list):
- Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf
ring.
- Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
from the Tx mbuf ring.
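On the driver side, the two operations above map to the two new fast-path
callbacks plus the new dev_ops entry. A hedged sketch of how a hypothetical
PMD would hook them up (the my_pmd_* names are illustrative assumptions and
not part of this patch):

#include <stdint.h>
#include <rte_common.h>
#include <ethdev_driver.h>

/* Illustrative stubs with the callback signatures added by this patch. */
static uint16_t
my_pmd_recycle_tx_mbufs_reuse(void *txq,
		struct rte_eth_recycle_rxq_info *recycle_rxq_info)
{
	RTE_SET_USED(txq);
	RTE_SET_USED(recycle_rxq_info);
	/* A real PMD copies used Tx mbuf pointers into
	 * recycle_rxq_info->mbuf_ring and returns how many it moved.
	 */
	return 0;
}

static void
my_pmd_recycle_rx_descriptors_refill(void *rxq, uint16_t nb)
{
	RTE_SET_USED(rxq);
	RTE_SET_USED(nb);
	/* A real PMD rewrites 'nb' Rx descriptors with the recycled
	 * mbufs and then bumps the Rx tail register.
	 */
}

static void
my_pmd_recycle_setup(struct rte_eth_dev *dev)
{
	/* These pointers are copied into rte_eth_fp_ops by
	 * eth_dev_fp_ops_setup(); the slow-path counterpart is the
	 * .recycle_rxq_info_get entry in the PMD's eth_dev_ops table,
	 * which serves rte_eth_recycle_rx_queue_info_get().
	 */
	dev->recycle_tx_mbufs_reuse = my_pmd_recycle_tx_mbufs_reuse;
	dev->recycle_rx_descriptors_refill = my_pmd_recycle_rx_descriptors_refill;
}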
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
doc/guides/rel_notes/release_23_11.rst | 15 ++
lib/ethdev/ethdev_driver.h | 10 ++
lib/ethdev/ethdev_private.c | 2 +
lib/ethdev/rte_ethdev.c | 31 +++++
lib/ethdev/rte_ethdev.h | 181 +++++++++++++++++++++++++
lib/ethdev/rte_ethdev_core.h | 23 +++-
lib/ethdev/version.map | 3 +
7 files changed, 259 insertions(+), 6 deletions(-)
diff --git a/doc/guides/rel_notes/release_23_11.rst b/doc/guides/rel_notes/release_23_11.rst
index 4411bb32c1..02ee3867a0 100644
--- a/doc/guides/rel_notes/release_23_11.rst
+++ b/doc/guides/rel_notes/release_23_11.rst
@@ -72,6 +72,13 @@ New Features
Also, make sure to start the actual text at the margin.
=======================================================
+* **Add mbufs recycling support.**
+
+ Added ``rte_eth_recycle_rx_queue_info_get`` and ``rte_eth_recycle_mbufs``
+ APIs which allow the user to copy used mbufs from the Tx mbuf ring
+ into the Rx mbuf ring. This feature supports the case that the Rx Ethernet
+ device is different from the Tx Ethernet device with respective driver
+ callback functions in ``rte_eth_recycle_mbufs``.
Removed Items
-------------
@@ -123,6 +130,14 @@ ABI Changes
Also, make sure to start the actual text at the margin.
=======================================================
+* ethdev: Added ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+ fields to ``rte_eth_dev`` structure.
+
+* ethdev: Structure ``rte_eth_fp_ops`` was affected to add
+ ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+ fields, to move ``rxq`` and ``txq`` fields, to change the size of
+ ``reserved1`` and ``reserved2`` fields.
+
Known Issues
------------
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 980f837ab6..b0c55a8523 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -58,6 +58,10 @@ struct rte_eth_dev {
eth_rx_descriptor_status_t rx_descriptor_status;
/** Check the status of a Tx descriptor */
eth_tx_descriptor_status_t tx_descriptor_status;
+ /** Pointer to PMD transmit mbufs reuse function */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+ /** Pointer to PMD receive descriptors refill function */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
/**
* Device data that is shared between primary and secondary processes
@@ -507,6 +511,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
+typedef void (*eth_recycle_rxq_info_get_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
uint16_t queue_id, struct rte_eth_burst_mode *mode);
@@ -1250,6 +1258,8 @@ struct eth_dev_ops {
eth_rxq_info_get_t rxq_info_get;
/** Retrieve Tx queue information */
eth_txq_info_get_t txq_info_get;
+ /** Retrieve mbufs recycle Rx queue information */
+ eth_recycle_rxq_info_get_t recycle_rxq_info_get;
eth_burst_mode_get_t rx_burst_mode_get; /**< Get Rx burst mode */
eth_burst_mode_get_t tx_burst_mode_get; /**< Get Tx burst mode */
eth_fw_version_get_t fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 14ec8c6ccf..f8ab64f195 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
fpo->rx_queue_count = dev->rx_queue_count;
fpo->rx_descriptor_status = dev->rx_descriptor_status;
fpo->tx_descriptor_status = dev->tx_descriptor_status;
+ fpo->recycle_tx_mbufs_reuse = dev->recycle_tx_mbufs_reuse;
+ fpo->recycle_rx_descriptors_refill = dev->recycle_rx_descriptors_refill;
fpo->rxq.data = dev->data->rx_queues;
fpo->rxq.clbk = (void **)(uintptr_t)dev->post_rx_burst_cbs;
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 0840d2b594..ea89a101a1 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -5876,6 +5876,37 @@ rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
return 0;
}
+int
+rte_eth_recycle_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_dev *dev;
+
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+ dev = &rte_eth_devices[port_id];
+
+ if (queue_id >= dev->data->nb_rx_queues) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u\n", queue_id);
+ return -EINVAL;
+ }
+
+ if (dev->data->rx_queues == NULL ||
+ dev->data->rx_queues[queue_id] == NULL) {
+ RTE_ETHDEV_LOG(ERR,
+ "Rx queue %"PRIu16" of device with port_id=%"
+ PRIu16" has not been setup\n",
+ queue_id, port_id);
+ return -EINVAL;
+ }
+
+ if (*dev->dev_ops->recycle_rxq_info_get == NULL)
+ return -ENOTSUP;
+
+ dev->dev_ops->recycle_rxq_info_get(dev, queue_id, recycle_rxq_info);
+
+ return 0;
+}
+
int
rte_eth_rx_burst_mode_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_burst_mode *mode)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04a2564f22..9dc5749d83 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1820,6 +1820,30 @@ struct rte_eth_txq_info {
uint8_t queue_state; /**< one of RTE_ETH_QUEUE_STATE_*. */
} __rte_cache_min_aligned;
+/**
+ * @warning
+ * @b EXPERIMENTAL: this structure may change without prior notice.
+ *
+ * Ethernet device Rx queue information structure for recycling mbufs.
+ * Used to retrieve Rx queue information when Tx queue reusing mbufs and moving
+ * them into Rx mbuf ring.
+ */
+struct rte_eth_recycle_rxq_info {
+ struct rte_mbuf **mbuf_ring; /**< mbuf ring of Rx queue. */
+ struct rte_mempool *mp; /**< mempool of Rx queue. */
+ uint16_t *refill_head; /**< head of Rx queue refilling mbufs. */
+ uint16_t *receive_tail; /**< tail of Rx queue receiving pkts. */
+ uint16_t mbuf_ring_size; /**< configured number of mbuf ring size. */
+ /**
+ * Requirement on mbuf refilling batch size of Rx mbuf ring.
+ * For some PMD drivers, the number of Rx mbuf ring refilling mbufs
+ * should be aligned with mbuf ring size, in order to simplify
+ * ring wrapping around.
+ * Value 0 means that PMD drivers have no requirement for this.
+ */
+ uint16_t refill_requirement;
+} __rte_cache_min_aligned;
+
/* Generic Burst mode flag definition, values can be ORed. */
/**
@@ -4853,6 +4877,31 @@ int rte_eth_rx_queue_info_get(uint16_t port_id, uint16_t queue_id,
int rte_eth_tx_queue_info_get(uint16_t port_id, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Retrieve information about given port's Rx queue for recycling mbufs.
+ *
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param queue_id
+ * The Rx queue on the Ethernet device for which information
+ * will be retrieved.
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* to be filled.
+ *
+ * @return
+ * - 0: Success
+ * - -ENODEV: If *port_id* is invalid.
+ * - -ENOTSUP: routine is not supported by the device PMD.
+ * - -EINVAL: The queue_id is out of range.
+ */
+__rte_experimental
+int rte_eth_recycle_rx_queue_info_get(uint16_t port_id,
+ uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
/**
* Retrieve information about the Rx packet burst mode.
*
@@ -6527,6 +6576,138 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
}
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Recycle used mbufs from a transmit queue of an Ethernet device, and move
+ * these mbufs into a mbuf ring for a receive queue of an Ethernet device.
+ * This can bypass mempool path to save CPU cycles.
+ *
+ * The rte_eth_recycle_mbufs() function loops, with rte_eth_rx_burst() and
+ * rte_eth_tx_burst() functions, freeing Tx used mbufs and replenishing Rx
+ * descriptors. The number of recycling mbufs depends on the request of Rx mbuf
+ * ring, with the constraint of enough used mbufs from Tx mbuf ring.
+ *
+ * For each recycling mbufs, the rte_eth_recycle_mbufs() function performs the
+ * following operations:
+ *
+ * - Copy used *rte_mbuf* buffer pointers from Tx mbuf ring into Rx mbuf ring.
+ *
+ * - Replenish the Rx descriptors with the recycling *rte_mbuf* mbufs freed
+ * from the Tx mbuf ring.
+ *
+ * This function splits Rx and Tx path with different callback functions. The
+ * callback function recycle_tx_mbufs_reuse is for Tx driver. The callback
+ * function recycle_rx_descriptors_refill is for Rx driver. rte_eth_recycle_mbufs()
+ * can support the case that Rx Ethernet device is different from Tx Ethernet device.
+ *
+ * It is the responsibility of users to select the Rx/Tx queue pair to recycle
+ * mbufs. Before call this function, users must call rte_eth_recycle_rxq_info_get
+ * function to retrieve selected Rx queue information.
+ * @see rte_eth_recycle_rxq_info_get, struct rte_eth_recycle_rxq_info
+ *
+ * Currently, the rte_eth_recycle_mbufs() function can feed 1 Rx queue from
+ * 2 Tx queues in the same thread. Do not pair the Rx queue and Tx queue across
+ * different threads, in order to avoid concurrent memory rewriting errors.
+ *
+ * @param rx_port_id
+ * Port identifying the receive side.
+ * @param rx_queue_id
+ * The index of the receive queue identifying the receive side.
+ * The value must be in the range [0, nb_rx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param tx_port_id
+ * Port identifying the transmit side.
+ * @param tx_queue_id
+ * The index of the transmit queue identifying the transmit side.
+ * The value must be in the range [0, nb_tx_queue - 1] previously supplied
+ * to rte_eth_dev_configure().
+ * @param recycle_rxq_info
+ * A pointer to a structure of type *rte_eth_recycle_rxq_info* which contains
+ * the information of the Rx queue mbuf ring.
+ * @return
+ * The number of recycling mbufs.
+ */
+__rte_experimental
+static inline uint16_t
+rte_eth_recycle_mbufs(uint16_t rx_port_id, uint16_t rx_queue_id,
+ uint16_t tx_port_id, uint16_t tx_queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct rte_eth_fp_ops *p;
+ void *qd;
+ uint16_t nb_mbufs;
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ if (tx_port_id >= RTE_MAX_ETHPORTS ||
+ tx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR,
+ "Invalid tx_port_id=%u or tx_queue_id=%u\n",
+ tx_port_id, tx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[tx_port_id];
+ qd = p->txq.data[tx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_TX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(tx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Tx queue_id=%u for port_id=%u\n",
+ tx_queue_id, tx_port_id);
+ return 0;
+ }
+#endif
+ if (p->recycle_tx_mbufs_reuse == NULL)
+ return 0;
+
+ /* Copy used *rte_mbuf* buffer pointers from Tx mbuf ring
+ * into Rx mbuf ring.
+ */
+ nb_mbufs = p->recycle_tx_mbufs_reuse(qd, recycle_rxq_info);
+
+ /* If no recycling mbufs, return 0. */
+ if (nb_mbufs == 0)
+ return 0;
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ if (rx_port_id >= RTE_MAX_ETHPORTS ||
+ rx_queue_id >= RTE_MAX_QUEUES_PER_PORT) {
+ RTE_ETHDEV_LOG(ERR, "Invalid rx_port_id=%u or rx_queue_id=%u\n",
+ rx_port_id, rx_queue_id);
+ return 0;
+ }
+#endif
+
+ /* fetch pointer to queue data */
+ p = &rte_eth_fp_ops[rx_port_id];
+ qd = p->rxq.data[rx_queue_id];
+
+#ifdef RTE_ETHDEV_DEBUG_RX
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(rx_port_id, 0);
+
+ if (qd == NULL) {
+ RTE_ETHDEV_LOG(ERR, "Invalid Rx queue_id=%u for port_id=%u\n",
+ rx_queue_id, rx_port_id);
+ return 0;
+ }
+#endif
+
+ if (p->recycle_rx_descriptors_refill == NULL)
+ return 0;
+
+ /* Replenish the Rx descriptors with the recycling
+ * into Rx mbuf ring.
+ */
+ p->recycle_rx_descriptors_refill(qd, nb_mbufs);
+
+ return nb_mbufs;
+}
+
/**
* @warning
* @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/ethdev/rte_ethdev_core.h b/lib/ethdev/rte_ethdev_core.h
index 46e9721e07..a24ad7a6b2 100644
--- a/lib/ethdev/rte_ethdev_core.h
+++ b/lib/ethdev/rte_ethdev_core.h
@@ -55,6 +55,13 @@ typedef int (*eth_rx_descriptor_status_t)(void *rxq, uint16_t offset);
/** @internal Check the status of a Tx descriptor */
typedef int (*eth_tx_descriptor_status_t)(void *txq, uint16_t offset);
+/** @internal Copy used mbufs from Tx mbuf ring into Rx mbuf ring */
+typedef uint16_t (*eth_recycle_tx_mbufs_reuse_t)(void *txq,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
+/** @internal Refill Rx descriptors with the recycling mbufs */
+typedef void (*eth_recycle_rx_descriptors_refill_t)(void *rxq, uint16_t nb);
+
/**
* @internal
* Structure used to hold opaque pointers to internal ethdev Rx/Tx
@@ -83,15 +90,17 @@ struct rte_eth_fp_ops {
* Rx fast-path functions and related data.
* 64-bit systems: occupies first 64B line
*/
+ /** Rx queues data. */
+ struct rte_ethdev_qdata rxq;
/** PMD receive function. */
eth_rx_burst_t rx_pkt_burst;
/** Get the number of used Rx descriptors. */
eth_rx_queue_count_t rx_queue_count;
/** Check the status of a Rx descriptor. */
eth_rx_descriptor_status_t rx_descriptor_status;
- /** Rx queues data. */
- struct rte_ethdev_qdata rxq;
- uintptr_t reserved1[3];
+ /** Refill Rx descriptors with the recycling mbufs. */
+ eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
+ uintptr_t reserved1[2];
/**@}*/
/**@{*/
@@ -99,15 +108,17 @@ struct rte_eth_fp_ops {
* Tx fast-path functions and related data.
* 64-bit systems: occupies second 64B line
*/
+ /** Tx queues data. */
+ struct rte_ethdev_qdata txq;
/** PMD transmit function. */
eth_tx_burst_t tx_pkt_burst;
/** PMD transmit prepare function. */
eth_tx_prep_t tx_pkt_prepare;
/** Check the status of a Tx descriptor. */
eth_tx_descriptor_status_t tx_descriptor_status;
- /** Tx queues data. */
- struct rte_ethdev_qdata txq;
- uintptr_t reserved2[3];
+ /** Copy used mbufs from Tx mbuf ring into Rx. */
+ eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+ uintptr_t reserved2[2];
/**@}*/
} __rte_cache_aligned;
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index b965d6aa52..eec159dfdd 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -312,6 +312,9 @@ EXPERIMENTAL {
rte_flow_async_action_list_handle_query_update;
rte_flow_async_actions_update;
rte_flow_restore_info_dynflag;
+
+ # added in 23.11
+ rte_eth_recycle_rx_queue_info_get;
};
INTERNAL {
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
2023-08-22 7:27 ` [PATCH v11 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
2023-08-22 7:27 ` [PATCH v11 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
@ 2023-08-22 7:27 ` Feifei Wang
2023-08-22 23:43 ` Konstantin Ananyev
2023-08-24 6:10 ` Feifei Wang
2023-08-22 7:27 ` [PATCH v11 3/4] net/ixgbe: " Feifei Wang
` (3 subsequent siblings)
5 siblings, 2 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-22 7:27 UTC (permalink / raw)
To: Yuying Zhang, Beilei Xing
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Define the specific function implementations for the i40e driver.
Currently, mbufs recycle mode supports the 128-bit vector path and
the avx2 path, and can be enabled in both fast-free and
no-fast-free mode.
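'Fast free' here refers to the RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE Tx offload.
A hedged sketch of how an application enables it with standard ethdev calls
(the function and parameter names below are illustrative; the offload is only
valid when all mbufs sent on the port come from a single mempool and are not
referenced elsewhere):

#include <rte_ethdev.h>

static int
configure_port_fast_free(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
{
	struct rte_eth_conf port_conf = {0};
	struct rte_eth_dev_info dev_info;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	/* Enable fast free only when the device advertises it. */
	if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
		port_conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;

	return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &port_conf);
}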
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/i40e/i40e_ethdev.c | 1 +
drivers/net/i40e/i40e_ethdev.h | 2 +
.../net/i40e/i40e_recycle_mbufs_vec_common.c | 147 ++++++++++++++++++
drivers/net/i40e/i40e_rxtx.c | 32 ++++
drivers/net/i40e/i40e_rxtx.h | 4 +
drivers/net/i40e/meson.build | 1 +
6 files changed, 187 insertions(+)
create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 8271bbb394..50ba9aac94 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
.flow_ops_get = i40e_dev_flow_ops_get,
.rxq_info_get = i40e_rxq_info_get,
.txq_info_get = i40e_txq_info_get,
+ .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
.rx_burst_mode_get = i40e_rx_burst_mode_get,
.tx_burst_mode_get = i40e_tx_burst_mode_get,
.timesync_enable = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 6f65d5e0ac..af758798e1 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_rxq_info *qinfo);
void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_burst_mode *mode);
int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..5663ecccde
--- /dev/null
+++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "base/i40e_prototype.h"
+#include "base/i40e_type.h"
+#include "i40e_ethdev.h"
+#include "i40e_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+ struct i40e_rx_queue *rxq = rx_queue;
+ struct i40e_rx_entry *rxep;
+ volatile union i40e_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64_t paddr;
+ uint64_t dma_addr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_mbufs; i++) {
+ /* Initialize rxdp descs. */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = rte_cpu_to_le_64(paddr);
+ /* flush desc with pa dma_addr */
+ rxdp[i].read.hdr_addr = 0;
+ rxdp[i].read.pkt_addr = dma_addr;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_mbufs;
+ rx_id = rxq->rxrearm_start - 1;
+
+ if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+ rxq->rxrearm_start = 0;
+ rx_id = rxq->nb_rx_desc - 1;
+ }
+
+ rxq->rxrearm_nb -= nb_mbufs;
+
+ rte_io_wmb();
+ /* Update the tail pointer on the NIC */
+ I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
+uint16_t
+i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_tx_queue *txq = tx_queue;
+ struct i40e_tx_entry *txep;
+ struct rte_mbuf **rxep;
+ int i, n;
+ uint16_t nb_recycle_mbufs;
+ uint16_t avail = 0;
+ uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+ uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+ uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+ uint16_t refill_head = *recycle_rxq_info->refill_head;
+ uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+ /* Get available recycling Rx buffers. */
+ avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+ /* Check Tx free thresh and Rx available space. */
+ if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
+ rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
+ rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+ nb_recycle_mbufs = n;
+
+ /* Mbufs recycle mode can only support no ring buffer wrapping around.
+ * Two case for this:
+ *
+ * case 1: The refill head of Rx buffer ring needs to be aligned with
+ * mbuf ring size. In this case, the number of Tx freeing buffers
+ * should be equal to refill_requirement.
+ *
+ * case 2: The refill head of Rx ring buffer does not need to be aligned
+ * with mbuf ring size. In this case, the update of refill head can not
+ * exceed the Rx mbuf ring size.
+ */
+ if (refill_requirement != n ||
+ (!refill_requirement && (refill_head + n > mbuf_ring_size)))
+ return 0;
+
+ /* First buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1).
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = recycle_rxq_info->mbuf_ring;
+ rxep += refill_head;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* Avoid txq contains buffers from unexpected mempool. */
+ if (unlikely(recycle_rxq_info->mp
+ != txep[0].mbuf->pool))
+ return 0;
+
+ /* Directly put mbufs from Tx to Rx. */
+ for (i = 0; i < n; i++)
+ rxep[i] = txep[i].mbuf;
+ } else {
+ for (i = 0; i < n; i++) {
+ rxep[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+ /* If Tx buffers are not the last reference or from
+ * unexpected mempool, previous copied buffers are
+ * considered as invalid.
+ */
+ if (unlikely((rxep[i] == NULL && refill_requirement) ||
+ recycle_rxq_info->mp != txep[i].mbuf->pool))
+ nb_recycle_mbufs = 0;
+ }
+ /* If Tx buffers are not the last reference or
+ * from unexpected mempool, all recycled buffers
+ * are put into mempool.
+ */
+ if (nb_recycle_mbufs == 0)
+ for (i = 0; i < n; i++) {
+ if (rxep[i] != NULL)
+ rte_mempool_put(rxep[i]->pool, rxep[i]);
+ }
+ }
+
+ /* Update counters for Tx. */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return nb_recycle_mbufs;
+}
diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index b4f65b58fa..a9c9eb331c 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -3199,6 +3199,30 @@ i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.offloads = txq->offloads;
}
+void
+i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct i40e_rx_queue *rxq;
+ struct i40e_adapter *ad =
+ I40E_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+ recycle_rxq_info->mp = rxq->mp;
+ recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+ recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+ if (ad->rx_vec_allowed) {
+ recycle_rxq_info->refill_requirement = RTE_I40E_RXQ_REARM_THRESH;
+ recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+ } else {
+ recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+ recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+ }
+}
+
#ifdef RTE_ARCH_X86
static inline bool
get_avx_supported(bool request_avx512)
@@ -3293,6 +3317,8 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
dev->rx_pkt_burst = ad->rx_use_avx2 ?
i40e_recv_scattered_pkts_vec_avx2 :
i40e_recv_scattered_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
}
} else {
if (ad->rx_use_avx512) {
@@ -3311,9 +3337,12 @@ i40e_set_rx_function(struct rte_eth_dev *dev)
dev->rx_pkt_burst = ad->rx_use_avx2 ?
i40e_recv_pkts_vec_avx2 :
i40e_recv_pkts_vec;
+ dev->recycle_rx_descriptors_refill =
+ i40e_recycle_rx_descriptors_refill_vec;
}
}
#else /* RTE_ARCH_X86 */
+ dev->recycle_rx_descriptors_refill = i40e_recycle_rx_descriptors_refill_vec;
if (dev->data->scattered_rx) {
PMD_INIT_LOG(DEBUG,
"Using Vector Scattered Rx (port %d).",
@@ -3481,15 +3510,18 @@ i40e_set_tx_function(struct rte_eth_dev *dev)
dev->tx_pkt_burst = ad->tx_use_avx2 ?
i40e_xmit_pkts_vec_avx2 :
i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
}
#else /* RTE_ARCH_X86 */
PMD_INIT_LOG(DEBUG, "Using Vector Tx (port %d).",
dev->data->port_id);
dev->tx_pkt_burst = i40e_xmit_pkts_vec;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
#endif /* RTE_ARCH_X86 */
} else {
PMD_INIT_LOG(DEBUG, "Simple tx finally be used.");
dev->tx_pkt_burst = i40e_xmit_pkts_simple;
+ dev->recycle_tx_mbufs_reuse = i40e_recycle_tx_mbufs_reuse_vec;
}
dev->tx_pkt_prepare = i40e_simple_prep_pkts;
} else {
diff --git a/drivers/net/i40e/i40e_rxtx.h b/drivers/net/i40e/i40e_rxtx.h
index a8686224e5..b191f23e1f 100644
--- a/drivers/net/i40e/i40e_rxtx.h
+++ b/drivers/net/i40e/i40e_rxtx.h
@@ -236,6 +236,10 @@ uint32_t i40e_dev_rx_queue_count(void *rx_queue);
int i40e_dev_rx_descriptor_status(void *rx_queue, uint16_t offset);
int i40e_dev_tx_descriptor_status(void *tx_queue, uint16_t offset);
+uint16_t i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+void i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs);
+
uint16_t i40e_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
uint16_t nb_pkts);
uint16_t i40e_recv_scattered_pkts_vec(void *rx_queue,
diff --git a/drivers/net/i40e/meson.build b/drivers/net/i40e/meson.build
index 8e53b87a65..3b1a233c84 100644
--- a/drivers/net/i40e/meson.build
+++ b/drivers/net/i40e/meson.build
@@ -34,6 +34,7 @@ sources = files(
'i40e_tm.c',
'i40e_hash.c',
'i40e_vf_representor.c',
+ 'i40e_recycle_mbufs_vec_common.c',
'rte_pmd_i40e.c',
)
--
2.25.1
^ permalink raw reply [flat|nested] 145+ messages in thread
* [PATCH v11 3/4] net/ixgbe: implement mbufs recycle mode
2023-08-22 7:27 ` [PATCH v11 0/4] Recycle mbufs from Tx queue into Rx queue Feifei Wang
2023-08-22 7:27 ` [PATCH v11 1/4] ethdev: add API for mbufs recycle mode Feifei Wang
2023-08-22 7:27 ` [PATCH v11 2/4] net/i40e: implement " Feifei Wang
@ 2023-08-22 7:27 ` Feifei Wang
2023-08-22 7:27 ` [PATCH v11 4/4] app/testpmd: add recycle mbufs engine Feifei Wang
` (2 subsequent siblings)
5 siblings, 0 replies; 145+ messages in thread
From: Feifei Wang @ 2023-08-22 7:27 UTC (permalink / raw)
To: Qiming Yang, Wenjun Wu
Cc: dev, nd, Feifei Wang, Honnappa Nagarahalli, Ruifeng Wang
Define the specific function implementations for the ixgbe driver.
Currently, mbufs recycle mode supports the 128-bit
vector path, and can be enabled in both fast-free and
no-fast-free mode.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
drivers/net/ixgbe/ixgbe_ethdev.c | 1 +
drivers/net/ixgbe/ixgbe_ethdev.h | 3 +
.../ixgbe/ixgbe_recycle_mbufs_vec_common.c | 143 ++++++++++++++++++
drivers/net/ixgbe/ixgbe_rxtx.c | 37 ++++-
drivers/net/ixgbe/ixgbe_rxtx.h | 4 +
drivers/net/ixgbe/meson.build | 2 +
6 files changed, 188 insertions(+), 2 deletions(-)
create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 14a7d571e0..ea4c9dd561 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -543,6 +543,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
.set_mc_addr_list = ixgbe_dev_set_mc_addr_list,
.rxq_info_get = ixgbe_rxq_info_get,
.txq_info_get = ixgbe_txq_info_get,
+ .recycle_rxq_info_get = ixgbe_recycle_rxq_info_get,
.timesync_enable = ixgbe_timesync_enable,
.timesync_disable = ixgbe_timesync_disable,
.timesync_read_rx_timestamp = ixgbe_timesync_read_rx_timestamp,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 1291e9099c..22fc3be3d8 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -626,6 +626,9 @@ void ixgbe_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
void ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
int ixgbevf_dev_rx_init(struct rte_eth_dev *dev);
void ixgbevf_dev_tx_init(struct rte_eth_dev *dev);
diff --git a/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
new file mode 100644
index 0000000000..9a8cc86954
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include <stdint.h>
+#include <ethdev_driver.h>
+
+#include "ixgbe_ethdev.h"
+#include "ixgbe_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+ struct ixgbe_rx_queue *rxq = rx_queue;
+ struct ixgbe_rx_entry *rxep;
+ volatile union ixgbe_adv_rx_desc *rxdp;
+ uint16_t rx_id;
+ uint64_t paddr;
+ uint64_t dma_addr;
+ uint16_t i;
+
+ rxdp = rxq->rx_ring + rxq->rxrearm_start;
+ rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+ for (i = 0; i < nb_mbufs; i++) {
+ /* Initialize rxdp descs. */
+ paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+ dma_addr = rte_cpu_to_le_64(paddr);
+ /* Flush descriptors with pa dma_addr */
+ rxdp[i].read.hdr_addr = 0;
+ rxdp[i].read.pkt_addr = dma_addr;
+ }
+
+ /* Update the descriptor initializer index */
+ rxq->rxrearm_start += nb_mbufs;
+ if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+ rxq->rxrearm_start = 0;
+
+ rxq->rxrearm_nb -= nb_mbufs;
+
+ rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+ (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+ /* Update the tail pointer on the NIC */
+ IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
+}
+
+uint16_t
+ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct ixgbe_tx_queue *txq = tx_queue;
+ struct ixgbe_tx_entry *txep;
+ struct rte_mbuf **rxep;
+ int i, n;
+ uint32_t status;
+ uint16_t nb_recycle_mbufs;
+ uint16_t avail = 0;
+ uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+ uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+ uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+ uint16_t refill_head = *recycle_rxq_info->refill_head;
+ uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+ /* Get available recycling Rx buffers. */
+ avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+ /* Check Tx free thresh and Rx available space. */
+ if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+ return 0;
+
+ /* check DD bits on threshold descriptor */
+ status = txq->tx_ring[txq->tx_next_dd].wb.status;
+ if (!(status & IXGBE_ADVTXD_STAT_DD))
+ return 0;
+
+ n = txq->tx_rs_thresh;
+ nb_recycle_mbufs = n;
+
+ /* Mbufs recycle can only support no ring buffer wrapping around.
+ * Two case for this:
+ *
+ * case 1: The refill head of Rx buffer ring needs to be aligned with
+ * buffer ring size. In this case, the number of Tx freeing buffers
+ * should be equal to refill_requirement.
+ *
+ * case 2: The refill head of Rx ring buffer does not need to be aligned
+ * with buffer ring size. In this case, the update of refill head can not
+ * exceed the Rx buffer ring size.
+ */
+ if (refill_requirement != n ||
+ (!refill_requirement && (refill_head + n > mbuf_ring_size)))
+ return 0;
+
+ /* First buffer to free from S/W ring is at index
+ * tx_next_dd - (tx_rs_thresh-1).
+ */
+ txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+ rxep = recycle_rxq_info->mbuf_ring;
+ rxep += refill_head;
+
+ if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
+ /* Avoid txq contains buffers from unexpected mempool. */
+ if (unlikely(recycle_rxq_info->mp
+ != txep[0].mbuf->pool))
+ return 0;
+
+ /* Directly put mbufs from Tx to Rx. */
+ for (i = 0; i < n; i++)
+ rxep[i] = txep[i].mbuf;
+ } else {
+ for (i = 0; i < n; i++) {
+ rxep[i] = rte_pktmbuf_prefree_seg(txep[i].mbuf);
+
+ /* If Tx buffers are not the last reference or from
+ * unexpected mempool, previous copied buffers are
+ * considered as invalid.
+ */
+ if (unlikely((rxep[i] == NULL && refill_requirement) ||
+ recycle_rxq_info->mp != txep[i].mbuf->pool))
+ nb_recycle_mbufs = 0;
+ }
+ /* If Tx buffers are not the last reference or
+ * from unexpected mempool, all recycled buffers
+ * are put into mempool.
+ */
+ if (nb_recycle_mbufs == 0)
+ for (i = 0; i < n; i++) {
+ if (rxep[i] != NULL)
+ rte_mempool_put(rxep[i]->pool, rxep[i]);
+ }
+ }
+
+ /* Update counters for Tx. */
+ txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
+ txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
+ if (txq->tx_next_dd >= txq->nb_tx_desc)
+ txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
+
+ return nb_recycle_mbufs;
+}
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 954ef241a0..90b0a7004f 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2552,6 +2552,9 @@ ixgbe_set_tx_function(struct rte_eth_dev *dev, struct ixgbe_tx_queue *txq)
(rte_eal_process_type() != RTE_PROC_PRIMARY ||
ixgbe_txq_vec_setup(txq) == 0)) {
PMD_INIT_LOG(DEBUG, "Vector tx enabled.");
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_tx_mbufs_reuse = ixgbe_recycle_tx_mbufs_reuse_vec;
+#endif
dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
} else
dev->tx_pkt_burst = ixgbe_xmit_pkts_simple;
@@ -4890,7 +4893,10 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
PMD_INIT_LOG(DEBUG, "Using Vector Scattered Rx "
"callback (port=%d).",
dev->data->port_id);
-
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_rx_descriptors_refill =
+ ixgbe_recycle_rx_descriptors_refill_vec;
+#endif
dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
} else if (adapter->rx_bulk_alloc_allowed) {
PMD_INIT_LOG(DEBUG, "Using a Scattered with bulk "
@@ -4919,7 +4925,9 @@ ixgbe_set_rx_function(struct rte_eth_dev *dev)
"burst size no less than %d (port=%d).",
RTE_IXGBE_DESCS_PER_LOOP,
dev->data->port_id);
-
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ dev->recycle_rx_descriptors_refill = ixgbe_recycle_rx_descriptors_refill_vec;
+#endif
dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
} else if (adapter->rx_bulk_alloc_allowed) {
PMD_INIT_LOG(DEBUG, "Rx Burst Bulk Alloc Preconditions are "
@@ -5691,6 +5699,31 @@ ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
qinfo->conf.tx_deferred_start = txq->tx_deferred_start;
}
+void
+ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+ struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+ struct ixgbe_rx_queue *rxq;
+ struct ixgbe_adapter *adapter = dev->data->dev_private;
+
+ rxq = dev->data->rx_queues[queue_id];
+
+ recycle_rxq_info->mbuf_ring = (void *)rxq->sw_ring;
+ recycle_rxq_info->mp = rxq->mb_pool;
+ recycle_rxq_info->mbuf_ring_size = rxq->nb_rx_desc;
+ recycle_rxq_info->receive_tail = &rxq->rx_tail;
+
+ if (adapter->rx_vec_allowed) {
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM)
+ recycle_rxq_info->refill_requirement = RTE_IXGBE_RXQ_REARM_THRESH;
+ recycle_rxq_info->refill_head = &rxq->rxrearm_start;
+#endif
+ } else {
+ recycle_rxq_info->refill_requirement = rxq->rx_free_thresh;
+ recycle_rxq_info->refill_head = &rxq->rx_free_trigger;
+ }
+}
+
/*
* [VF] Initializes Receive Unit.
*/