patches for DPDK stable branches
* [PATCH 0/5] net/hns3: some performance optimizations
@ 2023-07-11 10:24 Dongdong Liu
  2023-07-11 10:24 ` [PATCH 1/5] net/hns3: fix incorrect index to look up table in NEON Rx Dongdong Liu
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Dongdong Liu @ 2023-07-11 10:24 UTC (permalink / raw)
  To: dev, ferruh.yigit, thomas, andrew.rybchenko; +Cc: stable

This patchset is to do some performance optimizations for hns3.

Huisong Li (5):
  net/hns3: fix incorrect index to look up table in NEON Rx
  net/hns3: fix the order of NEON Rx code
  net/hns3: optimize free mbuf code for SVE Tx
  net/hns3: optimize the rearm mbuf function for SVE Rx
  net/hns3: optimize SVE Rx performance

 drivers/net/hns3/hns3_rxtx_vec.c      |  51 ------
 drivers/net/hns3/hns3_rxtx_vec.h      |  51 ++++++
 drivers/net/hns3/hns3_rxtx_vec_neon.h |  82 ++++-----
 drivers/net/hns3/hns3_rxtx_vec_sve.c  | 230 ++++----------------------
 4 files changed, 114 insertions(+), 300 deletions(-)

--
2.22.0


* [PATCH 1/5] net/hns3: fix incorrect index to look up table in NEON Rx
  2023-07-11 10:24 [PATCH 0/5] net/hns3: some performance optimizations Dongdong Liu
@ 2023-07-11 10:24 ` Dongdong Liu
  2023-07-11 12:58   ` Ferruh Yigit
  2023-07-11 10:24 ` [PATCH 2/5] net/hns3: fix the order of NEON Rx code Dongdong Liu
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Dongdong Liu @ 2023-07-11 10:24 UTC (permalink / raw)
  To: dev, ferruh.yigit, thomas, andrew.rybchenko; +Cc: stable

From: Huisong Li <lihuisong@huawei.com>

In hns3_recv_burst_vec(), the indexes used to extract the packet
length and the data size are reversed. Fortunately, this doesn't
affect functionality, because NEON Rx only supports a single BD,
in which the packet length is equal to the data size. This patch
swaps the indexes back to the correct order.
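
For illustration, a rough scalar equivalent of the two lanes the
corrected mask moves (the byte offsets 20-21 and 22-23 are taken
from the mask below; the helper name is made up for this sketch and
is not driver code):

#include <stdint.h>
#include <rte_mbuf.h>

/* Scalar sketch: the 32 bytes of descriptor data fed to
 * vqtbl2q_u8() hold rx.pkt_len at bytes 20-21 and the data size at
 * bytes 22-23, both little-endian. */
static inline void
desc_to_len_fields(const uint8_t desc[32], struct rte_mbuf *m)
{
	m->pkt_len  = (uint32_t)(desc[20] | (desc[21] << 8)); /* rx.pkt_len */
	m->data_len = (uint16_t)(desc[22] | (desc[23] << 8)); /* size */
}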

Fixes: a3d4f4d291d7 ("net/hns3: support NEON Rx")
Cc: stable@dpdk.org

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
---
 drivers/net/hns3/hns3_rxtx_vec_neon.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hns3/hns3_rxtx_vec_neon.h b/drivers/net/hns3/hns3_rxtx_vec_neon.h
index 6c49c70fc7..564d831a48 100644
--- a/drivers/net/hns3/hns3_rxtx_vec_neon.h
+++ b/drivers/net/hns3/hns3_rxtx_vec_neon.h
@@ -142,8 +142,8 @@ hns3_recv_burst_vec(struct hns3_rx_queue *__restrict rxq,
 	/* mask to shuffle from desc to mbuf's rx_descriptor_fields1 */
 	uint8x16_t shuf_desc_fields_msk = {
 		0xff, 0xff, 0xff, 0xff,  /* packet type init zero */
-		22, 23, 0xff, 0xff,      /* rx.pkt_len to rte_mbuf.pkt_len */
-		20, 21,	                 /* size to rte_mbuf.data_len */
+		20, 21, 0xff, 0xff,      /* rx.pkt_len to rte_mbuf.pkt_len */
+		22, 23,	                 /* size to rte_mbuf.data_len */
 		0xff, 0xff,	         /* rte_mbuf.vlan_tci init zero */
 		8, 9, 10, 11,	         /* rx.rss_hash to rte_mbuf.hash.rss */
 	};
-- 
2.22.0


* [PATCH 2/5] net/hns3: fix the order of NEON Rx code
  2023-07-11 10:24 [PATCH 0/5] net/hns3: some performance optimizations Dongdong Liu
  2023-07-11 10:24 ` [PATCH 1/5] net/hns3: fix incorrect index to look up table in NEON Rx Dongdong Liu
@ 2023-07-11 10:24 ` Dongdong Liu
  2023-07-11 10:24 ` [PATCH 3/5] net/hns3: optimize free mbuf code for SVE Tx Dongdong Liu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Dongdong Liu @ 2023-07-11 10:24 UTC (permalink / raw)
  To: dev, ferruh.yigit, thomas, andrew.rybchenko; +Cc: stable

From: Huisong Li <lihuisong@huawei.com>

This patch reorders the NEON Rx code for better maintainability
and easier understanding.
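
After this change, each loop iteration handles all four descriptors
at every step in a straight line; roughly (an outline of the new
flow, not literal code):

/* per iteration over HNS3_DEFAULT_DESCS_PER_LOOP (4) descriptors:
 * 1. gather the four VLD bits and compute bd_valid_num
 * 2. load the four mbuf pointers and store them into rx_pkts
 * 3. read the four descriptors (offset by bd_valid_num)
 * 4. shuffle each descriptor into the mbuf field layout
 * 5. subtract the CRC length from pkt_len/data_len
 * 6. store rx_descriptor_fields1, then rearm_data, for all four
 */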

Fixes: a3d4f4d291d7 ("net/hns3: support NEON Rx")
Cc: stable@dpdk.org

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
---
 drivers/net/hns3/hns3_rxtx_vec_neon.h | 78 +++++++++++----------------
 1 file changed, 31 insertions(+), 47 deletions(-)

diff --git a/drivers/net/hns3/hns3_rxtx_vec_neon.h b/drivers/net/hns3/hns3_rxtx_vec_neon.h
index 564d831a48..0dc6b9f0a2 100644
--- a/drivers/net/hns3/hns3_rxtx_vec_neon.h
+++ b/drivers/net/hns3/hns3_rxtx_vec_neon.h
@@ -180,19 +180,12 @@ hns3_recv_burst_vec(struct hns3_rx_queue *__restrict rxq,
 		bd_vld = vset_lane_u16(rxdp[2].rx.bdtype_vld_udp0, bd_vld, 2);
 		bd_vld = vset_lane_u16(rxdp[3].rx.bdtype_vld_udp0, bd_vld, 3);
 
-		/* load 2 mbuf pointer */
-		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
-
 		bd_vld = vshl_n_u16(bd_vld,
 				    HNS3_UINT16_BIT - 1 - HNS3_RXD_VLD_B);
 		bd_vld = vreinterpret_u16_s16(
 				vshr_n_s16(vreinterpret_s16_u16(bd_vld),
 					   HNS3_UINT16_BIT - 1));
 		stat = ~vget_lane_u64(vreinterpret_u64_u16(bd_vld), 0);
-
-		/* load 2 mbuf pointer again */
-		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
-
 		if (likely(stat == 0))
 			bd_valid_num = HNS3_DEFAULT_DESCS_PER_LOOP;
 		else
@@ -200,20 +193,20 @@ hns3_recv_burst_vec(struct hns3_rx_queue *__restrict rxq,
 		if (bd_valid_num == 0)
 			break;
 
-		/* use offset to control below data load oper ordering */
-		offset = rxq->offset_table[bd_valid_num];
+		/* load 4 mbuf pointer */
+		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
+		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
 
-		/* store 2 mbuf pointer into rx_pkts */
+		/* store 4 mbuf pointer into rx_pkts */
 		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
+		vst1q_u64((uint64_t *)&rx_pkts[pos + 2], mbp2);
 
-		/* read first two descs */
+		/* use offset to control below data load oper ordering */
+		offset = rxq->offset_table[bd_valid_num];
+
+		/* read 4 descs */
 		descs[0] = vld2q_u64((uint64_t *)(rxdp + offset));
 		descs[1] = vld2q_u64((uint64_t *)(rxdp + offset + 1));
-
-		/* store 2 mbuf pointer into rx_pkts again */
-		vst1q_u64((uint64_t *)&rx_pkts[pos + 2], mbp2);
-
-		/* read remains two descs */
 		descs[2] = vld2q_u64((uint64_t *)(rxdp + offset + 2));
 		descs[3] = vld2q_u64((uint64_t *)(rxdp + offset + 3));
 
@@ -221,56 +214,47 @@ hns3_recv_burst_vec(struct hns3_rx_queue *__restrict rxq,
 		pkt_mbuf1.val[1] = vreinterpretq_u8_u64(descs[0].val[1]);
 		pkt_mbuf2.val[0] = vreinterpretq_u8_u64(descs[1].val[0]);
 		pkt_mbuf2.val[1] = vreinterpretq_u8_u64(descs[1].val[1]);
+		pkt_mbuf3.val[0] = vreinterpretq_u8_u64(descs[2].val[0]);
+		pkt_mbuf3.val[1] = vreinterpretq_u8_u64(descs[2].val[1]);
+		pkt_mbuf4.val[0] = vreinterpretq_u8_u64(descs[3].val[0]);
+		pkt_mbuf4.val[1] = vreinterpretq_u8_u64(descs[3].val[1]);
 
-		/* pkt 1,2 convert format from desc to pktmbuf */
+		/* 4 packets convert format from desc to pktmbuf */
 		pkt_mb1 = vqtbl2q_u8(pkt_mbuf1, shuf_desc_fields_msk);
 		pkt_mb2 = vqtbl2q_u8(pkt_mbuf2, shuf_desc_fields_msk);
+		pkt_mb3 = vqtbl2q_u8(pkt_mbuf3, shuf_desc_fields_msk);
+		pkt_mb4 = vqtbl2q_u8(pkt_mbuf4, shuf_desc_fields_msk);
 
-		/* store the first 8 bytes of pkt 1,2 mbuf's rearm_data */
-		*(uint64_t *)&sw_ring[pos + 0].mbuf->rearm_data =
-			rxq->mbuf_initializer;
-		*(uint64_t *)&sw_ring[pos + 1].mbuf->rearm_data =
-			rxq->mbuf_initializer;
-
-		/* pkt 1,2 remove crc */
+		/* 4 packets remove crc */
 		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb1), crc_adjust);
 		pkt_mb1 = vreinterpretq_u8_u16(tmp);
 		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb2), crc_adjust);
 		pkt_mb2 = vreinterpretq_u8_u16(tmp);
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb3), crc_adjust);
+		pkt_mb3 = vreinterpretq_u8_u16(tmp);
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb4), crc_adjust);
+		pkt_mb4 = vreinterpretq_u8_u16(tmp);
 
-		pkt_mbuf3.val[0] = vreinterpretq_u8_u64(descs[2].val[0]);
-		pkt_mbuf3.val[1] = vreinterpretq_u8_u64(descs[2].val[1]);
-		pkt_mbuf4.val[0] = vreinterpretq_u8_u64(descs[3].val[0]);
-		pkt_mbuf4.val[1] = vreinterpretq_u8_u64(descs[3].val[1]);
-
-		/* pkt 3,4 convert format from desc to pktmbuf */
-		pkt_mb3 = vqtbl2q_u8(pkt_mbuf3, shuf_desc_fields_msk);
-		pkt_mb4 = vqtbl2q_u8(pkt_mbuf4, shuf_desc_fields_msk);
-
-		/* pkt 1,2 save to rx_pkts mbuf */
+		/* save packet info to rx_pkts mbuf */
 		vst1q_u8((void *)&sw_ring[pos + 0].mbuf->rx_descriptor_fields1,
 			 pkt_mb1);
 		vst1q_u8((void *)&sw_ring[pos + 1].mbuf->rx_descriptor_fields1,
 			 pkt_mb2);
+		vst1q_u8((void *)&sw_ring[pos + 2].mbuf->rx_descriptor_fields1,
+			 pkt_mb3);
+		vst1q_u8((void *)&sw_ring[pos + 3].mbuf->rx_descriptor_fields1,
+			 pkt_mb4);
 
-		/* pkt 3,4 remove crc */
-		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb3), crc_adjust);
-		pkt_mb3 = vreinterpretq_u8_u16(tmp);
-		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb4), crc_adjust);
-		pkt_mb4 = vreinterpretq_u8_u16(tmp);
-
-		/* store the first 8 bytes of pkt 3,4 mbuf's rearm_data */
+		/* store the first 8 bytes of packets mbuf's rearm_data */
+		*(uint64_t *)&sw_ring[pos + 0].mbuf->rearm_data =
+			rxq->mbuf_initializer;
+		*(uint64_t *)&sw_ring[pos + 1].mbuf->rearm_data =
+			rxq->mbuf_initializer;
 		*(uint64_t *)&sw_ring[pos + 2].mbuf->rearm_data =
 			rxq->mbuf_initializer;
 		*(uint64_t *)&sw_ring[pos + 3].mbuf->rearm_data =
 			rxq->mbuf_initializer;
 
-		/* pkt 3,4 save to rx_pkts mbuf */
-		vst1q_u8((void *)&sw_ring[pos + 2].mbuf->rx_descriptor_fields1,
-			 pkt_mb3);
-		vst1q_u8((void *)&sw_ring[pos + 3].mbuf->rx_descriptor_fields1,
-			 pkt_mb4);
-
 		rte_prefetch_non_temporal(rxdp + HNS3_DEFAULT_DESCS_PER_LOOP);
 
 		parse_retcode = hns3_desc_parse_field(rxq, &sw_ring[pos],
-- 
2.22.0


* [PATCH 3/5] net/hns3: optimize free mbuf code for SVE Tx
  2023-07-11 10:24 [PATCH 0/5] net/hns3: some performance optimizations Dongdong Liu
  2023-07-11 10:24 ` [PATCH 1/5] net/hns3: fix incorrect index to look up table in NEON Rx Dongdong Liu
  2023-07-11 10:24 ` [PATCH 2/5] net/hns3: fix the order of NEON Rx code Dongdong Liu
@ 2023-07-11 10:24 ` Dongdong Liu
  2023-09-25 14:21   ` Ferruh Yigit
  2023-07-11 10:24 ` [PATCH 4/5] net/hns3: optimize the rearm mbuf function for SVE Rx Dongdong Liu
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Dongdong Liu @ 2023-07-11 10:24 UTC (permalink / raw)
  To: dev, ferruh.yigit, thomas, andrew.rybchenko; +Cc: stable

From: Huisong Li <lihuisong@huawei.com>

Currently, hns3 SVE Tx checks the valid bits of all descriptors
in a batch and then determines whether to release the corresponding
mbufs. Actually, once the valid bit of any descriptor in a batch
isn't cleared, the driver does not need to scan the rest of the
descriptors.

Optimizing the SVE algorithm to exit early in this way improves the
performance of a single queue for 64B packets by ~2% in txonly
forwarding mode, while scanning the descriptors in plain C code
improves it by ~8%.

So this patch uses C code for this function to improve SVE Tx
performance.
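
A minimal sketch of the early-exit C scan described above, assuming
the BD layout used by the removed SVE code (a 32-bit word holding
the VLD bit at byte offset 28 of each 32-byte BD, per the
BD_FIELD_VALID_OFFSET/BD_SIZE defines); the function name is made up
for this sketch:

#include <stdint.h>
#include <rte_byteorder.h>

static inline void
hns3_tx_free_buffers_sketch(struct hns3_tx_queue *txq)
{
	const uint8_t *bd = (const uint8_t *)&txq->tx_ring[txq->next_to_clean];
	uint16_t i;

	/* Stop at the first BD whose VLD bit is still set: hardware has
	 * not finished it yet, so no mbufs can be released this round. */
	for (i = 0; i < txq->tx_rs_thresh; i++, bd += BD_SIZE) {
		uint32_t w = rte_le_to_cpu_32(
			*(const volatile uint32_t *)(bd + BD_FIELD_VALID_OFFSET));
		if (w & (1U << HNS3_TXD_VLD_B))
			return;
	}
	hns3_tx_bulk_free_buffers(txq);
}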

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
---
 drivers/net/hns3/hns3_rxtx_vec_sve.c | 42 +---------------------------
 1 file changed, 1 insertion(+), 41 deletions(-)

diff --git a/drivers/net/hns3/hns3_rxtx_vec_sve.c b/drivers/net/hns3/hns3_rxtx_vec_sve.c
index 8bfc3de049..5011544e07 100644
--- a/drivers/net/hns3/hns3_rxtx_vec_sve.c
+++ b/drivers/net/hns3/hns3_rxtx_vec_sve.c
@@ -337,46 +337,6 @@ hns3_recv_pkts_vec_sve(void *__restrict rx_queue,
 	return nb_rx;
 }
 
-static inline void
-hns3_tx_free_buffers_sve(struct hns3_tx_queue *txq)
-{
-#define HNS3_SVE_CHECK_DESCS_PER_LOOP	8
-#define TX_VLD_U8_ZIP_INDEX		svindex_u8(0, 4)
-	svbool_t pg32 = svwhilelt_b32(0, HNS3_SVE_CHECK_DESCS_PER_LOOP);
-	svuint32_t vld, vld2;
-	svuint8_t vld_u8;
-	uint64_t vld_all;
-	struct hns3_desc *tx_desc;
-	int i;
-
-	/*
-	 * All mbufs can be released only when the VLD bits of all
-	 * descriptors in a batch are cleared.
-	 */
-	/* do logical OR operation for all desc's valid field */
-	vld = svdup_n_u32(0);
-	tx_desc = &txq->tx_ring[txq->next_to_clean];
-	for (i = 0; i < txq->tx_rs_thresh; i += HNS3_SVE_CHECK_DESCS_PER_LOOP,
-				tx_desc += HNS3_SVE_CHECK_DESCS_PER_LOOP) {
-		vld2 = svld1_gather_u32offset_u32(pg32, (uint32_t *)tx_desc,
-				svindex_u32(BD_FIELD_VALID_OFFSET, BD_SIZE));
-		vld = svorr_u32_z(pg32, vld, vld2);
-	}
-	/* shift left and then right to get all valid bit */
-	vld = svlsl_n_u32_z(pg32, vld,
-			    HNS3_UINT32_BIT - 1 - HNS3_TXD_VLD_B);
-	vld = svreinterpret_u32_s32(svasr_n_s32_z(pg32,
-		svreinterpret_s32_u32(vld), HNS3_UINT32_BIT - 1));
-	/* use tbl to compress 32bit-lane to 8bit-lane */
-	vld_u8 = svtbl_u8(svreinterpret_u8_u32(vld), TX_VLD_U8_ZIP_INDEX);
-	/* dump compressed 64bit to variable */
-	svst1_u64(PG64_64BIT, &vld_all, svreinterpret_u64_u8(vld_u8));
-	if (vld_all > 0)
-		return;
-
-	hns3_tx_bulk_free_buffers(txq);
-}
-
 static inline void
 hns3_tx_fill_hw_ring_sve(struct hns3_tx_queue *txq,
 			 struct rte_mbuf **pkts,
@@ -462,7 +422,7 @@ hns3_xmit_fixed_burst_vec_sve(void *__restrict tx_queue,
 	uint16_t nb_tx = 0;
 
 	if (txq->tx_bd_ready < txq->tx_free_thresh)
-		hns3_tx_free_buffers_sve(txq);
+		hns3_tx_free_buffers(txq);
 
 	nb_pkts = RTE_MIN(txq->tx_bd_ready, nb_pkts);
 	if (unlikely(nb_pkts == 0)) {
-- 
2.22.0


* [PATCH 4/5] net/hns3: optimize the rearm mbuf function for SVE Rx
  2023-07-11 10:24 [PATCH 0/5] net/hns3: some performance optimizations Dongdong Liu
                   ` (2 preceding siblings ...)
  2023-07-11 10:24 ` [PATCH 3/5] net/hns3: optimize free mbuf code for SVE Tx Dongdong Liu
@ 2023-07-11 10:24 ` Dongdong Liu
  2023-07-11 10:24 ` [PATCH 5/5] net/hns3: optimize SVE Rx performance Dongdong Liu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Dongdong Liu @ 2023-07-11 10:24 UTC (permalink / raw)
  To: dev, ferruh.yigit, thomas, andrew.rybchenko; +Cc: stable

From: Huisong Li <lihuisong@huawei.com>

Replace hns3_rxq_rearm_mbuf_sve() with the generic
hns3_rxq_rearm_mbuf() to optimize SVE Rx performance.

In rxonly forwarding mode, the performance of a single queue
for 64B packets is improved by ~15%.

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
---
 drivers/net/hns3/hns3_rxtx_vec.c     | 51 ---------------------------
 drivers/net/hns3/hns3_rxtx_vec.h     | 51 +++++++++++++++++++++++++++
 drivers/net/hns3/hns3_rxtx_vec_sve.c | 52 ++--------------------------
 3 files changed, 53 insertions(+), 101 deletions(-)

diff --git a/drivers/net/hns3/hns3_rxtx_vec.c b/drivers/net/hns3/hns3_rxtx_vec.c
index cd9264d91b..9708ec614e 100644
--- a/drivers/net/hns3/hns3_rxtx_vec.c
+++ b/drivers/net/hns3/hns3_rxtx_vec.c
@@ -55,57 +55,6 @@ hns3_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 	return nb_tx;
 }
 
-static inline void
-hns3_rxq_rearm_mbuf(struct hns3_rx_queue *rxq)
-{
-#define REARM_LOOP_STEP_NUM	4
-	struct hns3_entry *rxep = &rxq->sw_ring[rxq->rx_rearm_start];
-	struct hns3_desc *rxdp = rxq->rx_ring + rxq->rx_rearm_start;
-	uint64_t dma_addr;
-	int i;
-
-	if (unlikely(rte_mempool_get_bulk(rxq->mb_pool, (void *)rxep,
-					  HNS3_DEFAULT_RXQ_REARM_THRESH) < 0)) {
-		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed++;
-		return;
-	}
-
-	for (i = 0; i < HNS3_DEFAULT_RXQ_REARM_THRESH; i += REARM_LOOP_STEP_NUM,
-		rxep += REARM_LOOP_STEP_NUM, rxdp += REARM_LOOP_STEP_NUM) {
-		if (likely(i <
-			HNS3_DEFAULT_RXQ_REARM_THRESH - REARM_LOOP_STEP_NUM)) {
-			rte_prefetch_non_temporal(rxep[4].mbuf);
-			rte_prefetch_non_temporal(rxep[5].mbuf);
-			rte_prefetch_non_temporal(rxep[6].mbuf);
-			rte_prefetch_non_temporal(rxep[7].mbuf);
-		}
-
-		dma_addr = rte_mbuf_data_iova_default(rxep[0].mbuf);
-		rxdp[0].addr = rte_cpu_to_le_64(dma_addr);
-		rxdp[0].rx.bd_base_info = 0;
-
-		dma_addr = rte_mbuf_data_iova_default(rxep[1].mbuf);
-		rxdp[1].addr = rte_cpu_to_le_64(dma_addr);
-		rxdp[1].rx.bd_base_info = 0;
-
-		dma_addr = rte_mbuf_data_iova_default(rxep[2].mbuf);
-		rxdp[2].addr = rte_cpu_to_le_64(dma_addr);
-		rxdp[2].rx.bd_base_info = 0;
-
-		dma_addr = rte_mbuf_data_iova_default(rxep[3].mbuf);
-		rxdp[3].addr = rte_cpu_to_le_64(dma_addr);
-		rxdp[3].rx.bd_base_info = 0;
-	}
-
-	rxq->rx_rearm_start += HNS3_DEFAULT_RXQ_REARM_THRESH;
-	if (rxq->rx_rearm_start >= rxq->nb_rx_desc)
-		rxq->rx_rearm_start = 0;
-
-	rxq->rx_rearm_nb -= HNS3_DEFAULT_RXQ_REARM_THRESH;
-
-	hns3_write_reg_opt(rxq->io_head_reg, HNS3_DEFAULT_RXQ_REARM_THRESH);
-}
-
 uint16_t
 hns3_recv_pkts_vec(void *__restrict rx_queue,
 		   struct rte_mbuf **__restrict rx_pkts,
diff --git a/drivers/net/hns3/hns3_rxtx_vec.h b/drivers/net/hns3/hns3_rxtx_vec.h
index 2c8a91921e..a9a6774294 100644
--- a/drivers/net/hns3/hns3_rxtx_vec.h
+++ b/drivers/net/hns3/hns3_rxtx_vec.h
@@ -94,4 +94,55 @@ hns3_rx_reassemble_pkts(struct rte_mbuf **rx_pkts,
 
 	return count;
 }
+
+static inline void
+hns3_rxq_rearm_mbuf(struct hns3_rx_queue *rxq)
+{
+#define REARM_LOOP_STEP_NUM	4
+	struct hns3_entry *rxep = &rxq->sw_ring[rxq->rx_rearm_start];
+	struct hns3_desc *rxdp = rxq->rx_ring + rxq->rx_rearm_start;
+	uint64_t dma_addr;
+	int i;
+
+	if (unlikely(rte_mempool_get_bulk(rxq->mb_pool, (void *)rxep,
+					  HNS3_DEFAULT_RXQ_REARM_THRESH) < 0)) {
+		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed++;
+		return;
+	}
+
+	for (i = 0; i < HNS3_DEFAULT_RXQ_REARM_THRESH; i += REARM_LOOP_STEP_NUM,
+		rxep += REARM_LOOP_STEP_NUM, rxdp += REARM_LOOP_STEP_NUM) {
+		if (likely(i <
+			HNS3_DEFAULT_RXQ_REARM_THRESH - REARM_LOOP_STEP_NUM)) {
+			rte_prefetch_non_temporal(rxep[4].mbuf);
+			rte_prefetch_non_temporal(rxep[5].mbuf);
+			rte_prefetch_non_temporal(rxep[6].mbuf);
+			rte_prefetch_non_temporal(rxep[7].mbuf);
+		}
+
+		dma_addr = rte_mbuf_data_iova_default(rxep[0].mbuf);
+		rxdp[0].addr = rte_cpu_to_le_64(dma_addr);
+		rxdp[0].rx.bd_base_info = 0;
+
+		dma_addr = rte_mbuf_data_iova_default(rxep[1].mbuf);
+		rxdp[1].addr = rte_cpu_to_le_64(dma_addr);
+		rxdp[1].rx.bd_base_info = 0;
+
+		dma_addr = rte_mbuf_data_iova_default(rxep[2].mbuf);
+		rxdp[2].addr = rte_cpu_to_le_64(dma_addr);
+		rxdp[2].rx.bd_base_info = 0;
+
+		dma_addr = rte_mbuf_data_iova_default(rxep[3].mbuf);
+		rxdp[3].addr = rte_cpu_to_le_64(dma_addr);
+		rxdp[3].rx.bd_base_info = 0;
+	}
+
+	rxq->rx_rearm_start += HNS3_DEFAULT_RXQ_REARM_THRESH;
+	if (rxq->rx_rearm_start >= rxq->nb_rx_desc)
+		rxq->rx_rearm_start = 0;
+
+	rxq->rx_rearm_nb -= HNS3_DEFAULT_RXQ_REARM_THRESH;
+
+	hns3_write_reg_opt(rxq->io_head_reg, HNS3_DEFAULT_RXQ_REARM_THRESH);
+}
 #endif /* HNS3_RXTX_VEC_H */
diff --git a/drivers/net/hns3/hns3_rxtx_vec_sve.c b/drivers/net/hns3/hns3_rxtx_vec_sve.c
index 5011544e07..54aef7db8d 100644
--- a/drivers/net/hns3/hns3_rxtx_vec_sve.c
+++ b/drivers/net/hns3/hns3_rxtx_vec_sve.c
@@ -237,54 +237,6 @@ hns3_recv_burst_vec_sve(struct hns3_rx_queue *__restrict rxq,
 	return nb_rx;
 }
 
-static inline void
-hns3_rxq_rearm_mbuf_sve(struct hns3_rx_queue *rxq)
-{
-#define REARM_LOOP_STEP_NUM	4
-	struct hns3_entry *rxep = &rxq->sw_ring[rxq->rx_rearm_start];
-	struct hns3_desc *rxdp = rxq->rx_ring + rxq->rx_rearm_start;
-	struct hns3_entry *rxep_tmp = rxep;
-	int i;
-
-	if (unlikely(rte_mempool_get_bulk(rxq->mb_pool, (void *)rxep,
-					  HNS3_DEFAULT_RXQ_REARM_THRESH) < 0)) {
-		rte_eth_devices[rxq->port_id].data->rx_mbuf_alloc_failed++;
-		return;
-	}
-
-	for (i = 0; i < HNS3_DEFAULT_RXQ_REARM_THRESH; i += REARM_LOOP_STEP_NUM,
-		rxep_tmp += REARM_LOOP_STEP_NUM) {
-		svuint64_t prf = svld1_u64(PG64_256BIT, (uint64_t *)rxep_tmp);
-		svprfd_gather_u64base(PG64_256BIT, prf, SV_PLDL1STRM);
-	}
-
-	for (i = 0; i < HNS3_DEFAULT_RXQ_REARM_THRESH; i += REARM_LOOP_STEP_NUM,
-		rxep += REARM_LOOP_STEP_NUM, rxdp += REARM_LOOP_STEP_NUM) {
-		uint64_t iova[REARM_LOOP_STEP_NUM];
-		iova[0] = rte_mbuf_iova_get(rxep[0].mbuf);
-		iova[1] = rte_mbuf_iova_get(rxep[1].mbuf);
-		iova[2] = rte_mbuf_iova_get(rxep[2].mbuf);
-		iova[3] = rte_mbuf_iova_get(rxep[3].mbuf);
-		svuint64_t siova = svld1_u64(PG64_256BIT, iova);
-		siova = svadd_n_u64_z(PG64_256BIT, siova, RTE_PKTMBUF_HEADROOM);
-		svuint64_t ol_base = svdup_n_u64(0);
-		svst1_scatter_u64offset_u64(PG64_256BIT,
-			(uint64_t *)&rxdp[0].addr,
-			svindex_u64(BD_FIELD_ADDR_OFFSET, BD_SIZE), siova);
-		svst1_scatter_u64offset_u64(PG64_256BIT,
-			(uint64_t *)&rxdp[0].addr,
-			svindex_u64(BD_FIELD_OL_OFFSET, BD_SIZE), ol_base);
-	}
-
-	rxq->rx_rearm_start += HNS3_DEFAULT_RXQ_REARM_THRESH;
-	if (rxq->rx_rearm_start >= rxq->nb_rx_desc)
-		rxq->rx_rearm_start = 0;
-
-	rxq->rx_rearm_nb -= HNS3_DEFAULT_RXQ_REARM_THRESH;
-
-	hns3_write_reg_opt(rxq->io_head_reg, HNS3_DEFAULT_RXQ_REARM_THRESH);
-}
-
 uint16_t
 hns3_recv_pkts_vec_sve(void *__restrict rx_queue,
 		       struct rte_mbuf **__restrict rx_pkts,
@@ -300,7 +252,7 @@ hns3_recv_pkts_vec_sve(void *__restrict rx_queue,
 	nb_pkts = RTE_ALIGN_FLOOR(nb_pkts, HNS3_SVE_DEFAULT_DESCS_PER_LOOP);
 
 	if (rxq->rx_rearm_nb > HNS3_DEFAULT_RXQ_REARM_THRESH)
-		hns3_rxq_rearm_mbuf_sve(rxq);
+		hns3_rxq_rearm_mbuf(rxq);
 
 	if (unlikely(!(rxdp->rx.bd_base_info &
 			rte_cpu_to_le_32(1u << HNS3_RXD_VLD_B))))
@@ -331,7 +283,7 @@ hns3_recv_pkts_vec_sve(void *__restrict rx_queue,
 			break;
 
 		if (rxq->rx_rearm_nb > HNS3_DEFAULT_RXQ_REARM_THRESH)
-			hns3_rxq_rearm_mbuf_sve(rxq);
+			hns3_rxq_rearm_mbuf(rxq);
 	}
 
 	return nb_rx;
-- 
2.22.0


* [PATCH 5/5] net/hns3: optimize SVE Rx performance
  2023-07-11 10:24 [PATCH 0/5] net/hns3: some performance optimizations Dongdong Liu
                   ` (3 preceding siblings ...)
  2023-07-11 10:24 ` [PATCH 4/5] net/hns3: optimize the rearm mbuf function for SVE Rx Dongdong Liu
@ 2023-07-11 10:24 ` Dongdong Liu
  2023-07-11 10:48 ` [PATCH 0/5] net/hns3: some performance optimizations Ferruh Yigit
  2023-09-25  2:33 ` Jie Hai
  6 siblings, 0 replies; 13+ messages in thread
From: Dongdong Liu @ 2023-07-11 10:24 UTC (permalink / raw)
  To: dev, ferruh.yigit, thomas, andrew.rybchenko; +Cc: stable

From: Huisong Li <lihuisong@huawei.com>

This patch optimizes SVE Rx performance in the following ways:
1> optimize the calculation of the valid BD number (a scalar
   sketch of this step follows below).
2> remove a temporary variable (key_field).
3> parse some descriptor fields in C code instead of with SVE
   instructions.
4> prefetch descriptors in smaller steps.

In rxonly forwarding mode, the performance of a single queue
for 64B packets is improved by ~40%.
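
For item 1>, the new SVE code masks out the VLD bit, compares, and
uses svbrkb/svcntp to count the leading valid BDs in one predicate
operation. A scalar sketch of the same computation (illustrative
only; the helper name is made up):

#include <stdint.h>
#include <rte_byteorder.h>

/* Count the leading descriptors whose VLD bit is set, stopping at
 * the first one hardware has not completed yet. This is what
 * svbrkb_b_z() + svcntp_b32() compute in the diff below. */
static inline uint32_t
count_leading_valid_bds(const struct hns3_desc *rxdp, uint32_t n)
{
	uint32_t i;

	for (i = 0; i < n; i++) {
		uint32_t info = rte_le_to_cpu_32(rxdp[i].rx.bd_base_info);
		if (!(info & (1U << HNS3_RXD_VLD_B)))
			break;
	}
	return i;
}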

Signed-off-by: Huisong Li <lihuisong@huawei.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
---
 drivers/net/hns3/hns3_rxtx_vec_sve.c | 138 ++++++---------------------
 1 file changed, 28 insertions(+), 110 deletions(-)

diff --git a/drivers/net/hns3/hns3_rxtx_vec_sve.c b/drivers/net/hns3/hns3_rxtx_vec_sve.c
index 54aef7db8d..0e9abfebec 100644
--- a/drivers/net/hns3/hns3_rxtx_vec_sve.c
+++ b/drivers/net/hns3/hns3_rxtx_vec_sve.c
@@ -20,40 +20,36 @@
 
 #define BD_SIZE			32
 #define BD_FIELD_ADDR_OFFSET	0
-#define BD_FIELD_L234_OFFSET	8
-#define BD_FIELD_XLEN_OFFSET	12
-#define BD_FIELD_RSS_OFFSET	16
-#define BD_FIELD_OL_OFFSET	24
 #define BD_FIELD_VALID_OFFSET	28
 
-typedef struct {
-	uint32_t l234_info[HNS3_SVE_DEFAULT_DESCS_PER_LOOP];
-	uint32_t ol_info[HNS3_SVE_DEFAULT_DESCS_PER_LOOP];
-	uint32_t bd_base_info[HNS3_SVE_DEFAULT_DESCS_PER_LOOP];
-} HNS3_SVE_KEY_FIELD_S;
-
 static inline uint32_t
 hns3_desc_parse_field_sve(struct hns3_rx_queue *rxq,
 			  struct rte_mbuf **rx_pkts,
-			  HNS3_SVE_KEY_FIELD_S *key,
+			  struct hns3_desc *rxdp,
 			  uint32_t   bd_vld_num)
 {
+	uint32_t l234_info, ol_info, bd_base_info;
 	uint32_t retcode = 0;
 	int ret, i;
 
 	for (i = 0; i < (int)bd_vld_num; i++) {
 		/* init rte_mbuf.rearm_data last 64-bit */
 		rx_pkts[i]->ol_flags = RTE_MBUF_F_RX_RSS_HASH;
-
-		ret = hns3_handle_bdinfo(rxq, rx_pkts[i], key->bd_base_info[i],
-					 key->l234_info[i]);
+		rx_pkts[i]->hash.rss = rxdp[i].rx.rss_hash;
+		rx_pkts[i]->pkt_len = rte_le_to_cpu_16(rxdp[i].rx.pkt_len) -
+					rxq->crc_len;
+		rx_pkts[i]->data_len = rx_pkts[i]->pkt_len;
+
+		l234_info = rxdp[i].rx.l234_info;
+		ol_info = rxdp[i].rx.ol_info;
+		bd_base_info = rxdp[i].rx.bd_base_info;
+		ret = hns3_handle_bdinfo(rxq, rx_pkts[i], bd_base_info, l234_info);
 		if (unlikely(ret)) {
 			retcode |= 1u << i;
 			continue;
 		}
 
-		rx_pkts[i]->packet_type = hns3_rx_calc_ptype(rxq,
-					key->l234_info[i], key->ol_info[i]);
+		rx_pkts[i]->packet_type = hns3_rx_calc_ptype(rxq, l234_info, ol_info);
 
 		/* Increment bytes counter */
 		rxq->basic_stats.bytes += rx_pkts[i]->pkt_len;
@@ -77,46 +73,16 @@ hns3_recv_burst_vec_sve(struct hns3_rx_queue *__restrict rxq,
 			uint16_t nb_pkts,
 			uint64_t *bd_err_mask)
 {
-#define XLEN_ADJUST_LEN		32
-#define RSS_ADJUST_LEN		16
-#define GEN_VLD_U8_ZIP_INDEX	svindex_s8(28, -4)
 	uint16_t rx_id = rxq->next_to_use;
 	struct hns3_entry *sw_ring = &rxq->sw_ring[rx_id];
 	struct hns3_desc *rxdp = &rxq->rx_ring[rx_id];
-	struct hns3_desc *rxdp2;
-	HNS3_SVE_KEY_FIELD_S key_field;
+	struct hns3_desc *rxdp2, *next_rxdp;
 	uint64_t bd_valid_num;
 	uint32_t parse_retcode;
 	uint16_t nb_rx = 0;
 	int pos, offset;
 
-	uint16_t xlen_adjust[XLEN_ADJUST_LEN] = {
-		0,  0xffff, 1,  0xffff,    /* 1st mbuf: pkt_len and dat_len */
-		2,  0xffff, 3,  0xffff,    /* 2st mbuf: pkt_len and dat_len */
-		4,  0xffff, 5,  0xffff,    /* 3st mbuf: pkt_len and dat_len */
-		6,  0xffff, 7,  0xffff,    /* 4st mbuf: pkt_len and dat_len */
-		8,  0xffff, 9,  0xffff,    /* 5st mbuf: pkt_len and dat_len */
-		10, 0xffff, 11, 0xffff,    /* 6st mbuf: pkt_len and dat_len */
-		12, 0xffff, 13, 0xffff,    /* 7st mbuf: pkt_len and dat_len */
-		14, 0xffff, 15, 0xffff,    /* 8st mbuf: pkt_len and dat_len */
-	};
-
-	uint32_t rss_adjust[RSS_ADJUST_LEN] = {
-		0, 0xffff,        /* 1st mbuf: rss */
-		1, 0xffff,        /* 2st mbuf: rss */
-		2, 0xffff,        /* 3st mbuf: rss */
-		3, 0xffff,        /* 4st mbuf: rss */
-		4, 0xffff,        /* 5st mbuf: rss */
-		5, 0xffff,        /* 6st mbuf: rss */
-		6, 0xffff,        /* 7st mbuf: rss */
-		7, 0xffff,        /* 8st mbuf: rss */
-	};
-
 	svbool_t pg32 = svwhilelt_b32(0, HNS3_SVE_DEFAULT_DESCS_PER_LOOP);
-	svuint16_t xlen_tbl1 = svld1_u16(PG16_256BIT, xlen_adjust);
-	svuint16_t xlen_tbl2 = svld1_u16(PG16_256BIT, &xlen_adjust[16]);
-	svuint32_t rss_tbl1 = svld1_u32(PG32_256BIT, rss_adjust);
-	svuint32_t rss_tbl2 = svld1_u32(PG32_256BIT, &rss_adjust[8]);
 
 	/* compile-time verifies the xlen_adjust mask */
 	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
@@ -126,30 +92,21 @@ hns3_recv_burst_vec_sve(struct hns3_rx_queue *__restrict rxq,
 
 	for (pos = 0; pos < nb_pkts; pos += HNS3_SVE_DEFAULT_DESCS_PER_LOOP,
 				     rxdp += HNS3_SVE_DEFAULT_DESCS_PER_LOOP) {
-		svuint64_t vld_clz, mbp1st, mbp2st, mbuf_init;
-		svuint64_t xlen1st, xlen2st, rss1st, rss2st;
-		svuint32_t l234, ol, vld, vld2, xlen, rss;
-		svuint8_t  vld_u8;
+		svuint64_t mbp1st, mbp2st, mbuf_init;
+		svuint32_t vld;
+		svbool_t vld_op;
 
 		/* calc how many bd valid: part 1 */
 		vld = svld1_gather_u32offset_u32(pg32, (uint32_t *)rxdp,
 			svindex_u32(BD_FIELD_VALID_OFFSET, BD_SIZE));
-		vld2 = svlsl_n_u32_z(pg32, vld,
-				    HNS3_UINT32_BIT - 1 - HNS3_RXD_VLD_B);
-		vld2 = svreinterpret_u32_s32(svasr_n_s32_z(pg32,
-			svreinterpret_s32_u32(vld2), HNS3_UINT32_BIT - 1));
+		vld = svand_n_u32_z(pg32, vld, BIT(HNS3_RXD_VLD_B));
+		vld_op = svcmpne_n_u32(pg32, vld, BIT(HNS3_RXD_VLD_B));
+		bd_valid_num = svcntp_b32(pg32, svbrkb_b_z(pg32, vld_op));
+		if (bd_valid_num == 0)
+			break;
 
 		/* load 4 mbuf pointer */
 		mbp1st = svld1_u64(PG64_256BIT, (uint64_t *)&sw_ring[pos]);
-
-		/* calc how many bd valid: part 2 */
-		vld_u8 = svtbl_u8(svreinterpret_u8_u32(vld2),
-				  svreinterpret_u8_s8(GEN_VLD_U8_ZIP_INDEX));
-		vld_clz = svnot_u64_z(PG64_64BIT, svreinterpret_u64_u8(vld_u8));
-		vld_clz = svclz_u64_z(PG64_64BIT, vld_clz);
-		svst1_u64(PG64_64BIT, &bd_valid_num, vld_clz);
-		bd_valid_num /= HNS3_UINT8_BIT;
-
 		/* load 4 more mbuf pointer */
 		mbp2st = svld1_u64(PG64_256BIT, (uint64_t *)&sw_ring[pos + 4]);
 
@@ -159,65 +116,25 @@ hns3_recv_burst_vec_sve(struct hns3_rx_queue *__restrict rxq,
 
 		/* store 4 mbuf pointer into rx_pkts */
 		svst1_u64(PG64_256BIT, (uint64_t *)&rx_pkts[pos], mbp1st);
-
-		/* load key field to vector reg */
-		l234 = svld1_gather_u32offset_u32(pg32, (uint32_t *)rxdp2,
-				svindex_u32(BD_FIELD_L234_OFFSET, BD_SIZE));
-		ol = svld1_gather_u32offset_u32(pg32, (uint32_t *)rxdp2,
-				svindex_u32(BD_FIELD_OL_OFFSET, BD_SIZE));
-
 		/* store 4 mbuf pointer into rx_pkts again */
 		svst1_u64(PG64_256BIT, (uint64_t *)&rx_pkts[pos + 4], mbp2st);
 
-		/* load datalen, pktlen and rss_hash */
-		xlen = svld1_gather_u32offset_u32(pg32, (uint32_t *)rxdp2,
-				svindex_u32(BD_FIELD_XLEN_OFFSET, BD_SIZE));
-		rss = svld1_gather_u32offset_u32(pg32, (uint32_t *)rxdp2,
-				svindex_u32(BD_FIELD_RSS_OFFSET, BD_SIZE));
-
-		/* store key field to stash buffer */
-		svst1_u32(pg32, (uint32_t *)key_field.l234_info, l234);
-		svst1_u32(pg32, (uint32_t *)key_field.bd_base_info, vld);
-		svst1_u32(pg32, (uint32_t *)key_field.ol_info, ol);
-
-		/* sub crc_len for pkt_len and data_len */
-		xlen = svreinterpret_u32_u16(svsub_n_u16_z(PG16_256BIT,
-			svreinterpret_u16_u32(xlen), rxq->crc_len));
-
 		/* init mbuf_initializer */
 		mbuf_init = svdup_n_u64(rxq->mbuf_initializer);
-
-		/* extract datalen, pktlen and rss from xlen and rss */
-		xlen1st = svreinterpret_u64_u16(
-			svtbl_u16(svreinterpret_u16_u32(xlen), xlen_tbl1));
-		xlen2st = svreinterpret_u64_u16(
-			svtbl_u16(svreinterpret_u16_u32(xlen), xlen_tbl2));
-		rss1st = svreinterpret_u64_u32(
-			svtbl_u32(svreinterpret_u32_u32(rss), rss_tbl1));
-		rss2st = svreinterpret_u64_u32(
-			svtbl_u32(svreinterpret_u32_u32(rss), rss_tbl2));
-
 		/* save mbuf_initializer */
 		svst1_scatter_u64base_offset_u64(PG64_256BIT, mbp1st,
 			offsetof(struct rte_mbuf, rearm_data), mbuf_init);
 		svst1_scatter_u64base_offset_u64(PG64_256BIT, mbp2st,
 			offsetof(struct rte_mbuf, rearm_data), mbuf_init);
 
-		/* save datalen and pktlen and rss */
-		svst1_scatter_u64base_offset_u64(PG64_256BIT, mbp1st,
-			offsetof(struct rte_mbuf, pkt_len), xlen1st);
-		svst1_scatter_u64base_offset_u64(PG64_256BIT, mbp1st,
-			offsetof(struct rte_mbuf, hash.rss), rss1st);
-		svst1_scatter_u64base_offset_u64(PG64_256BIT, mbp2st,
-			offsetof(struct rte_mbuf, pkt_len), xlen2st);
-		svst1_scatter_u64base_offset_u64(PG64_256BIT, mbp2st,
-			offsetof(struct rte_mbuf, hash.rss), rss2st);
-
-		rte_prefetch_non_temporal(rxdp +
-					  HNS3_SVE_DEFAULT_DESCS_PER_LOOP);
+		next_rxdp = rxdp + HNS3_SVE_DEFAULT_DESCS_PER_LOOP;
+		rte_prefetch_non_temporal(next_rxdp);
+		rte_prefetch_non_temporal(next_rxdp + 2);
+		rte_prefetch_non_temporal(next_rxdp + 4);
+		rte_prefetch_non_temporal(next_rxdp + 6);
 
 		parse_retcode = hns3_desc_parse_field_sve(rxq, &rx_pkts[pos],
-					&key_field, bd_valid_num);
+					&rxdp2[offset], bd_valid_num);
 		if (unlikely(parse_retcode))
 			(*bd_err_mask) |= ((uint64_t)parse_retcode) << pos;
 
@@ -237,6 +154,7 @@ hns3_recv_burst_vec_sve(struct hns3_rx_queue *__restrict rxq,
 	return nb_rx;
 }
 
+
 uint16_t
 hns3_recv_pkts_vec_sve(void *__restrict rx_queue,
 		       struct rte_mbuf **__restrict rx_pkts,
-- 
2.22.0


* Re: [PATCH 0/5] net/hns3: some performance optimizations
  2023-07-11 10:24 [PATCH 0/5] net/hns3: some performance optimizations Dongdong Liu
                   ` (4 preceding siblings ...)
  2023-07-11 10:24 ` [PATCH 5/5] net/hns3: optimize SVE Rx performance Dongdong Liu
@ 2023-07-11 10:48 ` Ferruh Yigit
  2023-07-11 11:27   ` Dongdong Liu
  2023-09-25  2:33 ` Jie Hai
  6 siblings, 1 reply; 13+ messages in thread
From: Ferruh Yigit @ 2023-07-11 10:48 UTC (permalink / raw)
  To: Dongdong Liu, dev, thomas, andrew.rybchenko; +Cc: stable

On 7/11/2023 11:24 AM, Dongdong Liu wrote:
> This patchset is to do some performance optimizations for hns3.
> 
> Huisong Li (5):
>   net/hns3: fix incorrect index to look up table in NEON Rx
>   net/hns3: fix the order of NEON Rx code
>   net/hns3: optimize free mbuf code for SVE Tx
>   net/hns3: optimize the rearm mbuf function for SVE Rx
>   net/hns3: optimize SVE Rx performance
>

Hi Dongdong, Huisong,

Release is around a week away, OK to get critical fixes, but I can see
there are some optimizations as well.

Is this set for current release or next release?


* Re: [PATCH 0/5] net/hns3: some performance optimizations
  2023-07-11 10:48 ` [PATCH 0/5] net/hns3: some performance optimizations Ferruh Yigit
@ 2023-07-11 11:27   ` Dongdong Liu
  2023-07-11 12:26     ` Ferruh Yigit
  0 siblings, 1 reply; 13+ messages in thread
From: Dongdong Liu @ 2023-07-11 11:27 UTC (permalink / raw)
  To: Ferruh Yigit, dev, thomas, andrew.rybchenko; +Cc: stable

Hi, Ferruh
On 2023/7/11 18:48, Ferruh Yigit wrote:
> On 7/11/2023 11:24 AM, Dongdong Liu wrote:
>> This patchset is to do some performance optimizations for hns3.
>>
>> Huisong Li (5):
>>   net/hns3: fix incorrect index to look up table in NEON Rx
>>   net/hns3: fix the order of NEON Rx code
>>   net/hns3: optimize free mbuf code for SVE Tx
>>   net/hns3: optimize the rearm mbuf function for SVE Rx
>>   net/hns3: optimize SVE Rx performance
>>
>
> Hi Dongdong, Huisong,
>
> Release is around a week away, OK to get critical fixes, but I can see
> there are some optimizations as well.
>
> Is this set for current release or next release?

If possible, we would like this patchset to be applied in the current release.

Thanks,
Dongdong

* Re: [PATCH 0/5] net/hns3: some performance optimizations
  2023-07-11 11:27   ` Dongdong Liu
@ 2023-07-11 12:26     ` Ferruh Yigit
  2023-09-25 14:26       ` Ferruh Yigit
  0 siblings, 1 reply; 13+ messages in thread
From: Ferruh Yigit @ 2023-07-11 12:26 UTC (permalink / raw)
  To: Dongdong Liu, thomas, andrew.rybchenko; +Cc: stable, dev, David Marchand

On 7/11/2023 12:27 PM, Dongdong Liu wrote:
> Hi, Ferruh
> On 2023/7/11 18:48, Ferruh Yigit wrote:
>> On 7/11/2023 11:24 AM, Dongdong Liu wrote:
>>> This patchset is to do some performance optimizations for hns3.
>>>
>>> Huisong Li (5):
>>>   net/hns3: fix incorrect index to look up table in NEON Rx
>>>   net/hns3: fix the order of NEON Rx code
>>>   net/hns3: optimize free mbuf code for SVE Tx
>>>   net/hns3: optimize the rearm mbuf function for SVE Rx
>>>   net/hns3: optimize SVE Rx performance
>>>
>>
>> Hi Dongdong, Huisong,
>>
>> Release is around a week away, OK to get critical fixes, but I can see
>> there are some optimizations as well.
>>
>> Is this set for current release or next release?
> 
> If possible, we would like this patchset to be applied in the current release.
> 
> 

I can see there is a good performance increase, which makes it harder
to defer, but I feel this level of change is risky and you won't have
time to test and fix any issues.

Let me take the first patch, as it is a fix. I can merge the remaining
patches early in the next release cycle.
@Thomas, what do you think?


* Re: [PATCH 1/5] net/hns3: fix incorrect index to look up table in NEON Rx
  2023-07-11 10:24 ` [PATCH 1/5] net/hns3: fix incorrect index to look up table in NEON Rx Dongdong Liu
@ 2023-07-11 12:58   ` Ferruh Yigit
  0 siblings, 0 replies; 13+ messages in thread
From: Ferruh Yigit @ 2023-07-11 12:58 UTC (permalink / raw)
  To: Dongdong Liu, thomas, andrew.rybchenko; +Cc: stable, dev

On 7/11/2023 11:24 AM, Dongdong Liu wrote:
> From: Huisong Li <lihuisong@huawei.com>
> 
> In hns3_recv_burst_vec(), the indexes used to extract the packet
> length and the data size are reversed. Fortunately, this doesn't
> affect functionality, because NEON Rx only supports a single BD,
> in which the packet length is equal to the data size. This patch
> swaps the indexes back to the correct order.
> 
> Fixes: a3d4f4d291d7 ("net/hns3: support NEON Rx")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
>

(Just for this patch, not series)
Applied to dpdk-next-net/main, thanks.


* Re: [PATCH 0/5] net/hns3: some performance optimizations
  2023-07-11 10:24 [PATCH 0/5] net/hns3: some performance optimizations Dongdong Liu
                   ` (5 preceding siblings ...)
  2023-07-11 10:48 ` [PATCH 0/5] net/hns3: some performance optimizations Ferruh Yigit
@ 2023-09-25  2:33 ` Jie Hai
  6 siblings, 0 replies; 13+ messages in thread
From: Jie Hai @ 2023-09-25  2:33 UTC (permalink / raw)
  To: Dongdong Liu, dev, ferruh.yigit, thomas, andrew.rybchenko; +Cc: stable

Hi, all maintainers,

Kindly ping for patch 2/5-5/5.

Best regards, Jie Hai

On 2023/7/11 18:24, Dongdong Liu wrote:
> This patchset is to do some performance optimizations for hns3.
> 
> Huisong Li (5):
>    net/hns3: fix incorrect index to look up table in NEON Rx
>    net/hns3: fix the order of NEON Rx code
>    net/hns3: optimize free mbuf code for SVE Tx
>    net/hns3: optimize the rearm mbuf function for SVE Rx
>    net/hns3: optimize SVE Rx performance
> 
>   drivers/net/hns3/hns3_rxtx_vec.c      |  51 ------
>   drivers/net/hns3/hns3_rxtx_vec.h      |  51 ++++++
>   drivers/net/hns3/hns3_rxtx_vec_neon.h |  82 ++++-----
>   drivers/net/hns3/hns3_rxtx_vec_sve.c  | 230 ++++----------------------
>   4 files changed, 114 insertions(+), 300 deletions(-)
> 
> --
> 2.22.0
> 

* Re: [PATCH 3/5] net/hns3: optimize free mbuf code for SVE Tx
  2023-07-11 10:24 ` [PATCH 3/5] net/hns3: optimize free mbuf code for SVE Tx Dongdong Liu
@ 2023-09-25 14:21   ` Ferruh Yigit
  0 siblings, 0 replies; 13+ messages in thread
From: Ferruh Yigit @ 2023-09-25 14:21 UTC (permalink / raw)
  To: Dongdong Liu, dev, thomas, andrew.rybchenko
  Cc: stable, Honnappa Nagarahalli, Ruifeng Wang

On 7/11/2023 11:24 AM, Dongdong Liu wrote:
> From: Huisong Li <lihuisong@huawei.com>
> 
> Currently, hns3 SVE Tx checks the valid bits of all descriptors
> in a batch and then determines whether to release the corresponding
> mbufs. Actually, once the valid bit of any descriptor in a batch
> isn't cleared, the driver does not need to scan the rest of the
> descriptors.
> 
> Optimizing the SVE algorithm to exit early in this way improves the
> performance of a single queue for 64B packets by ~2% in txonly
> forwarding mode, while scanning the descriptors in plain C code
> improves it by ~8%.
> 
> So this patch uses C code for this function to improve SVE Tx
> performance.
> 
> Signed-off-by: Huisong Li <lihuisong@huawei.com>
> Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
> 

SVE Tx optimized by removing SVE implementation :)

Do you have any insight into why the generic vector implementation is faster?

* Re: [PATCH 0/5] net/hns3: some performance optimizations
  2023-07-11 12:26     ` Ferruh Yigit
@ 2023-09-25 14:26       ` Ferruh Yigit
  0 siblings, 0 replies; 13+ messages in thread
From: Ferruh Yigit @ 2023-09-25 14:26 UTC (permalink / raw)
  To: Dongdong Liu, thomas, andrew.rybchenko; +Cc: stable, dev, David Marchand

On 7/11/2023 1:26 PM, Ferruh Yigit wrote:
> On 7/11/2023 12:27 PM, Dongdong Liu wrote:
>> Hi, Ferruh
>> On 2023/7/11 18:48, Ferruh Yigit wrote:
>>> On 7/11/2023 11:24 AM, Dongdong Liu wrote:
>>>> This patchset is to do some performance optimizations for hns3.
>>>>
>>>> Huisong Li (5):
>>>>   net/hns3: fix incorrect index to look up table in NEON Rx
>>>>   net/hns3: fix the order of NEON Rx code
>>>>   net/hns3: optimize free mbuf code for SVE Tx
>>>>   net/hns3: optimize the rearm mbuf function for SVE Rx
>>>>   net/hns3: optimize SVE Rx performance
>>>>
>>>
>>> Hi Dongdong, Huisong,
>>>
>>> Release is around a week away, OK to get critical fixes, but I can see
>>> there are some optimizations as well.
>>>
>>> Is this set for current release or next release?
>>
>> If possible, we would like this patchset to be applied in the current release.
>>
>>
> 
> I can see there is a good performance increase, which makes it harder
> to defer, but I feel this level of change is risky and you won't have
> time to test and fix any issues.
> 
> Let me take the first patch, as it is a fix. I can merge the remaining
> patches early in the next release cycle.
> @Thomas, what do you think?
> 

Series applied to dpdk-next-net/main, thanks.

Thread overview: 13+ messages
2023-07-11 10:24 [PATCH 0/5] net/hns3: some performance optimizations Dongdong Liu
2023-07-11 10:24 ` [PATCH 1/5] net/hns3: fix incorrect index to look up table in NEON Rx Dongdong Liu
2023-07-11 12:58   ` Ferruh Yigit
2023-07-11 10:24 ` [PATCH 2/5] net/hns3: fix the order of NEON Rx code Dongdong Liu
2023-07-11 10:24 ` [PATCH 3/5] net/hns3: optimize free mbuf code for SVE Tx Dongdong Liu
2023-09-25 14:21   ` Ferruh Yigit
2023-07-11 10:24 ` [PATCH 4/5] net/hns3: optimize the rearm mbuf function for SVE Rx Dongdong Liu
2023-07-11 10:24 ` [PATCH 5/5] net/hns3: optimize SVE Rx performance Dongdong Liu
2023-07-11 10:48 ` [PATCH 0/5] net/hns3: some performance optimizations Ferruh Yigit
2023-07-11 11:27   ` Dongdong Liu
2023-07-11 12:26     ` Ferruh Yigit
2023-09-25 14:26       ` Ferruh Yigit
2023-09-25  2:33 ` Jie Hai
