DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH v1 0/4] fix note error
@ 2021-07-23  3:10 Feifei Wang
  2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 1/4] drivers/net: remove redundant phrases Feifei Wang
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Feifei Wang @ 2021-07-23  3:10 UTC (permalink / raw)
  Cc: dev, nd, Feifei Wang

Fix drivers/net note error and do some optimization for
i40e NEON path.

Feifei Wang (4):
  drivers/net: remove redundant phrases
  drivers/net: fix note error for Rx vector
  net/i40e: reorder Rx NEON code for better readability
  net/i40e: change code order to reduce L1 cache misses

 drivers/net/fm10k/fm10k_rxtx_vec.c       |   6 +-
 drivers/net/i40e/i40e_rxtx_vec_altivec.c |  10 +--
 drivers/net/i40e/i40e_rxtx_vec_neon.c    | 101 ++++++++++-------------
 drivers/net/i40e/i40e_rxtx_vec_sse.c     |   6 +-
 drivers/net/iavf/iavf_rxtx_vec_sse.c     |  12 +--
 drivers/net/ice/ice_rxtx_vec_sse.c       |   6 +-
 drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c   |   6 +-
 7 files changed, 68 insertions(+), 79 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [dpdk-dev] [PATCH v1 1/4] drivers/net: remove redundant phrases
  2021-07-23  3:10 [dpdk-dev] [PATCH v1 0/4] fix note error Feifei Wang
@ 2021-07-23  3:10 ` Feifei Wang
  2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 2/4] drivers/net: fix note error for Rx vector Feifei Wang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Feifei Wang @ 2021-07-23  3:10 UTC (permalink / raw)
  To: Qi Zhang, Xiao Wang, David Christensen, Beilei Xing,
	Ruifeng Wang, Bruce Richardson, Konstantin Ananyev, Jingjing Wu,
	Qiming Yang, Haiyue Wang, Cunming Liang, Chen Jing D(Mark),
	Chao Zhu, Gowrishankar Muthukrishnan, Jerin Jacob, Jianbo Liu,
	Zhe Tao, Leyi Rong, Wenzhuo Lu
  Cc: dev, nd, Feifei Wang, stable

For the note of Rx vec path,when extract and record EOP bit, the code
note should be "as the count of dd bits doesn't care", remove the
redundant "count".

fm10k:
Fixes: 7092be8437bd ("fm10k: add vector Rx")
Cc: jing.d.chen@intel.com

i40e-altive:
Fixes: c3def6a8724c ("net/i40e: implement vector PMD for altivec")
Cc: gowrishankar.m@linux.vnet.ibm.com

i40e-neon:
Fixes: ae0eb310f253 ("net/i40e: implement vector PMD for ARM")

i40e-sse:
Fixes: 9ed94e5bb04e ("i40e: add vector Rx")
Cc: zhe.tao@intel.com

iavf:
Fixes: 319c421f3890 ("net/avf: enable SSE Rx Tx")
Cc: jingjing.wu@intel.com
Fixes: 1162f5a0ef31 ("net/iavf: support flexible Rx descriptor in SSE path")
Cc: leyi.rong@intel.com

ice:
Fixes: c68a52b8b38c ("net/ice: support vector SSE in Rx")
Cc: wenzhuo.lu@intel.com

ixgbe:
Fixes: cf4b4708a88a ("ixgbe: improve slow-path perf with vector scattered Rx")
Cc: bruce.richardson@intel.com

Cc: stable@dpdk.org

Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/fm10k/fm10k_rxtx_vec.c       | 2 +-
 drivers/net/i40e/i40e_rxtx_vec_altivec.c | 2 +-
 drivers/net/i40e/i40e_rxtx_vec_neon.c    | 2 +-
 drivers/net/i40e/i40e_rxtx_vec_sse.c     | 2 +-
 drivers/net/iavf/iavf_rxtx_vec_sse.c     | 4 ++--
 drivers/net/ice/ice_rxtx_vec_sse.c       | 2 +-
 drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c   | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/fm10k/fm10k_rxtx_vec.c b/drivers/net/fm10k/fm10k_rxtx_vec.c
index 39e3cdac1f..cae5322d48 100644
--- a/drivers/net/fm10k/fm10k_rxtx_vec.c
+++ b/drivers/net/fm10k/fm10k_rxtx_vec.c
@@ -544,7 +544,7 @@ fm10k_recv_raw_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 			/* and with mask to extract bits, flipping 1-0 */
 			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
 			/* the staterr values are not in order, as the count
-			 * count of dd bits doesn't care. However, for end of
+			 * of dd bits doesn't care. However, for end of
 			 * packet tracking, we do care, so shuffle. This also
 			 * compresses the 32-bit values to 8-bit
 			 */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index 1ad74646d6..edaa462ac8 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -398,7 +398,7 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 				(vector unsigned char)vec_nor(staterr, staterr),
 				(vector unsigned char)eop_check);
 			/* the staterr values are not in order, as the count
-			 * count of dd bits doesn't care. However, for end of
+			 * of dd bits doesn't care. However, for end of
 			 * packet tracking, we do care, so shuffle. This also
 			 * compresses the 32-bit values to 8-bit
 			 */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 1f5539bda8..32336fdb80 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -387,7 +387,7 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			eop_bits = vmvnq_u8(vreinterpretq_u8_u16(staterr));
 			eop_bits = vandq_u8(eop_bits, eop_check);
 			/* the staterr values are not in order, as the count
-			 * count of dd bits doesn't care. However, for end of
+			 * of dd bits doesn't care. However, for end of
 			 * packet tracking, we do care, so shuffle. This also
 			 * compresses the 32-bit values to 8-bit
 			 */
diff --git a/drivers/net/i40e/i40e_rxtx_vec_sse.c b/drivers/net/i40e/i40e_rxtx_vec_sse.c
index bfa5aff48d..03a0320353 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_sse.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_sse.c
@@ -557,7 +557,7 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			/* and with mask to extract bits, flipping 1-0 */
 			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
 			/* the staterr values are not in order, as the count
-			 * count of dd bits doesn't care. However, for end of
+			 * of dd bits doesn't care. However, for end of
 			 * packet tracking, we do care, so shuffle. This also
 			 * compresses the 32-bit values to 8-bit
 			 */
diff --git a/drivers/net/iavf/iavf_rxtx_vec_sse.c b/drivers/net/iavf/iavf_rxtx_vec_sse.c
index bf87696fa4..b813d96ef4 100644
--- a/drivers/net/iavf/iavf_rxtx_vec_sse.c
+++ b/drivers/net/iavf/iavf_rxtx_vec_sse.c
@@ -590,7 +590,7 @@ _recv_raw_pkts_vec(struct iavf_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			/* and with mask to extract bits, flipping 1-0 */
 			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
 			/* the staterr values are not in order, as the count
-			 * count of dd bits doesn't care. However, for end of
+			 * of dd bits doesn't care. However, for end of
 			 * packet tracking, we do care, so shuffle. This also
 			 * compresses the 32-bit values to 8-bit
 			 */
@@ -884,7 +884,7 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq,
 			/* and with mask to extract bits, flipping 1-0 */
 			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
 			/* the staterr values are not in order, as the count
-			 * count of dd bits doesn't care. However, for end of
+			 * of dd bits doesn't care. However, for end of
 			 * packet tracking, we do care, so shuffle. This also
 			 * compresses the 32-bit values to 8-bit
 			 */
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index 673e44a243..5f7e13ee39 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -545,7 +545,7 @@ _ice_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			/* and with mask to extract bits, flipping 1-0 */
 			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
 			/* the staterr values are not in order, as the count
-			 * count of dd bits doesn't care. However, for end of
+			 * of dd bits doesn't care. However, for end of
 			 * packet tracking, we do care, so shuffle. This also
 			 * compresses the 32-bit values to 8-bit
 			 */
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c b/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c
index 7610fd93db..3a3ef51172 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c
@@ -540,7 +540,7 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 			/* and with mask to extract bits, flipping 1-0 */
 			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
 			/* the staterr values are not in order, as the count
-			 * count of dd bits doesn't care. However, for end of
+			 * of dd bits doesn't care. However, for end of
 			 * packet tracking, we do care, so shuffle. This also
 			 * compresses the 32-bit values to 8-bit
 			 */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [dpdk-dev] [PATCH v1 2/4] drivers/net: fix note error for Rx vector
  2021-07-23  3:10 [dpdk-dev] [PATCH v1 0/4] fix note error Feifei Wang
  2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 1/4] drivers/net: remove redundant phrases Feifei Wang
@ 2021-07-23  3:10 ` Feifei Wang
  2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 3/4] net/i40e: reorder Rx NEON code for better readability Feifei Wang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Feifei Wang @ 2021-07-23  3:10 UTC (permalink / raw)
  To: Qi Zhang, Xiao Wang, David Christensen, Beilei Xing,
	Ruifeng Wang, Bruce Richardson, Konstantin Ananyev, Jingjing Wu,
	Qiming Yang, Haiyue Wang, Chen Jing D(Mark),
	Cunming Liang, Chao Zhu, Gowrishankar Muthukrishnan, Jerin Jacob,
	Jianbo Liu, Zhe Tao, Leyi Rong, Wenzhuo Lu
  Cc: dev, nd, Feifei Wang, stable

For the loop to process packets in Rx vector path, some notes for the
code are wrong, fix these errors.

fm10k:
Fixes: 7092be8437bd ("fm10k: add vector Rx")
Cc: jing.d.chen@intel.com

i40e-altive:
Fixes: c3def6a8724c ("net/i40e: implement vector PMD for altivec")
Cc: gowrishankar.m@linux.vnet.ibm.com

i40e-neon:
Fixes: ae0eb310f253 ("net/i40e: implement vector PMD for ARM")

i40e-sse:
Fixes: 9ed94e5bb04e ("i40e: add vector Rx")
Cc: zhe.tao@intel.com

iavf:
Fixes: 319c421f3890 ("net/avf: enable SSE Rx Tx")
Cc: jingjing.wu@intel.com
Fixes: 1162f5a0ef31 ("net/iavf: support flexible Rx descriptor in SSE path")
Cc: leyi.rong@intel.com

ice:
Fixes: c68a52b8b38c ("net/ice: support vector SSE in Rx")
Cc: wenzhuo.lu@intel.com

ixgbe:
Fixes: cf4b4708a88a ("ixgbe: improve slow-path perf with vector scattered Rx")
Cc: bruce.richardson@intel.com

Cc: stable@dpdk.org

Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/fm10k/fm10k_rxtx_vec.c       | 4 ++--
 drivers/net/i40e/i40e_rxtx_vec_altivec.c | 8 ++++----
 drivers/net/i40e/i40e_rxtx_vec_neon.c    | 8 ++++----
 drivers/net/i40e/i40e_rxtx_vec_sse.c     | 4 ++--
 drivers/net/iavf/iavf_rxtx_vec_sse.c     | 8 ++++----
 drivers/net/ice/ice_rxtx_vec_sse.c       | 4 ++--
 drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c   | 4 ++--
 7 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/net/fm10k/fm10k_rxtx_vec.c b/drivers/net/fm10k/fm10k_rxtx_vec.c
index cae5322d48..83af01dc2d 100644
--- a/drivers/net/fm10k/fm10k_rxtx_vec.c
+++ b/drivers/net/fm10k/fm10k_rxtx_vec.c
@@ -472,7 +472,7 @@ fm10k_recv_raw_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		mbp1 = _mm_loadu_si128((__m128i *)&mbufp[pos]);
 
 		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
+		/* A.1 load desc[3] */
 		descs0[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
 		rte_compiler_barrier();
 
@@ -484,9 +484,9 @@ fm10k_recv_raw_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		mbp2 = _mm_loadu_si128((__m128i *)&mbufp[pos+2]);
 #endif
 
+		/* A.1 load desc[2-0] */
 		descs0[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
 		rte_compiler_barrier();
-		/* B.1 load 2 mbuf point */
 		descs0[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
 		rte_compiler_barrier();
 		descs0[0] = _mm_loadu_si128((__m128i *)(rxdp));
diff --git a/drivers/net/i40e/i40e_rxtx_vec_altivec.c b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
index edaa462ac8..b99323992f 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_altivec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_altivec.c
@@ -281,22 +281,22 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 						  * in one XMM reg.
 						  */
 
-		/* B.1 load 1 mbuf point */
+		/* B.1 load 2 mbuf point */
 		mbp1 = *(vector unsigned long *)&sw_ring[pos];
 		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
+		/* A.1 load desc[3] */
 		descs[3] = *(vector unsigned long *)(rxdp + 3);
 		rte_compiler_barrier();
 
 		/* B.2 copy 2 mbuf point into rx_pkts  */
 		*(vector unsigned long *)&rx_pkts[pos] = mbp1;
 
-		/* B.1 load 1 mbuf point */
+		/* B.1 load 2 mbuf point */
 		mbp2 = *(vector unsigned long *)&sw_ring[pos + 2];
 
+		/* A.1 load desc[2-0] */
 		descs[2] = *(vector unsigned long *)(rxdp + 2);
 		rte_compiler_barrier();
-		/* B.1 load 2 mbuf point */
 		descs[1] = *(vector unsigned long *)(rxdp + 1);
 		rte_compiler_barrier();
 		descs[0] = *(vector unsigned long *)(rxdp);
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 32336fdb80..fb624a4882 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -280,20 +280,20 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 
 		int32x4_t len_shl = {0, 0, 0, PKTLEN_SHIFT};
 
-		/* B.1 load 1 mbuf point */
+		/* B.1 load 2 mbuf point */
 		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
 		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
+		/* A.1 load desc[3] */
 		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
 
 		/* B.2 copy 2 mbuf point into rx_pkts  */
 		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
 
-		/* B.1 load 1 mbuf point */
+		/* B.1 load 2 mbuf point */
 		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
 
+		/* A.1 load desc[2-0] */
 		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
-		/* B.1 load 2 mbuf point */
 		descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
 		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
 
diff --git a/drivers/net/i40e/i40e_rxtx_vec_sse.c b/drivers/net/i40e/i40e_rxtx_vec_sse.c
index 03a0320353..b235502db5 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_sse.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_sse.c
@@ -462,7 +462,7 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
 		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
 		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
+		/* A.1 load desc[3] */
 		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
 		rte_compiler_barrier();
 
@@ -474,9 +474,9 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos+2]);
 #endif
 
+		/* A.1 load desc[2-0] */
 		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
 		rte_compiler_barrier();
-		/* B.1 load 2 mbuf point */
 		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
 		rte_compiler_barrier();
 		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
diff --git a/drivers/net/iavf/iavf_rxtx_vec_sse.c b/drivers/net/iavf/iavf_rxtx_vec_sse.c
index b813d96ef4..ee1e905525 100644
--- a/drivers/net/iavf/iavf_rxtx_vec_sse.c
+++ b/drivers/net/iavf/iavf_rxtx_vec_sse.c
@@ -494,7 +494,7 @@ _recv_raw_pkts_vec(struct iavf_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
 		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
 		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
+		/* A.1 load desc[3] */
 		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
 		rte_compiler_barrier();
 
@@ -506,9 +506,9 @@ _recv_raw_pkts_vec(struct iavf_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
 #endif
 
+		/* A.1 load desc[2-0] */
 		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
 		rte_compiler_barrier();
-		/* B.1 load 2 mbuf point */
 		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
 		rte_compiler_barrier();
 		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
@@ -755,7 +755,7 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq,
 		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
 		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
 		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
+		/* A.1 load desc[3] */
 		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
 		rte_compiler_barrier();
 
@@ -767,9 +767,9 @@ _recv_raw_pkts_vec_flex_rxd(struct iavf_rx_queue *rxq,
 		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
 #endif
 
+		/* A.1 load desc[2-0] */
 		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
 		rte_compiler_barrier();
-		/* B.1 load 2 mbuf point */
 		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
 		rte_compiler_barrier();
 		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
diff --git a/drivers/net/ice/ice_rxtx_vec_sse.c b/drivers/net/ice/ice_rxtx_vec_sse.c
index 5f7e13ee39..653bd28b41 100644
--- a/drivers/net/ice/ice_rxtx_vec_sse.c
+++ b/drivers/net/ice/ice_rxtx_vec_sse.c
@@ -416,7 +416,7 @@ _ice_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		/* B.1 load 2 (64 bit) or 4 (32 bit) mbuf points */
 		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
 		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
+		/* A.1 load desc[3] */
 		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
 		rte_compiler_barrier();
 
@@ -428,9 +428,9 @@ _ice_recv_raw_pkts_vec(struct ice_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos + 2]);
 #endif
 
+		/* A.1 load desc[2-0] */
 		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
 		rte_compiler_barrier();
-		/* B.1 load 2 mbuf point */
 		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
 		rte_compiler_barrier();
 		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c b/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c
index 3a3ef51172..1dea95e73b 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c
@@ -454,7 +454,7 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
 
 		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load 4 pkts desc */
+		/* A.1 load desc[3] */
 		descs[3] = _mm_loadu_si128((__m128i *)(rxdp + 3));
 		rte_compiler_barrier();
 
@@ -466,9 +466,9 @@ _recv_raw_pkts_vec(struct ixgbe_rx_queue *rxq, struct rte_mbuf **rx_pkts,
 		mbp2 = _mm_loadu_si128((__m128i *)&sw_ring[pos+2]);
 #endif
 
+		/* A.1 load desc[2-0] */
 		descs[2] = _mm_loadu_si128((__m128i *)(rxdp + 2));
 		rte_compiler_barrier();
-		/* B.1 load 2 mbuf point */
 		descs[1] = _mm_loadu_si128((__m128i *)(rxdp + 1));
 		rte_compiler_barrier();
 		descs[0] = _mm_loadu_si128((__m128i *)(rxdp));
-- 
2.25.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [dpdk-dev] [PATCH v1 3/4] net/i40e: reorder Rx NEON code for better readability
  2021-07-23  3:10 [dpdk-dev] [PATCH v1 0/4] fix note error Feifei Wang
  2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 1/4] drivers/net: remove redundant phrases Feifei Wang
  2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 2/4] drivers/net: fix note error for Rx vector Feifei Wang
@ 2021-07-23  3:10 ` Feifei Wang
  2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 4/4] net/i40e: change code order to reduce L1 cache misses Feifei Wang
  2021-08-10  3:00 ` [dpdk-dev] [PATCH v1 0/4] fix note error Zhang, Qi Z
  4 siblings, 0 replies; 6+ messages in thread
From: Feifei Wang @ 2021-07-23  3:10 UTC (permalink / raw)
  To: Ruifeng Wang, Beilei Xing; +Cc: dev, nd, Feifei Wang, Joyce Kong

Rearrange the code in logical order for better readability and maintenance
convenience in Rx NEON path.

No performance change with this patch in arm platform.

Suggested-by: Joyce Kong <joyce.kong@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/i40e/i40e_rxtx_vec_neon.c | 99 ++++++++++++---------------
 1 file changed, 44 insertions(+), 55 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index fb624a4882..8f3188e910 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -280,24 +280,18 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 
 		int32x4_t len_shl = {0, 0, 0, PKTLEN_SHIFT};
 
-		/* B.1 load 2 mbuf point */
-		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
-		/* Read desc statuses backwards to avoid race condition */
-		/* A.1 load desc[3] */
+		/* A.1 load desc[3-0] */
 		descs[3] =  vld1q_u64((uint64_t *)(rxdp + 3));
-
-		/* B.2 copy 2 mbuf point into rx_pkts  */
-		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
-
-		/* B.1 load 2 mbuf point */
-		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
-
-		/* A.1 load desc[2-0] */
 		descs[2] =  vld1q_u64((uint64_t *)(rxdp + 2));
 		descs[1] =  vld1q_u64((uint64_t *)(rxdp + 1));
 		descs[0] =  vld1q_u64((uint64_t *)(rxdp));
 
-		/* B.2 copy 2 mbuf point into rx_pkts  */
+		/* B.1 load 4 mbuf point */
+		mbp1 = vld1q_u64((uint64_t *)&sw_ring[pos]);
+		mbp2 = vld1q_u64((uint64_t *)&sw_ring[pos + 2]);
+
+		/* B.2 copy 4 mbuf point into rx_pkts  */
+		vst1q_u64((uint64_t *)&rx_pkts[pos], mbp1);
 		vst1q_u64((uint64_t *)&rx_pkts[pos + 2], mbp2);
 
 		if (split_packet) {
@@ -307,28 +301,9 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
 		}
 
-		/* pkt 3,4 shift the pktlen field to be 16-bit aligned*/
-		uint32x4_t len3 = vshlq_u32(vreinterpretq_u32_u64(descs[3]),
-					    len_shl);
-		descs[3] = vreinterpretq_u64_u16(vsetq_lane_u16
-				(vgetq_lane_u16(vreinterpretq_u16_u32(len3), 7),
-				 vreinterpretq_u16_u64(descs[3]),
-				 7));
-		uint32x4_t len2 = vshlq_u32(vreinterpretq_u32_u64(descs[2]),
-					    len_shl);
-		descs[2] = vreinterpretq_u64_u16(vsetq_lane_u16
-				(vgetq_lane_u16(vreinterpretq_u16_u32(len2), 7),
-				 vreinterpretq_u16_u64(descs[2]),
-				 7));
-
-		/* D.1 pkt 3,4 convert format from desc to pktmbuf */
-		pkt_mb4 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[3]), shuf_msk);
-		pkt_mb3 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[2]), shuf_msk);
-
 		/* C.1 4=>2 filter staterr info only */
 		sterr_tmp2 = vzipq_u16(vreinterpretq_u16_u64(descs[1]),
 				       vreinterpretq_u16_u64(descs[3]));
-		/* C.1 4=>2 filter staterr info only */
 		sterr_tmp1 = vzipq_u16(vreinterpretq_u16_u64(descs[0]),
 				       vreinterpretq_u16_u64(descs[2]));
 
@@ -338,13 +313,19 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 
 		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
 
-		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
-		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb4), crc_adjust);
-		pkt_mb4 = vreinterpretq_u8_u16(tmp);
-		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb3), crc_adjust);
-		pkt_mb3 = vreinterpretq_u8_u16(tmp);
-
-		/* pkt 1,2 shift the pktlen field to be 16-bit aligned*/
+		/* pkts shift the pktlen field to be 16-bit aligned*/
+		uint32x4_t len3 = vshlq_u32(vreinterpretq_u32_u64(descs[3]),
+					    len_shl);
+		descs[3] = vreinterpretq_u64_u16(vsetq_lane_u16
+				(vgetq_lane_u16(vreinterpretq_u16_u32(len3), 7),
+				 vreinterpretq_u16_u64(descs[3]),
+				 7));
+		uint32x4_t len2 = vshlq_u32(vreinterpretq_u32_u64(descs[2]),
+					    len_shl);
+		descs[2] = vreinterpretq_u64_u16(vsetq_lane_u16
+				(vgetq_lane_u16(vreinterpretq_u16_u32(len2), 7),
+				 vreinterpretq_u16_u64(descs[2]),
+				 7));
 		uint32x4_t len1 = vshlq_u32(vreinterpretq_u32_u64(descs[1]),
 					    len_shl);
 		descs[1] = vreinterpretq_u64_u16(vsetq_lane_u16
@@ -358,22 +339,38 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 				 vreinterpretq_u16_u64(descs[0]),
 				 7));
 
-		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
+		/* D.1 pkts convert format from desc to pktmbuf */
+		pkt_mb4 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[3]), shuf_msk);
+		pkt_mb3 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[2]), shuf_msk);
 		pkt_mb2 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[1]), shuf_msk);
 		pkt_mb1 = vqtbl1q_u8(vreinterpretq_u8_u64(descs[0]), shuf_msk);
 
-		/* D.3 copy final 3,4 data to rx_pkts */
-		vst1q_u8((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
-				 pkt_mb4);
-		vst1q_u8((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
-				 pkt_mb3);
-
-		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
+		/* D.2 pkts set in_port/nb_seg and remove crc */
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb4), crc_adjust);
+		pkt_mb4 = vreinterpretq_u8_u16(tmp);
+		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb3), crc_adjust);
+		pkt_mb3 = vreinterpretq_u8_u16(tmp);
 		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb2), crc_adjust);
 		pkt_mb2 = vreinterpretq_u8_u16(tmp);
 		tmp = vsubq_u16(vreinterpretq_u16_u8(pkt_mb1), crc_adjust);
 		pkt_mb1 = vreinterpretq_u8_u16(tmp);
 
+		/* D.3 copy final data to rx_pkts */
+		vst1q_u8((void *)&rx_pkts[pos + 3]->rx_descriptor_fields1,
+				pkt_mb4);
+		vst1q_u8((void *)&rx_pkts[pos + 2]->rx_descriptor_fields1,
+				pkt_mb3);
+		vst1q_u8((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
+				pkt_mb2);
+		vst1q_u8((void *)&rx_pkts[pos]->rx_descriptor_fields1,
+				pkt_mb1);
+
+		desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
+
+		if (likely(pos + RTE_I40E_DESCS_PER_LOOP < nb_pkts)) {
+			rte_prefetch_non_temporal(rxdp + RTE_I40E_DESCS_PER_LOOP);
+		}
+
 		/* C* extract and record EOP bit */
 		if (split_packet) {
 			uint8x16_t eop_shuf_mask = {
@@ -411,14 +408,6 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 					    I40E_UINT16_BIT - 1));
 		stat = ~vgetq_lane_u64(vreinterpretq_u64_u16(staterr), 0);
 
-		rte_prefetch_non_temporal(rxdp + RTE_I40E_DESCS_PER_LOOP);
-
-		/* D.3 copy final 1,2 data to rx_pkts */
-		vst1q_u8((void *)&rx_pkts[pos + 1]->rx_descriptor_fields1,
-			 pkt_mb2);
-		vst1q_u8((void *)&rx_pkts[pos]->rx_descriptor_fields1,
-			 pkt_mb1);
-		desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
 		/* C.4 calc avaialbe number of desc */
 		if (unlikely(stat == 0)) {
 			nb_pkts_recd += RTE_I40E_DESCS_PER_LOOP;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [dpdk-dev] [PATCH v1 4/4] net/i40e: change code order to reduce L1 cache misses
  2021-07-23  3:10 [dpdk-dev] [PATCH v1 0/4] fix note error Feifei Wang
                   ` (2 preceding siblings ...)
  2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 3/4] net/i40e: reorder Rx NEON code for better readability Feifei Wang
@ 2021-07-23  3:10 ` Feifei Wang
  2021-08-10  3:00 ` [dpdk-dev] [PATCH v1 0/4] fix note error Zhang, Qi Z
  4 siblings, 0 replies; 6+ messages in thread
From: Feifei Wang @ 2021-07-23  3:10 UTC (permalink / raw)
  To: Ruifeng Wang, Beilei Xing; +Cc: dev, nd, Feifei Wang

For N1 platform, packet mbuf load and descs load are hot spots to limit
the performance for "desc_to_ptype_v" and "desc_to_olflags_v" functions
in i40e rx NEON path. This is because packet mbuf and descs are evicted
from l1d-cache to l2d-cache.

To reduce l1d-cache-misses and improve the performance, change the code
order and move "desc_to_ptype_v" and "desc_to_olflags_v" functions
forward to the location, where packet mbuf and descs are just loaded.

Test Result:
dpdk:21.08-rc1
gcc-9
For n1sdp, the patch improves the performance by 1.8%.
For thunderx2, no performance changes.

Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/i40e/i40e_rxtx_vec_neon.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 8f3188e910..b2683fda60 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -301,18 +301,6 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 			rte_mbuf_prefetch_part2(rx_pkts[pos + 3]);
 		}
 
-		/* C.1 4=>2 filter staterr info only */
-		sterr_tmp2 = vzipq_u16(vreinterpretq_u16_u64(descs[1]),
-				       vreinterpretq_u16_u64(descs[3]));
-		sterr_tmp1 = vzipq_u16(vreinterpretq_u16_u64(descs[0]),
-				       vreinterpretq_u16_u64(descs[2]));
-
-		/* C.2 get 4 pkts staterr value  */
-		staterr = vzipq_u16(sterr_tmp1.val[1],
-				    sterr_tmp2.val[1]).val[0];
-
-		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
-
 		/* pkts shift the pktlen field to be 16-bit aligned*/
 		uint32x4_t len3 = vshlq_u32(vreinterpretq_u32_u64(descs[3]),
 					    len_shl);
@@ -367,10 +355,22 @@ _recv_raw_pkts_vec(struct i40e_rx_queue *__rte_restrict rxq,
 
 		desc_to_ptype_v(descs, &rx_pkts[pos], ptype_tbl);
 
+		desc_to_olflags_v(rxq, descs, &rx_pkts[pos]);
+
 		if (likely(pos + RTE_I40E_DESCS_PER_LOOP < nb_pkts)) {
 			rte_prefetch_non_temporal(rxdp + RTE_I40E_DESCS_PER_LOOP);
 		}
 
+		/* C.1 4=>2 filter staterr info only */
+		sterr_tmp2 = vzipq_u16(vreinterpretq_u16_u64(descs[1]),
+				       vreinterpretq_u16_u64(descs[3]));
+		sterr_tmp1 = vzipq_u16(vreinterpretq_u16_u64(descs[0]),
+				       vreinterpretq_u16_u64(descs[2]));
+
+		/* C.2 get 4 pkts staterr value  */
+		staterr = vzipq_u16(sterr_tmp1.val[1],
+				    sterr_tmp2.val[1]).val[0];
+
 		/* C* extract and record EOP bit */
 		if (split_packet) {
 			uint8x16_t eop_shuf_mask = {
-- 
2.25.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [dpdk-dev] [PATCH v1 0/4] fix note error
  2021-07-23  3:10 [dpdk-dev] [PATCH v1 0/4] fix note error Feifei Wang
                   ` (3 preceding siblings ...)
  2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 4/4] net/i40e: change code order to reduce L1 cache misses Feifei Wang
@ 2021-08-10  3:00 ` Zhang, Qi Z
  4 siblings, 0 replies; 6+ messages in thread
From: Zhang, Qi Z @ 2021-08-10  3:00 UTC (permalink / raw)
  To: Feifei Wang; +Cc: dev, nd



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Feifei Wang
> Sent: Friday, July 23, 2021 11:11 AM
> Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>
> Subject: [dpdk-dev] [PATCH v1 0/4] fix note error
> 
> Fix drivers/net note error and do some optimization for i40e NEON path.
> 
> Feifei Wang (4):
>   drivers/net: remove redundant phrases
>   drivers/net: fix note error for Rx vector
>   net/i40e: reorder Rx NEON code for better readability
>   net/i40e: change code order to reduce L1 cache misses
> 
>  drivers/net/fm10k/fm10k_rxtx_vec.c       |   6 +-
>  drivers/net/i40e/i40e_rxtx_vec_altivec.c |  10 +--
>  drivers/net/i40e/i40e_rxtx_vec_neon.c    | 101 ++++++++++-------------
>  drivers/net/i40e/i40e_rxtx_vec_sse.c     |   6 +-
>  drivers/net/iavf/iavf_rxtx_vec_sse.c     |  12 +--
>  drivers/net/ice/ice_rxtx_vec_sse.c       |   6 +-
>  drivers/net/ixgbe/ixgbe_rxtx_vec_sse.c   |   6 +-
>  7 files changed, 68 insertions(+), 79 deletions(-)
> 
> --
> 2.25.1

Applied to dpdk-next-net-intel.

Thanks
Qi


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-08-10  3:00 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-23  3:10 [dpdk-dev] [PATCH v1 0/4] fix note error Feifei Wang
2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 1/4] drivers/net: remove redundant phrases Feifei Wang
2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 2/4] drivers/net: fix note error for Rx vector Feifei Wang
2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 3/4] net/i40e: reorder Rx NEON code for better readability Feifei Wang
2021-07-23  3:10 ` [dpdk-dev] [PATCH v1 4/4] net/i40e: change code order to reduce L1 cache misses Feifei Wang
2021-08-10  3:00 ` [dpdk-dev] [PATCH v1 0/4] fix note error Zhang, Qi Z

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).