DPDK patches and discussions
 help / color / mirror / Atom feed
From: Ruifeng Wang <ruifeng.wang@arm.com>
To: ajit.khaparde@broadcom.com, somnath.kotur@broadcom.com
Cc: dev@dpdk.org, honnappa.nagarahalli@arm.com, nd@arm.com,
	Ruifeng Wang <ruifeng.wang@arm.com>
Subject: [PATCH] net/bnxt: reduce barriers in NEON vector Rx
Date: Mon, 13 Jun 2022 14:22:25 +0800	[thread overview]
Message-ID: <20220613062225.2317537-1-ruifeng.wang@arm.com> (raw)

To read descriptors in expected order, barriers are inserted after each
descriptor read. The excessive use of barriers is unnecessary and could
cause performance drop.

Removed barriers between descriptor reads. And changed counting of valid
packets so as to handle discontinuous valid packets. Because out of
order read could lead to valid descriptors that fetched being
discontinuous.

In VPP L3 routing test, 6% performance gain was observed. The test was
done on a platform with ThunderX2 CPU and Broadcom PS225 NIC.

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/bnxt/bnxt_rxtx_vec_neon.c | 47 ++++++++++++++-------------
 1 file changed, 25 insertions(+), 22 deletions(-)

diff --git a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c
index 32f8e59b3a..6a4ece681b 100644
--- a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c
+++ b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c
@@ -235,34 +235,32 @@ recv_burst_vec_neon(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		 * IO barriers are used to ensure consistent state.
 		 */
 		rxcmp1[3] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 7]);
-		rte_io_rmb();
+		rxcmp1[2] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 5]);
+		rxcmp1[1] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 3]);
+		rxcmp1[0] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 1]);
+
+		/* Use acquire fence to order loads of descriptor words. */
+		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
 		/* Reload lower 64b of descriptors to make it ordered after info3_v. */
 		rxcmp1[3] = vreinterpretq_u32_u64(vld1q_lane_u64
 				((void *)&cpr->cp_desc_ring[cons + 7],
 				vreinterpretq_u64_u32(rxcmp1[3]), 0));
-		rxcmp[3] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 6]);
-
-		rxcmp1[2] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 5]);
-		rte_io_rmb();
 		rxcmp1[2] = vreinterpretq_u32_u64(vld1q_lane_u64
 				((void *)&cpr->cp_desc_ring[cons + 5],
 				vreinterpretq_u64_u32(rxcmp1[2]), 0));
-		rxcmp[2] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 4]);
-
-		t1 = vreinterpretq_u64_u32(vzip2q_u32(rxcmp1[2], rxcmp1[3]));
-
-		rxcmp1[1] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 3]);
-		rte_io_rmb();
 		rxcmp1[1] = vreinterpretq_u32_u64(vld1q_lane_u64
 				((void *)&cpr->cp_desc_ring[cons + 3],
 				vreinterpretq_u64_u32(rxcmp1[1]), 0));
-		rxcmp[1] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 2]);
-
-		rxcmp1[0] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 1]);
-		rte_io_rmb();
 		rxcmp1[0] = vreinterpretq_u32_u64(vld1q_lane_u64
 				((void *)&cpr->cp_desc_ring[cons + 1],
 				vreinterpretq_u64_u32(rxcmp1[0]), 0));
+
+		rxcmp[3] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 6]);
+		rxcmp[2] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 4]);
+
+		t1 = vreinterpretq_u64_u32(vzip2q_u32(rxcmp1[2], rxcmp1[3]));
+
+		rxcmp[1] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 2]);
 		rxcmp[0] = vld1q_u32((void *)&cpr->cp_desc_ring[cons + 0]);
 
 		t0 = vreinterpretq_u64_u32(vzip2q_u32(rxcmp1[0], rxcmp1[1]));
@@ -278,16 +276,21 @@ recv_burst_vec_neon(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		 * bits and count the number of set bits in order to determine
 		 * the number of valid descriptors.
 		 */
-		valid = vget_lane_u64(vreinterpret_u64_u16(vqmovn_u32(info3_v)),
-				      0);
+		valid = vget_lane_u64(vreinterpret_u64_s16(vshr_n_s16
+				(vreinterpret_s16_u16(vshl_n_u16
+				(vqmovn_u32(info3_v), 15)), 15)), 0);
+
 		/*
 		 * At this point, 'valid' is a 64-bit value containing four
-		 * 16-bit fields, each of which is either 0x0001 or 0x0000.
-		 * Compute number of valid descriptors from the index of
-		 * the highest non-zero field.
+		 * 16-bit fields, each of which is either 0xffff or 0x0000.
+		 * Count the number of consecutive 1s from LSB in order to
+		 * determine the number of valid descriptors.
 		 */
-		num_valid = (sizeof(uint64_t) / sizeof(uint16_t)) -
-				(__builtin_clzl(valid & desc_valid_mask) / 16);
+		valid = ~(valid & desc_valid_mask);
+		if (valid == 0)
+			num_valid = 4;
+		else
+			num_valid = __builtin_ctzl(valid) / 16;
 
 		if (num_valid == 0)
 			break;
-- 
2.25.1


             reply	other threads:[~2022-06-13  6:22 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-13  6:22 Ruifeng Wang [this message]
2022-06-26 20:44 ` Ajit Khaparde

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220613062225.2317537-1-ruifeng.wang@arm.com \
    --to=ruifeng.wang@arm.com \
    --cc=ajit.khaparde@broadcom.com \
    --cc=dev@dpdk.org \
    --cc=honnappa.nagarahalli@arm.com \
    --cc=nd@arm.com \
    --cc=somnath.kotur@broadcom.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).