DPDK patches and discussions
* [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64
@ 2020-02-13 12:38 Gavin Hu
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 1/6] net/mlx5: relax the barrier for UAR write Gavin Hu
                   ` (13 more replies)
  0 siblings, 14 replies; 70+ messages in thread
From: Gavin Hu @ 2020-02-13 12:38 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper

Using just-sufficient barriers really matters to performance.
Insufficient barriers cause correctness issues, while barriers
stronger than required, especially in the fast path, are a
performance killer.

In the joint preliminary testing between Arm and Ampere, an 8%~13%
performance boost was measured.

Gavin Hu (5):
  net/mlx5: relax the barrier for UAR write
  net/mlx5: use cio barrier before the BF WQE
  net/mlx5: add missing barrier
  net/mlx5: add descriptive comment for a barrier
  net/mlx5: non-cacheable mapping defaulted for aarch64

Phil Yang (1):
  net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt

 drivers/net/mlx5/mlx5_rxq.c  |  5 +++--
 drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
 drivers/net/mlx5/mlx5_rxtx.h | 11 ++++++++---
 drivers/net/mlx5/mlx5_txq.c  |  4 ++++
 4 files changed, 24 insertions(+), 12 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v1 1/6] net/mlx5: relax the barrier for UAR write
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
@ 2020-02-13 12:38 ` Gavin Hu
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 2/6] net/mlx5: use cio barrier before the BF WQE Gavin Hu
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-02-13 12:38 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper, stable

The UAR is part of the PCI address space that is mapped for direct
access to the HCA from the CPU. Read-write accesses to this space are
strongly ordered, thus a compiler barrier is sufficient for all arches.

This patch set is based on the following aarch64 architectural facts:
1. The PCI BAR space is mapped as nGnRE device memory, neither
cacheable nor write-combining.
2. I/O accesses to a single device are totally ordered.
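
For reference, a minimal sketch of what the two barriers expand to,
per the EAL definitions quoted elsewhere in this thread (sketch only,
not part of the patch):

/* rte_io_wmb() on aarch64 resolves to rte_wmb(), a full data
 * synchronization barrier:
 */
#define rte_wmb() asm volatile("dsb st" : : : "memory")

/* rte_compiler_barrier() only prevents compiler reordering and emits
 * no instruction; the strongly ordered nGnRE mapping makes CPU-side
 * ordering implicit, so this is sufficient here:
 */
#define rte_compiler_barrier() do {		\
	asm volatile ("" : : : "memory");	\
} while (0)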

Fixes: 6bf10ab69be0 ("net/mlx5: support 32-bit systems")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 939778aa5..50b3cc3c9 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -546,7 +546,7 @@ __mlx5_uar_write64_relaxed(uint64_t val, void *addr,
 static __rte_always_inline void
 __mlx5_uar_write64(uint64_t val, void *addr, rte_spinlock_t *lock)
 {
-	rte_io_wmb();
+	rte_compiler_barrier();
 	__mlx5_uar_write64_relaxed(val, addr, lock);
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v1 2/6] net/mlx5: use cio barrier before the BF WQE
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 1/6] net/mlx5: relax the barrier for UAR write Gavin Hu
@ 2020-02-13 12:38 ` Gavin Hu
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 3/6] net/mlx5: add missing barrier Gavin Hu
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-02-13 12:38 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper, stable

To ensure the WQE and doorbell record, which reside in host memory,
are visible to HW before the blue frame, a CIO barrier is sufficient;
a rte_wmb is overkill.

Fixes: 6cb559d67b83 ("net/mlx5: add vectorized Rx/Tx burst for x86")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 50b3cc3c9..c672af4c4 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -658,7 +658,7 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	rte_cio_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
-	rte_wmb();
+	rte_cio_wmb();
 	mlx5_uar_write64_relaxed(*src, dst, txq->uar_lock);
 	if (cond)
 		rte_wmb();
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v1 3/6] net/mlx5: add missing barrier
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 1/6] net/mlx5: relax the barrier for UAR write Gavin Hu
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 2/6] net/mlx5: use cio barrier before the BF WQE Gavin Hu
@ 2020-02-13 12:38 ` Gavin Hu
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 4/6] net/mlx5: add descriptive comment for a barrier Gavin Hu
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-02-13 12:38 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper, stable

To keep the modification of the RX queue descriptor (rxq->cq_db) and
the write to the CQ doorbell register in order, a rte_cio_wmb barrier
is required.

So far the situation was rescued by the stronger-than-required barrier
in mlx5_uar_write64; once that barrier is relaxed, this one becomes a
must.

Fixes: 6bf10ab69be0 ("net/mlx5: support 32-bit systems")
Cc: stable@dpdk.org

Suggested-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index dc0fd8211..2d1b153a3 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -856,7 +856,8 @@ mlx5_arm_cq(struct mlx5_rxq_data *rxq, int sq_n_rxq)
 	doorbell = (uint64_t)doorbell_hi << 32;
 	doorbell |=  rxq->cqn;
 	rxq->cq_db[MLX5_CQ_ARM_DB] = rte_cpu_to_be_32(doorbell_hi);
-	mlx5_uar_write64(rte_cpu_to_be_64(doorbell),
+	rte_cio_wmb();
+	mlx5_uar_write64_relaxed(rte_cpu_to_be_64(doorbell),
 			 cq_db_reg, rxq->uar_lock_cq);
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v1 4/6] net/mlx5: add descriptive comment for a barrier
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (2 preceding siblings ...)
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 3/6] net/mlx5: add missing barrier Gavin Hu
@ 2020-02-13 12:38 ` Gavin Hu
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 5/6] net/mlx5: non-cacheable mapping defaulted for aarch64 Gavin Hu
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-02-13 12:38 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper

The barrier would not be required, or could be moved down, if HW
waited for the doorbell ring before executing the WQE.

This is not the case: HW can start executing the WQE as soon as it
gets the ownership (passed by SW writing the doorbell record).

Add a descriptive comment for this HW-specific behavior.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index c672af4c4..d32c4e430 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -655,6 +655,11 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	uint64_t *dst = MLX5_TX_BFREG(txq);
 	volatile uint64_t *src = ((volatile uint64_t *)wqe);
 
+	/* The ownership of the WQE is passed to HW by updating the
+	 * doorbell record. Once WQE ownership has been passed to the
+	 * HCA, HW can execute the WQE. This barrier ensures the WQE
+	 * is visible to HW before HW executes it.
+	 */
 	rte_cio_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v1 5/6] net/mlx5: non-cacheable mapping defaulted for aarch64
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (3 preceding siblings ...)
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 4/6] net/mlx5: add descriptive comment for a barrier Gavin Hu
@ 2020-02-13 12:38 ` Gavin Hu
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 6/6] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt Gavin Hu
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-02-13 12:38 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper, stable

aarch64 maps PCI resources neither as 'write-combine' nor as
cacheable. In the Linux kernel, arch_can_pci_mmap_wc() evaluates to 0
on aarch64 [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
tree/drivers/pci/pci-sysfs.c?h=v5.4#n1204
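
For context, the kernel-side fallback looks roughly like this (a
sketch of include/linux/pci.h, quoted from memory):

/* Architectures that can map PCI BARs write-combined provide their
 * own definition; aarch64 does not, so the macro falls back to 0 and
 * sysfs never exposes a write-combining resource file.
 */
#ifndef arch_can_pci_mmap_wc
#define arch_can_pci_mmap_wc()	0
#endif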

Fixes: f078ceb6ae93 ("net/mlx5: fix Tx doorbell write memory barrier")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_txq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index bc13abfe6..144bab4a6 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -319,7 +319,11 @@ txq_uar_ncattr_init(struct mlx5_txq_ctrl *txq_ctrl, size_t page_size)
 	off_t cmd;
 
 	txq_ctrl->txq.db_heu = priv->config.dbnc == MLX5_TXDB_HEURISTIC;
+#ifdef RTE_ARCH_ARM64
+	txq_ctrl->txq.db_nc = 1;
+#else
 	txq_ctrl->txq.db_nc = 0;
+#endif
 	/* Check the doorbell register mapping type. */
 	cmd = txq_ctrl->uar_mmap_offset / page_size;
 	cmd >>= MLX5_UAR_MMAP_CMD_SHIFT;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v1 6/6] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (4 preceding siblings ...)
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 5/6] net/mlx5: non-cacheable mapping defaulted for aarch64 Gavin Hu
@ 2020-02-13 12:38 ` Gavin Hu
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD Gavin Hu
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-02-13 12:38 UTC (permalink / raw)
  To: dev
  Cc: nd, Phil Yang, david.marchand, thomas, rasland, drc,
	bruce.richardson, konstantin.ananyev, matan, shahafs,
	viacheslavo, jerinj, Honnappa.Nagarahalli, ruifeng.wang,
	joyce.kong, steve.capper, stable

From: Phil Yang <phil.yang@arm.com>

PMD Rx queue descriptor contains two mlx5_mprq_buf fields, which
are the multi-packet RQ buffer header pointers. It uses the common
rte_atomic_XXX functions to make sure the refcnt access is atomic.

The common rte_atomic_XXX functions are full barriers on aarch64.
Optimize it with one-way barriers to improve performance.
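
For background, the legacy helpers are full barriers because the
generic implementation builds on the __sync builtins, which imply
sequential consistency; a sketch of the generic EAL header (quoted
from memory):

static inline int16_t
rte_atomic16_add_return(rte_atomic16_t *v, int16_t inc)
{
	/* The __sync builtins imply a full memory barrier. */
	return __sync_add_and_fetch(&v->cnt, inc);
}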

Fixes: 7d6bf6b866b8 ("net/mlx5: add Multi-Packet Rx support")
Cc: stable@dpdk.org

Suggested-by: Gavin Hu <gavin.hu@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 drivers/net/mlx5/mlx5_rxq.c  |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
 drivers/net/mlx5/mlx5_rxtx.h |  2 +-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 2d1b153a3..765bb1af5 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1535,7 +1535,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,
 
 	memset(_m, 0, sizeof(*buf));
 	buf->mp = mp;
-	rte_atomic16_set(&buf->refcnt, 1);
+	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 	for (j = 0; j != strd_n; ++j) {
 		shinfo = &buf->shinfos[j];
 		shinfo->free_cb = mlx5_mprq_buf_free_cb;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 5eea932d4..0e7519c56 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1592,10 +1592,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
 {
 	struct mlx5_mprq_buf *buf = opaque;
 
-	if (rte_atomic16_read(&buf->refcnt) == 1) {
+	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
 		rte_mempool_put(buf->mp, buf);
-	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
-		rte_atomic16_set(&buf->refcnt, 1);
+	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
+						__ATOMIC_RELAXED) == 0)) {
+		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 		rte_mempool_put(buf->mp, buf);
 	}
 }
@@ -1676,7 +1677,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 
 		if (consumed_strd == strd_n) {
 			/* Replace WQE only if the buffer is still in use. */
-			if (rte_atomic16_read(&buf->refcnt) > 1) {
+			if (__atomic_load_n(&buf->refcnt,
+					    __ATOMIC_RELAXED) > 1) {
 				mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
 				/* Release the old buffer. */
 				mlx5_mprq_buf_free(buf);
@@ -1766,9 +1768,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			void *buf_addr;
 
 			/* Increment the refcnt of the whole chunk. */
-			rte_atomic16_add_return(&buf->refcnt, 1);
-			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
-				    strd_n + 1);
+			__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
+			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
+					__ATOMIC_RELAXED) <= strd_n + 1);
 			buf_addr = RTE_PTR_SUB(addr, headroom_sz);
 			/*
 			 * MLX5 device doesn't use iova but it is necessary in a
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index d32c4e430..1f453fe09 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -78,7 +78,7 @@ struct rxq_zip {
 /* Multi-Packet RQ buffer header. */
 struct mlx5_mprq_buf {
 	struct rte_mempool *mp;
-	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
+	uint16_t refcnt; /* Atomically accessed refcnt. */
 	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
 	struct rte_mbuf_ext_shared_info shinfos[];
 	/*
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (5 preceding siblings ...)
  2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 6/6] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt Gavin Hu
@ 2020-04-10 16:41 ` Gavin Hu
  2020-04-10 17:20   ` Andrew Rybchenko
                     ` (4 more replies)
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 1/7] eal: introduce new class of barriers for DMA use cases Gavin Hu
                   ` (6 subsequent siblings)
  13 siblings, 5 replies; 70+ messages in thread
From: Gavin Hu @ 2020-04-10 16:41 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper

To order writes to various memory types, 'sfence' is required for x86,
and 'dmb oshst' is required for aarch64.

But within DPDK, there is no abstracted barrier that covers this
combination: sfence (x86) / dmb oshst (aarch64).

So introduce a new barrier class - rte_dma_*mb - for this combination.

Doorbell rings are a typical use case for this new barrier class: data
must be ready in memory before HW is made aware of it.

As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86,
while rte_wmb is 'dsb' for aarch64.

In the joint preliminary testing between Arm and Ampere, an 8%~13%
performance boost was measured.

As there are no functionality changes, it will not impact x86.
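
As an illustration of the intended use (the names below are
hypothetical, not a real PMD API), a doorbell ring with the new
barrier class would look like:

#include <stdint.h>
#include <rte_atomic.h>
#include <rte_byteorder.h>

struct desc { uint32_t flags; };	/* illustrative descriptor */
#define DESC_VALID 1u

static inline void
ring_doorbell(struct desc *ring, uint16_t tail, volatile uint32_t *db_reg)
{
	/* 1. Publish the descriptor in coherent host (DMA) memory. */
	ring[tail].flags = DESC_VALID;
	/* 2. Order the host-memory store before the MMIO store:
	 *    'sfence' on x86, 'dmb oshst' on aarch64.
	 */
	rte_dma_wmb();
	/* 3. MMIO doorbell write that lets HW know work is available. */
	*db_reg = rte_cpu_to_be_32(tail);
}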

Gavin Hu (6):
  eal: introduce new class of barriers for DMA use cases
  net/mlx5: dmb for immediate doorbell ring on aarch64
  net/mlx5: relax barrier to order UAR writes on aarch64
  net/mlx5: relax barrier for aarch64
  net/mlx5: add descriptive comment for a barrier
  doc: clarify one configuration in mlx5 guide

Phil Yang (1):
  net/mlx5: relax ordering for multi-packet RQ buffer refcnt

 doc/guides/nics/mlx5.rst                    |  6 ++--
 drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
 drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
 lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
 lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
 lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
 lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
 lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
 9 files changed, 78 insertions(+), 15 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v2 1/7] eal: introduce new class of barriers for DMA use cases
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (6 preceding siblings ...)
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD Gavin Hu
@ 2020-04-10 16:41 ` Gavin Hu
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 2/7] net/mlx5: dmb for immediate doorbell ring on aarch64 Gavin Hu
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-04-10 16:41 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper

In DPDK we use rte_*mb barriers to ensure that memory accesses to DMA
regions are observed before MMIO accesses to hardware registers.

On AArch64, the rte_*mb barriers are implemented by "DSB" (Data
Synchronisation Barrier) style instructions which are the strongest
barriers possible.

Recently, however, it has been realised [1] that for devices where
the MMIO regions are shared between all CPUs, it is possible to relax
this memory barrier.

There are cases where we wish to retain the strength of the rte_*mb
memory barriers; thus rather than relax rte_*mb we opt instead to
introduce a new class of barrier rte_dma_*mb.

For AArch64, rte_dma_*mb will be implemented by a relaxed "DMB OSH"
style of barrier.

For other architectures, we implement rte_dma_*mb as rte_*mb so this
should not result in any functional changes.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
---
 lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
 lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
 lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
 lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
 lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
 5 files changed, 55 insertions(+)

diff --git a/lib/librte_eal/arm/include/rte_atomic_32.h b/lib/librte_eal/arm/include/rte_atomic_32.h
index 7dc0d06d1..80208467e 100644
--- a/lib/librte_eal/arm/include/rte_atomic_32.h
+++ b/lib/librte_eal/arm/include/rte_atomic_32.h
@@ -33,6 +33,12 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_mb() rte_mb()
+
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 #define rte_cio_wmb() rte_wmb()
 
 #define rte_cio_rmb() rte_rmb()
diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h
index 7b7099cdc..608726c29 100644
--- a/lib/librte_eal/arm/include/rte_atomic_64.h
+++ b/lib/librte_eal/arm/include/rte_atomic_64.h
@@ -37,6 +37,12 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_mb() asm volatile("dmb osh" : : : "memory")
+
+#define rte_dma_wmb() asm volatile("dmb oshst" : : : "memory")
+
+#define rte_dma_rmb() asm volatile("dmb oshld" : : : "memory")
+
 #define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
 
 #define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
diff --git a/lib/librte_eal/include/generic/rte_atomic.h b/lib/librte_eal/include/generic/rte_atomic.h
index e6ab15a97..042264c7e 100644
--- a/lib/librte_eal/include/generic/rte_atomic.h
+++ b/lib/librte_eal/include/generic/rte_atomic.h
@@ -107,6 +107,37 @@ static inline void rte_io_wmb(void);
 static inline void rte_io_rmb(void);
 ///@}
 
+/** @name DMA Memory Barrier
+ */
+///@{
+/**
+ * memory barrier for DMA use cases
+ *
+ * Guarantees that the LOAD and STORE operations that precede the rte_dma_mb()
+ * call are visible to CPU and I/O device that is shared between all CPUs
+ * before the LOAD and STORE operations that follow it.
+ */
+static inline void rte_dma_mb(void);
+
+/**
+ * Write memory barrier for DMA use cases
+ *
+ * Guarantees that the STORE operations that precede the rte_dma_wmb() call are
+ * visible to CPU and I/O device that is shared between all CPUs before the
+ * STORE operations that follow it.
+ */
+static inline void rte_dma_wmb(void);
+
+/**
+ * Read memory barrier for DMA use cases
+ *
+ * Guarantees that the LOAD operations that precede the rte_dma_rmb() call are
+ * visible to CPU and IO device that is shared between all CPUs before the LOAD
+ * operations that follow it.
+ */
+static inline void rte_dma_rmb(void);
+///@}
+
 /** @name Coherent I/O Memory Barrier
  *
  * Coherent I/O memory barrier is a lightweight version of I/O memory
diff --git a/lib/librte_eal/ppc/include/rte_atomic.h b/lib/librte_eal/ppc/include/rte_atomic.h
index 7e3e13118..faa36bb76 100644
--- a/lib/librte_eal/ppc/include/rte_atomic.h
+++ b/lib/librte_eal/ppc/include/rte_atomic.h
@@ -36,6 +36,12 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_mb() rte_mb()
+
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 #define rte_cio_wmb() rte_wmb()
 
 #define rte_cio_rmb() rte_rmb()
diff --git a/lib/librte_eal/x86/include/rte_atomic.h b/lib/librte_eal/x86/include/rte_atomic.h
index 148398f50..0b1d452f3 100644
--- a/lib/librte_eal/x86/include/rte_atomic.h
+++ b/lib/librte_eal/x86/include/rte_atomic.h
@@ -79,6 +79,12 @@ rte_smp_mb(void)
 
 #define rte_io_rmb() rte_compiler_barrier()
 
+#define rte_dma_mb() rte_mb()
+
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 #define rte_cio_wmb() rte_compiler_barrier()
 
 #define rte_cio_rmb() rte_compiler_barrier()
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v2 2/7] net/mlx5: dmb for immediate doorbell ring on aarch64
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (7 preceding siblings ...)
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 1/7] eal: introduce new class of barriers for DMA use cases Gavin Hu
@ 2020-04-10 16:41 ` Gavin Hu
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 3/7] net/mlx5: relax barrier to order UAR writes " Gavin Hu
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-04-10 16:41 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper

A 'DMB' is enough to evict the merge buffer on aarch64 when the
doorbell register is mapped as 'Normal-NC', the counterpart of WC on
x86.

Otherwise it is mapped as Device memory, and no barrier is required at
all.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 939778aa5..e509f3b88 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -661,7 +661,7 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	rte_wmb();
 	mlx5_uar_write64_relaxed(*src, dst, txq->uar_lock);
 	if (cond)
-		rte_wmb();
+		rte_dma_wmb();
 }
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v2 3/7] net/mlx5: relax barrier to order UAR writes on aarch64
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (8 preceding siblings ...)
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 2/7] net/mlx5: dmb for immediate doorbell ring on aarch64 Gavin Hu
@ 2020-04-10 16:41 ` Gavin Hu
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 4/7] net/mlx5: relax barrier for aarch64 Gavin Hu
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-04-10 16:41 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper, stable

To order writes to host memory and MMIO device memory, 'DMB' is
sufficient on aarch64, an 'other-multi-copy' architecture. 'DSB' is
overkill, especially in the fast path.

Using rte_dma_wmb takes advantage of this on aarch64 while not
impacting x86 and ppc.

Fixes: 6bf10ab69be0 ("net/mlx5: support 32-bit systems")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index e509f3b88..da5d81350 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -546,7 +546,7 @@ __mlx5_uar_write64_relaxed(uint64_t val, void *addr,
 static __rte_always_inline void
 __mlx5_uar_write64(uint64_t val, void *addr, rte_spinlock_t *lock)
 {
-	rte_io_wmb();
+	rte_dma_wmb();
 	__mlx5_uar_write64_relaxed(val, addr, lock);
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v2 4/7] net/mlx5: relax barrier for aarch64
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (9 preceding siblings ...)
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 3/7] net/mlx5: relax barrier to order UAR writes " Gavin Hu
@ 2020-04-10 16:41 ` Gavin Hu
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 5/7] net/mlx5: add descriptive comment for a barrier Gavin Hu
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-04-10 16:41 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper, stable

To ensure the WQE and doorbell record, which reside in host memory,
are visible to HW before the blue frame, an ordered mlx5_uar_write
call is sufficient; a rte_wmb is overkill for aarch64.

Fixes: 6cb559d67b83 ("net/mlx5: add vectorized Rx/Tx burst for x86")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index da5d81350..228e37de5 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -658,8 +658,7 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	rte_cio_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
-	rte_wmb();
-	mlx5_uar_write64_relaxed(*src, dst, txq->uar_lock);
+	mlx5_uar_write64(*src, dst, txq->uar_lock);
 	if (cond)
 		rte_dma_wmb();
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v2 5/7] net/mlx5: add descriptive comment for a barrier
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (10 preceding siblings ...)
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 4/7] net/mlx5: relax barrier for aarch64 Gavin Hu
@ 2020-04-10 16:41 ` Gavin Hu
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 6/7] net/mlx5: relax ordering for multi-packet RQ buffer refcnt Gavin Hu
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 7/7] doc: clarify one configuration in mlx5 guide Gavin Hu
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-04-10 16:41 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper

The barrier would not be required, or could be moved down, if HW
waited for the doorbell ring before executing the WQE.

This is not the case: HW can start executing the WQE as soon as it
gets the ownership (passed by SW writing the doorbell record).

Add a descriptive comment for this HW-specific behavior.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 228e37de5..737d5716d 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -655,6 +655,11 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	uint64_t *dst = MLX5_TX_BFREG(txq);
 	volatile uint64_t *src = ((volatile uint64_t *)wqe);
 
+	/* The ownership of the WQE is passed to HW by updating the
+	 * doorbell record. Once WQE ownership has been passed to the
+	 * HCA, HW can execute the WQE. This barrier ensures the WQE
+	 * is visible to HW before HW executes it.
+	 */
 	rte_cio_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v2 6/7] net/mlx5: relax ordering for multi-packet RQ buffer refcnt
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (11 preceding siblings ...)
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 5/7] net/mlx5: add descriptive comment for a barrier Gavin Hu
@ 2020-04-10 16:41 ` Gavin Hu
  2020-06-23  8:26   ` [dpdk-dev] [PATCH v3] net/mlx5: relaxed " Phil Yang
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 7/7] doc: clarify one configuration in mlx5 guide Gavin Hu
  13 siblings, 1 reply; 70+ messages in thread
From: Gavin Hu @ 2020-04-10 16:41 UTC (permalink / raw)
  To: dev
  Cc: nd, Phil Yang, david.marchand, thomas, rasland, drc,
	bruce.richardson, konstantin.ananyev, matan, shahafs,
	viacheslavo, jerinj, Honnappa.Nagarahalli, ruifeng.wang,
	joyce.kong, steve.capper, stable

From: Phil Yang <phil.yang@arm.com>

PMD Rx queue descriptor contains two mlx5_mprq_buf fields, which
are the multi-packet RQ buffer header pointers. It uses the common
rte_atomic_XXX functions to make sure the refcnt access is atomic.

The common rte_atomic_XXX functions are full barriers on aarch64.
Optimize it with one-way barriers to improve performance.

Fixes: 7d6bf6b866b8 ("net/mlx5: add Multi-Packet Rx support")
Cc: stable@dpdk.org

Suggested-by: Gavin Hu <gavin.hu@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 drivers/net/mlx5/mlx5_rxq.c  |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
 drivers/net/mlx5/mlx5_rxtx.h |  2 +-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 8a6b410ef..834057c3b 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1539,7 +1539,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,
 
 	memset(_m, 0, sizeof(*buf));
 	buf->mp = mp;
-	rte_atomic16_set(&buf->refcnt, 1);
+	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 	for (j = 0; j != strd_n; ++j) {
 		shinfo = &buf->shinfos[j];
 		shinfo->free_cb = mlx5_mprq_buf_free_cb;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index f3bf76376..039dd0a05 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1592,10 +1592,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
 {
 	struct mlx5_mprq_buf *buf = opaque;
 
-	if (rte_atomic16_read(&buf->refcnt) == 1) {
+	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
 		rte_mempool_put(buf->mp, buf);
-	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
-		rte_atomic16_set(&buf->refcnt, 1);
+	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
+						__ATOMIC_RELAXED) == 0)) {
+		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 		rte_mempool_put(buf->mp, buf);
 	}
 }
@@ -1676,7 +1677,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 
 		if (consumed_strd == strd_n) {
 			/* Replace WQE only if the buffer is still in use. */
-			if (rte_atomic16_read(&buf->refcnt) > 1) {
+			if (__atomic_load_n(&buf->refcnt,
+					    __ATOMIC_RELAXED) > 1) {
 				mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
 				/* Release the old buffer. */
 				mlx5_mprq_buf_free(buf);
@@ -1766,9 +1768,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			void *buf_addr;
 
 			/* Increment the refcnt of the whole chunk. */
-			rte_atomic16_add_return(&buf->refcnt, 1);
-			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
-				    strd_n + 1);
+			__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
+			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
+					__ATOMIC_RELAXED) <= strd_n + 1);
 			buf_addr = RTE_PTR_SUB(addr, headroom_sz);
 			/*
 			 * MLX5 device doesn't use iova but it is necessary in a
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 737d5716d..d0a1bffa5 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -78,7 +78,7 @@ struct rxq_zip {
 /* Multi-Packet RQ buffer header. */
 struct mlx5_mprq_buf {
 	struct rte_mempool *mp;
-	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
+	uint16_t refcnt; /* Atomically accessed refcnt. */
 	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
 	struct rte_mbuf_ext_shared_info shinfos[];
 	/*
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH RFC v2 7/7] doc: clarify one configuration in mlx5 guide
  2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
                   ` (12 preceding siblings ...)
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 6/7] net/mlx5: relax ordering for multi-packet RQ buffer refcnt Gavin Hu
@ 2020-04-10 16:41 ` Gavin Hu
  13 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-04-10 16:41 UTC (permalink / raw)
  To: dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper

The 'tx_db_nc' is used to differentiate two mapping types, WC and
non-WC, both of which are actually non-cacheable.

Write-Combining on x86 is not cacheable. Normal-NC, the counterpart on
aarch64, is non-cacheable too; as its name suggests, the cache
hierarchy is bypassed for accesses to these two memory regions.

Interpreting it as 'non-cacheable' makes no distinction and is
confusing.

Re-interpreting it as 'non-combining' makes the distinction clear.

Also explain why the aarch64 default setting for this is different.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/nics/mlx5.rst | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index afd11cd83..addec9f12 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -610,9 +610,9 @@ Run-time configuration
   The rdma core library can map doorbell register in two ways, depending on the
   environment variable "MLX5_SHUT_UP_BF":
 
-  - As regular cached memory (usually with write combining attribute), if the
+  - As regular memory (usually with write combining attribute), if the
     variable is either missing or set to zero.
-  - As non-cached memory, if the variable is present and set to not "0" value.
+  - As non-combining memory, if the variable is present and set to not "0" value.
 
   The type of mapping may slightly affect the Tx performance, the optimal choice
   is strongly relied on the host architecture and should be deduced practically.
@@ -638,6 +638,8 @@ Run-time configuration
   If ``tx_db_nc`` is omitted or set to zero, the preset (if any) environment
   variable "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF",
   the default ``tx_db_nc`` value is zero for ARM64 hosts and one for others.
+  ARM64 is different because it has a gathering buffer; the enforced barrier
+  can evict the doorbell ring immediately.
 
 - ``tx_vec_en`` parameter [int]
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD Gavin Hu
@ 2020-04-10 17:20   ` Andrew Rybchenko
  2020-04-11  3:46     ` Gavin Hu
  2020-05-11 18:06   ` [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a Honnappa Nagarahalli
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 70+ messages in thread
From: Andrew Rybchenko @ 2020-04-10 17:20 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa.Nagarahalli, ruifeng.wang, phil.yang, joyce.kong,
	steve.capper

On 4/10/20 7:41 PM, Gavin Hu wrote:
> To order writes to various memory types, 'sfence' is required for x86,
> and 'dmb oshst' is required for aarch64.
> 
> But within DPDK, there is no abstracted barriers covers this
> combination: sfence(x86)/dmb(aarch64).
> 
> So introduce a new barrier class - rte_dma_*mb for this combination,
> 
> Doorbell rings are typical use cases of this new barrier class, which
> requires something ready in the memory before letting HW aware.
> 
> As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86, while
> rte_wmb is 'dsb' for aarch64.

As far as I can see rte_cio_wmb() is exactly the definition of the
barrier to be used for doorbells. Am I missing something?
Maybe it is just a bug in rte_cio_wmb() on x86?

> In the joint preliminary testing between Arm and Ampere, 8%~13%
> performance boost was measured.
> 
> As there is no functionality changes, it will not impact x86.
> 
> Gavin Hu (6):
>    eal: introduce new class of barriers for DMA use cases
>    net/mlx5: dmb for immediate doorbell ring on aarch64
>    net/mlx5: relax barrier to order UAR writes on aarch64
>    net/mlx5: relax barrier for aarch64
>    net/mlx5: add descriptive comment for a barrier
>    doc: clarify one configuration in mlx5 guide
> 
> Phil Yang (1):
>    net/mlx5: relax ordering for multi-packet RQ buffer refcnt
> 
>   doc/guides/nics/mlx5.rst                    |  6 ++--
>   drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
>   drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
>   drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
>   lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
>   lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
>   lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
>   lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
>   lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
>   9 files changed, 78 insertions(+), 15 deletions(-)
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD
  2020-04-10 17:20   ` Andrew Rybchenko
@ 2020-04-11  3:46     ` Gavin Hu
  2020-04-13  9:51       ` Andrew Rybchenko
  0 siblings, 1 reply; 70+ messages in thread
From: Gavin Hu @ 2020-04-11  3:46 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa Nagarahalli, Ruifeng Wang, Phil Yang, Joyce Kong,
	Steve Capper, nd

Hi Andrew,

> -----Original Message-----
> From: Andrew Rybchenko <arybchenko@solarflare.com>
> Sent: Saturday, April 11, 2020 1:21 AM
> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; david.marchand@redhat.com;
> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> matan@mellanox.com; shahafs@mellanox.com; viacheslavo@mellanox.com;
> jerinj@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and
> use it for mlx5 PMD
> 
> On 4/10/20 7:41 PM, Gavin Hu wrote:
> > To order writes to various memory types, 'sfence' is required for x86,
> > and 'dmb oshst' is required for aarch64.
> >
> > But within DPDK, there is no abstracted barriers covers this
> > combination: sfence(x86)/dmb(aarch64).
> >
> > So introduce a new barrier class - rte_dma_*mb for this combination,
> >
> > Doorbell rings are typical use cases of this new barrier class, which
> > requires something ready in the memory before letting HW aware.
> >
> > As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86,
> while
> > rte_wmb is 'dsb' for aarch64.
> 
> As far as I can see rte_cio_wmb() is exactly definition of the barrier
> to be used for doorbells. Am I missing something?

I understand rte_cio_wmb is for DMA buffers, for example descriptors and work queues, located in host memory but shared between the CPU and the IO device.
rte_io_wmb is for MMIO regions.
We are missing barriers for mixed memory types, e.g. the doorbell cases.

There is an implication in the definition of rte_cio_wmb: it cannot be used for a non-coherent MMIO region (WC?).
http://code.dpdk.org/dpdk/v20.02/source/lib/librte_eal/common/include/generic/rte_atomic.h#L124
> May be it is just a bug in rte_cio_wmb() on x86?
rte_cio_wmb is ok for doorbells on aarch64, but looking through the kernel code, 'sfence' is required on x86 for various/mixed memory types.
The DPDK mlx5 PMD uses rte_cio_wmb widely and wisely; it orders sequences of writes to host memory that are shared with the IO device.
Strengthening rte_cio_wmb may hurt performance, so a new barrier class is introduced to optimize for aarch64, in the fast path only, while not impacting x86.
http://code.dpdk.org/dpdk/v20.02/source/drivers/net/mlx5/mlx5_rxtx.c#L1087
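
To summarize the write-barrier mapping as proposed in this series
(derived from the definitions quoted above):

  class         x86               aarch64
  rte_wmb       sfence            dsb st
  rte_io_wmb    compiler barrier  dsb st (rte_wmb)
  rte_cio_wmb   compiler barrier  dmb oshst
  rte_dma_wmb   sfence (rte_wmb)  dmb oshst (new)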
/Gavin
> 
> > In the joint preliminary testing between Arm and Ampere, 8%~13%
> > performance boost was measured.
> >
> > As there is no functionality changes, it will not impact x86.
> >
> > Gavin Hu (6):
> >    eal: introduce new class of barriers for DMA use cases
> >    net/mlx5: dmb for immediate doorbell ring on aarch64
> >    net/mlx5: relax barrier to order UAR writes on aarch64
> >    net/mlx5: relax barrier for aarch64
> >    net/mlx5: add descriptive comment for a barrier
> >    doc: clarify one configuration in mlx5 guide
> >
> > Phil Yang (1):
> >    net/mlx5: relax ordering for multi-packet RQ buffer refcnt
> >
> >   doc/guides/nics/mlx5.rst                    |  6 ++--
> >   drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
> >   drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
> >   drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
> >   lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
> >   lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
> >   lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
> >   lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
> >   lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
> >   9 files changed, 78 insertions(+), 15 deletions(-)
> >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD
  2020-04-11  3:46     ` Gavin Hu
@ 2020-04-13  9:51       ` Andrew Rybchenko
  2020-04-13 16:46         ` Gavin Hu
  0 siblings, 1 reply; 70+ messages in thread
From: Andrew Rybchenko @ 2020-04-13  9:51 UTC (permalink / raw)
  To: Gavin Hu, dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa Nagarahalli, Ruifeng Wang, Phil Yang, Joyce Kong,
	Steve Capper

On 4/11/20 6:46 AM, Gavin Hu wrote:
> Hi Andrew,
> 
>> -----Original Message-----
>> From: Andrew Rybchenko <arybchenko@solarflare.com>
>> Sent: Saturday, April 11, 2020 1:21 AM
>> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
>> Cc: nd <nd@arm.com>; david.marchand@redhat.com;
>> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com;
>> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
>> matan@mellanox.com; shahafs@mellanox.com; viacheslavo@mellanox.com;
>> jerinj@marvell.com; Honnappa Nagarahalli
>> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
>> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
>> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
>> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and
>> use it for mlx5 PMD
>>
>> On 4/10/20 7:41 PM, Gavin Hu wrote:
>>> To order writes to various memory types, 'sfence' is required for x86,
>>> and 'dmb oshst' is required for aarch64.
>>>
>>> But within DPDK, there is no abstracted barriers covers this
>>> combination: sfence(x86)/dmb(aarch64).
>>>
>>> So introduce a new barrier class - rte_dma_*mb for this combination,
>>>
>>> Doorbell rings are typical use cases of this new barrier class, which
>>> requires something ready in the memory before letting HW aware.
>>>
>>> As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86,
>> while
>>> rte_wmb is 'dsb' for aarch64.
>>
>> As far as I can see rte_cio_wmb() is exactly definition of the barrier
>> to be used for doorbells. Am I missing something?
> 
> I understand rte_cio_wmb is for DMA buffers, for examples, descriptors, work queues, located in the host memory, but shared between CPU and IO device.
> rte_io_wmb is for MMIO regions. 
> We are missing the barriers for various memory types, eg. Doorbell cases.

When the patch series is applied, we'll have 5 types of memory
barriers: regular, smp, cio, io, dma. Do we really need so
many? Maybe we need a table in the description which could
help to make the right choice, i.e. type of access on both
axes and type of barrier to use at the intersection.

> There is an implication in the definition of rte_cio_wmb, it can not be used for non-coherent MMIO region(WC?)
> http://code.dpdk.org/dpdk/v20.02/source/lib/librte_eal/common/include/generic/rte_atomic.h#L124
>> May be it is just a bug in rte_cio_wmb() on x86?
> rte_cio_wmb is ok for doorbells on aarch64, but looking through the kernel code, 'sfence' is required for various/mixed memory types.
> DPDK mlx5 PMD uses rte_cio_wmb widely and wisely, it orders sequences of writes to host memory that shared by IO device.
> Strengthening rte_cio_wmb may hurt performance, so a new barrier class is introduced to optimize for aarch64, in the fast path only, while not impacting x86.
> http://code.dpdk.org/dpdk/v20.02/source/drivers/net/mlx5/mlx5_rxtx.c#L1087

Maybe it is my problem that I don't fully understand the real-life
use cases where cio should be used in accordance with its
current definition. Does it make sense without a doorbell?
What about HW polling via DMA?

Thanks for explanations,
Andrew.

>>
>>> In the joint preliminary testing between Arm and Ampere, 8%~13%
>>> performance boost was measured.
>>>
>>> As there is no functionality changes, it will not impact x86.
>>>
>>> Gavin Hu (6):
>>>    eal: introduce new class of barriers for DMA use cases
>>>    net/mlx5: dmb for immediate doorbell ring on aarch64
>>>    net/mlx5: relax barrier to order UAR writes on aarch64
>>>    net/mlx5: relax barrier for aarch64
>>>    net/mlx5: add descriptive comment for a barrier
>>>    doc: clarify one configuration in mlx5 guide
>>>
>>> Phil Yang (1):
>>>    net/mlx5: relax ordering for multi-packet RQ buffer refcnt
>>>
>>>   doc/guides/nics/mlx5.rst                    |  6 ++--
>>>   drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
>>>   drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
>>>   drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
>>>   lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
>>>   lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
>>>   lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
>>>   lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
>>>   lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
>>>   9 files changed, 78 insertions(+), 15 deletions(-)
>>>
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD
  2020-04-13  9:51       ` Andrew Rybchenko
@ 2020-04-13 16:46         ` Gavin Hu
  0 siblings, 0 replies; 70+ messages in thread
From: Gavin Hu @ 2020-04-13 16:46 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: nd, david.marchand, thomas, rasland, drc, bruce.richardson,
	konstantin.ananyev, matan, shahafs, viacheslavo, jerinj,
	Honnappa Nagarahalli, Ruifeng Wang, Phil Yang, Joyce Kong,
	Steve Capper, nd

Hi Andrew,

> -----Original Message-----
> From: Andrew Rybchenko <arybchenko@solarflare.com>
> Sent: Monday, April 13, 2020 5:52 PM
> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; david.marchand@redhat.com;
> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> matan@mellanox.com; shahafs@mellanox.com; viacheslavo@mellanox.com;
> jerinj@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and
> use it for mlx5 PMD
> 
> On 4/11/20 6:46 AM, Gavin Hu wrote:
> > Hi Andrew,
> >
> >> -----Original Message-----
> >> From: Andrew Rybchenko <arybchenko@solarflare.com>
> >> Sent: Saturday, April 11, 2020 1:21 AM
> >> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
> >> Cc: nd <nd@arm.com>; david.marchand@redhat.com;
> >> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com;
> >> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> >> matan@mellanox.com; shahafs@mellanox.com;
> viacheslavo@mellanox.com;
> >> jerinj@marvell.com; Honnappa Nagarahalli
> >> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> >> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> >> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> >> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class
> and
> >> use it for mlx5 PMD
> >>
> >> On 4/10/20 7:41 PM, Gavin Hu wrote:
> >>> To order writes to various memory types, 'sfence' is required for x86,
> >>> and 'dmb oshst' is required for aarch64.
> >>>
> >>> But within DPDK, there is no abstracted barriers covers this
> >>> combination: sfence(x86)/dmb(aarch64).
> >>>
> >>> So introduce a new barrier class - rte_dma_*mb for this combination,
> >>>
> >>> Doorbell rings are typical use cases of this new barrier class, which
> >>> requires something ready in the memory before letting HW aware.
> >>>
> >>> As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86,
> >> while
> >>> rte_wmb is 'dsb' for aarch64.
> >>
> >> As far as I can see rte_cio_wmb() is exactly definition of the barrier
> >> to be used for doorbells. Am I missing something?
> >
> > I understand rte_cio_wmb is for DMA buffers, for examples, descriptors,
> work queues, located in the host memory, but shared between CPU and IO
> device.
> > rte_io_wmb is for MMIO regions.
> > We are missing the barriers for various memory types, eg. Doorbell cases.
> 
> When the patch series is applied, we'll have 5 types of memory
> barriers: regular, smp, cio, io, dma. Do we really need so
> many? May be we need a table in description which could
> help to make the right choice. I.e. type of access on both
> axis and type of barrier to use on intersection.
Yes, good suggestion!
Actually Honnappa and I have already made a table sheet for this.
We will provide it in the next release!
Thanks for your opinions!
> 
> > There is an implication in the definition of rte_cio_wmb, it can not be used
> for non-coherent MMIO region(WC?)
> >
> http://code.dpdk.org/dpdk/v20.02/source/lib/librte_eal/common/include/g
> eneric/rte_atomic.h#L124
> >> May be it is just a bug in rte_cio_wmb() on x86?
> > rte_cio_wmb is ok for doorbells on aarch64, but looking through the
> kernel code, 'sfence' is required for various/mixed memory types.
> > DPDK mlx5 PMD uses rte_cio_wmb widely and wisely, it orders sequences
> of writes to host memory that shared by IO device.
> > Strengthening rte_cio_wmb may hurt performance, so a new barrier class
> is introduced to optimize for aarch64, in the fast path only, while not
> impacting x86.
> >
> http://code.dpdk.org/dpdk/v20.02/source/drivers/net/mlx5/mlx5_rxtx.c#L1
> 087
> 
> May be my problem that I don't fully understand real-life
> usecases when cio should be used in accordance with its
> current definition. Does it make sense without doorbell?
> Does HW polling via DMA?
> 
> Thanks for explanations,
> Andrew.
> 
> >>
> >>> In the joint preliminary testing between Arm and Ampere, 8%~13%
> >>> performance boost was measured.
> >>>
> >>> As there is no functionality changes, it will not impact x86.
> >>>
> >>> Gavin Hu (6):
> >>>    eal: introduce new class of barriers for DMA use cases
> >>>    net/mlx5: dmb for immediate doorbell ring on aarch64
> >>>    net/mlx5: relax barrier to order UAR writes on aarch64
> >>>    net/mlx5: relax barrier for aarch64
> >>>    net/mlx5: add descriptive comment for a barrier
> >>>    doc: clarify one configuration in mlx5 guide
> >>>
> >>> Phil Yang (1):
> >>>    net/mlx5: relax ordering for multi-packet RQ buffer refcnt
> >>>
> >>>   doc/guides/nics/mlx5.rst                    |  6 ++--
> >>>   drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
> >>>   drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
> >>>   drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
> >>>   lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
> >>>   lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
> >>>   lib/librte_eal/include/generic/rte_atomic.h | 31
> +++++++++++++++++++++
> >>>   lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
> >>>   lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
> >>>   9 files changed, 78 insertions(+), 15 deletions(-)
> >>>
> >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD Gavin Hu
  2020-04-10 17:20   ` Andrew Rybchenko
@ 2020-05-11 18:06   ` Honnappa Nagarahalli
  2020-05-12  6:18     ` Ruifeng Wang
  2020-06-27 19:12   ` [dpdk-dev] [PATCH v2] " Honnappa Nagarahalli
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-11 18:06 UTC (permalink / raw)
  To: dev, jerinj, hemant.agrawal, ajit.khaparde, igorch, thomas,
	viacheslavo, arybchenko, honnappa.nagarahalli
  Cc: ruifeng.wang, nd

Change the barrier APIs for IO to reflect that Armv8-a has an other-multi-copy
atomicity memory model.

Armv8-a memory model has been strengthened to require
other-multi-copy atomicity. This property requires memory accesses
from an observer to become visible to all other observers
simultaneously [3]. This means

a) A write arriving at an endpoint shared between multiple CPUs is
   visible to all CPUs
b) A write that is visible to all CPUs is also visible to all other
   observers in the shareability domain

This allows for using cheaper DMB instructions in the place of DSB
for devices that are visible to all CPUs (i.e. devices that DPDK
caters to).

Please refer to [1], [2] and [3] for more information.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
[2] https://www.youtube.com/watch?v=i6DayghhA8Q
[3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h
index 7b7099cdc..e406411bb 100644
--- a/lib/librte_eal/arm/include/rte_atomic_64.h
+++ b/lib/librte_eal/arm/include/rte_atomic_64.h
@@ -19,11 +19,11 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_debug.h>
 
-#define rte_mb() asm volatile("dsb sy" : : : "memory")
+#define rte_mb() asm volatile("dmb osh" : : : "memory")
 
-#define rte_wmb() asm volatile("dsb st" : : : "memory")
+#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
 
-#define rte_rmb() asm volatile("dsb ld" : : : "memory")
+#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
 
 #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
 
@@ -37,9 +37,9 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
-#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
+#define rte_cio_wmb() rte_wmb()
 
-#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
+#define rte_cio_rmb() rte_rmb()
 
 /*------------------------ 128 bit atomic operations -------------------------*/
 
-- 
2.17.1
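
In fastpath terms, a typical sequence (hypothetical names) now pairs the
descriptor publish with the cheaper 'dmb oshst' instead of 'dsb st':

        ring->desc[head] = desc;                   /* hugepage/DMA memory */
        rte_wmb();                  /* before: dsb st; after: dmb oshst */
        rte_write32_relaxed(head, ring->doorbell); /* MMIO doorbell */
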


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
  2020-05-11 18:06   ` [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a Honnappa Nagarahalli
@ 2020-05-12  6:18     ` Ruifeng Wang
  2020-05-12  6:42       ` Jerin Jacob
  0 siblings, 1 reply; 70+ messages in thread
From: Ruifeng Wang @ 2020-05-12  6:18 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	igorch, thomas, viacheslavo, arybchenko, Honnappa Nagarahalli
  Cc: nd, nd


> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Tuesday, May 12, 2020 2:07 AM
> To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>;
> igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> arybchenko@solarflare.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> 
> Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy
> atomicity memory model.
> 
> Armv8-a memory model has been strengthened to require other-multi-copy
> atomicity. This property requires memory accesses from an observer to
> become visible to all other observers simultaneously [3]. This means
> 
> a) A write arriving at an endpoint shared between multiple CPUs is
>    visible to all CPUs
> b) A write that is visible to all CPUs is also visible to all other
>    observers in the shareability domain
> 
> This allows for using cheaper DMB instructions in the place of DSB for devices
> that are visible to all CPUs (i.e. devices that DPDK caters to).
> 
> Please refer to [1], [2] and [3] for more information.
> 
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> b/lib/librte_eal/arm/include/rte_atomic_64.h
> index 7b7099cdc..e406411bb 100644
> --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> @@ -19,11 +19,11 @@ extern "C" {
>  #include <rte_compat.h>
>  #include <rte_debug.h>
> 
> -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> 
> -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> 
> -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> 
>  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> 
> @@ -37,9 +37,9 @@ extern "C" {
> 
>  #define rte_io_rmb() rte_rmb()
> 
> -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> +#define rte_cio_wmb() rte_wmb()
> 
> -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> +#define rte_cio_rmb() rte_rmb()
> 
>  /*------------------------ 128 bit atomic operations -------------------------*/
> 
> --
> 2.17.1

This change showed about a 7% performance gain in the testpmd single-core NDR test.
Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
  2020-05-12  6:18     ` Ruifeng Wang
@ 2020-05-12  6:42       ` Jerin Jacob
  2020-05-12  8:02         ` Ruifeng Wang
  0 siblings, 1 reply; 70+ messages in thread
From: Jerin Jacob @ 2020-05-12  6:42 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: Honnappa Nagarahalli, dev, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	igorch, thomas, viacheslavo, arybchenko, nd

On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
>
>
> > -----Original Message-----
> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Sent: Tuesday, May 12, 2020 2:07 AM
> > To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> > Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>;
> > igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> > arybchenko@solarflare.com; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>
> > Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> >
> > Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy
> > atomicity memory model.
> >
> > Armv8-a memory model has been strengthened to require other-multi-copy
> > atomicity. This property requires memory accesses from an observer to
> > become visible to all other observers simultaneously [3]. This means
> >
> > a) A write arriving at an endpoint shared between multiple CPUs is
> >    visible to all CPUs
> > b) A write that is visible to all CPUs is also visible to all other
> >    observers in the shareability domain
> >
> > This allows for using cheaper DMB instructions in the place of DSB for devices
> > that are visible to all CPUs (i.e. devices that DPDK caters to).
> >
> > Please refer to [1], [2] and [3] for more information.
> >
> > [1]
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > index 7b7099cdc..e406411bb 100644
> > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > @@ -19,11 +19,11 @@ extern "C" {
> >  #include <rte_compat.h>
> >  #include <rte_debug.h>
> >
> > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> >
> > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> >
> > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> >
> >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> >
> > @@ -37,9 +37,9 @@ extern "C" {
> >
> >  #define rte_io_rmb() rte_rmb()
> >
> > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > +#define rte_cio_wmb() rte_wmb()
> >
> > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > +#define rte_cio_rmb() rte_rmb()
> >
> >  /*------------------------ 128 bit atomic operations -------------------------*/
> >
> > --
> > 2.17.1
>
> This change showed about 7% performance gain in testpmd single core NDR test.

I am trying to understand this patch w.r.t. the current DPDK usage model.

1) Is the performance improvement due to the fact that the PMD that you
are using for testing is supposed to use the existing rte_cio_* but was
using rte_[rw]mb?
2) In my understanding:
a) CPU to CPU barrier requirements are addressed by rte_smp_*
b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
c) CPU to ANY (CPU or Device) requirements are addressed by rte_[rw]mb

If (c) is true, then we are violating the DPDK spec with this change. Right?
This change will not be required if the fastpath (CPU to Device) is using
rte_cio_*. Right?
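
For concreteness, (2) maps onto fastpath code roughly as below (a sketch
only; the queue/ring names are hypothetical):

        /* a) CPU to CPU: publish a ring entry before the producer index. */
        r->entries[prod & mask] = obj;
        rte_smp_wmb();
        r->prod_tail = prod + 1;

        /* b) CPU to device, via DMA-coherent host memory: fill the
         * descriptor, then set the valid flag the device polls. */
        txq->descs[head].buf = dma_addr;
        rte_cio_wmb();
        txq->descs[head].flags = VALID;

        /* c) CPU to anything, e.g. a descriptor write ordered against an
         * MMIO doorbell write. */
        txq->descs[head] = desc;
        rte_wmb();
        rte_write32_relaxed(head, txq->doorbell);
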



> Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
  2020-05-12  6:42       ` Jerin Jacob
@ 2020-05-12  8:02         ` Ruifeng Wang
  2020-05-12  8:28           ` Jerin Jacob
  2020-05-12 21:44           ` Honnappa Nagarahalli
  0 siblings, 2 replies; 70+ messages in thread
From: Ruifeng Wang @ 2020-05-12  8:02 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Honnappa Nagarahalli, dev, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	igorch, thomas, viacheslavo, arybchenko, nd, nd


> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, May 12, 2020 2:42 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>;
> igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> arybchenko@solarflare.com; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
> 
> On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <Ruifeng.Wang@arm.com>
> wrote:
> >
> >
> > > -----Original Message-----
> > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Sent: Tuesday, May 12, 2020 2:07 AM
> > > To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> > > Khaparde (ajit.khaparde@broadcom.com)
> <ajit.khaparde@broadcom.com>;
> > > igorch@amazon.com; thomas@monjalon.net;
> viacheslavo@mellanox.com;
> > > arybchenko@solarflare.com; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>
> > > Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > >
> > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > other-multi-copy atomicity memory model.
> > >
> > > Armv8-a memory model has been strengthened to require
> > > other-multi-copy atomicity. This property requires memory accesses
> > > from an observer to become visible to all other observers
> > > simultaneously [3]. This means
> > >
> > > a) A write arriving at an endpoint shared between multiple CPUs is
> > >    visible to all CPUs
> > > b) A write that is visible to all CPUs is also visible to all other
> > >    observers in the shareability domain
> > >
> > > This allows for using cheaper DMB instructions in the place of DSB
> > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to).
> > >
> > > Please refer to [1], [2] and [3] for more information.
> > >
> > > [1]
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > ---
> > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > index 7b7099cdc..e406411bb 100644
> > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > @@ -19,11 +19,11 @@ extern "C" {
> > >  #include <rte_compat.h>
> > >  #include <rte_debug.h>
> > >
> > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > >
> > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > >
> > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > >
> > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > >
> > > @@ -37,9 +37,9 @@ extern "C" {
> > >
> > >  #define rte_io_rmb() rte_rmb()
> > >
> > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > > +#define rte_cio_wmb() rte_wmb()
> > >
> > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > > +#define rte_cio_rmb() rte_rmb()
> > >
> > >  /*------------------------ 128 bit atomic operations
> > > -------------------------*/
> > >
> > > --
> > > 2.17.1
> >
> > This change showed about 7% performance gain in testpmd single core
> NDR test.
> 
> I am trying to understand this patch wrt DPDK current usage model?
> 
> 1)  Is performance improvement due to the fact that the PMD that you are
> using it for testing suppose to use existing rte_cio_* but it was using
> rte_[rw]mb?

This is part of the reason. There are also cases where rte_io_* was used and can be relaxed.
Such as: http://patches.dpdk.org/patch/68162/
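
The shape of that relaxation is roughly as follows (a sketch; the field
names are hypothetical, not taken from the patch):

        /* Rx ring refill: descriptors live in DMA-coherent host memory,
         * so ordering them before the tail doorbell does not need the
         * heavier rte_io_wmb(). */
        rxq->rx_ring[idx].addr = rte_cpu_to_le_64(dma_addr);
        rte_cio_wmb();          /* was rte_io_wmb() */
        rte_write32_relaxed(idx, rxq->qrx_tail);
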

> 2) In my understanding :
> a) CPU to CPU barrier requirements are addressed by rte_smp_*
> b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
> c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> 
> If (c) is true then we are violating the DPDK spec with change. Right?

Developers are still required to use the correct barrier APIs for different use cases.
I think this change mitigates the performance penalty when a non-optimal barrier is used.

> This change will not be required if fastpath (CPU to Device) is using rte_cio_*.
> Right?

See 1). Correct usage of rte_cio_* is not the whole story.
For some other use cases, such as a barrier between accesses of different memory types, we can also use the lighter 'dmb' barrier.

> 
> 
> 
> > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
  2020-05-12  8:02         ` Ruifeng Wang
@ 2020-05-12  8:28           ` Jerin Jacob
  2020-05-12 21:44           ` Honnappa Nagarahalli
  1 sibling, 0 replies; 70+ messages in thread
From: Jerin Jacob @ 2020-05-12  8:28 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: Honnappa Nagarahalli, dev, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	igorch, thomas, viacheslavo, arybchenko, nd

On Tue, May 12, 2020 at 1:32 PM Ruifeng Wang <Ruifeng.Wang@arm.com> wrote:
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Tuesday, May 12, 2020 2:42 PM
> > To: Ruifeng Wang <Ruifeng.Wang@arm.com>
> > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> > Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>;
> > igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> > arybchenko@solarflare.com; nd <nd@arm.com>
> > Subject: Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
> >
> > On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <Ruifeng.Wang@arm.com>
> > wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Sent: Tuesday, May 12, 2020 2:07 AM
> > > > To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> > > > Khaparde (ajit.khaparde@broadcom.com)
> > <ajit.khaparde@broadcom.com>;
> > > > igorch@amazon.com; thomas@monjalon.net;
> > viacheslavo@mellanox.com;
> > > > arybchenko@solarflare.com; Honnappa Nagarahalli
> > > > <Honnappa.Nagarahalli@arm.com>
> > > > Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > > >
> > > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > > other-multi-copy atomicity memory model.
> > > >
> > > > Armv8-a memory model has been strengthened to require
> > > > other-multi-copy atomicity. This property requires memory accesses
> > > > from an observer to become visible to all other observers
> > > > simultaneously [3]. This means
> > > >
> > > > a) A write arriving at an endpoint shared between multiple CPUs is
> > > >    visible to all CPUs
> > > > b) A write that is visible to all CPUs is also visible to all other
> > > >    observers in the shareability domain
> > > >
> > > > This allows for using cheaper DMB instructions in the place of DSB
> > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to).
> > > >
> > > > Please refer to [1], [2] and [3] for more information.
> > > >
> > > > [1]
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > ---
> > > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > index 7b7099cdc..e406411bb 100644
> > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > @@ -19,11 +19,11 @@ extern "C" {
> > > >  #include <rte_compat.h>
> > > >  #include <rte_debug.h>
> > > >
> > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > > >
> > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > > >
> > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > > >
> > > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > > >
> > > > @@ -37,9 +37,9 @@ extern "C" {
> > > >
> > > >  #define rte_io_rmb() rte_rmb()
> > > >
> > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > +#define rte_cio_wmb() rte_wmb()
> > > >
> > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > +#define rte_cio_rmb() rte_rmb()
> > > >
> > > >  /*------------------------ 128 bit atomic operations
> > > > -------------------------*/
> > > >
> > > > --
> > > > 2.17.1
> > >
> > > This change showed about 7% performance gain in testpmd single core
> > NDR test.
> >
> > I am trying to understand this patch wrt DPDK current usage model?
> >
> > 1)  Is performance improvement due to the fact that the PMD that you are
> > using it for testing suppose to use existing rte_cio_* but it was using
> > rte_[rw]mb?
>
> This is part of the reason. There are also cases where rte_io_* was used and can be relaxed.
> Such as: http://patches.dpdk.org/patch/68162/
>
> > 2) In my understanding :
> > a) CPU to CPU barrier requirements are addressed by rte_smp_*
> > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
> > c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> >
> > If (c) is true then we are violating the DPDK spec with change. Right?
>
> Developers are still required to use correct barrier APIs for different use cases.
> I think this change mitigates performance penalty when non optimal barrier is used.

But does it violate the contract? We are using rte_[rw]mb as the
heavyweight, low-performance barrier for all cases. I think that is the
contract with DPDK consumers. For each different requirement,
we have a specific API. IMO, it makes sense to change the fastpath code
to use more fine-grained barriers based on the need, rather than changing
the generic one to a lightweight one.
I.e. rte_[rw]mb is the superset that works in all cases, and a customized
barrier should be used for each specific use case.

>
> > This change will not be required if fastpath (CPU to Device) is using rte_cio_*.
> > Right?
>
> See 1). Correct usage of rte_cio_* is not the whole.
> For some other use cases, such as barrier between accesses of different memory types, we can also use lighter barrier 'dmb'.
>
> >
> >
> >
> > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
  2020-05-12  8:02         ` Ruifeng Wang
  2020-05-12  8:28           ` Jerin Jacob
@ 2020-05-12 21:44           ` Honnappa Nagarahalli
  2020-05-13 14:49             ` Jerin Jacob
  1 sibling, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-12 21:44 UTC (permalink / raw)
  To: Ruifeng Wang, Jerin Jacob
  Cc: dev, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	igorch, thomas, viacheslavo, arybchenko, nd,
	Honnappa Nagarahalli, nd

<snip>

> > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > > >
> > > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > > other-multi-copy atomicity memory model.
> > > >
> > > > Armv8-a memory model has been strengthened to require
> > > > other-multi-copy atomicity. This property requires memory accesses
> > > > from an observer to become visible to all other observers
> > > > simultaneously [3]. This means
> > > >
> > > > a) A write arriving at an endpoint shared between multiple CPUs is
> > > >    visible to all CPUs
> > > > b) A write that is visible to all CPUs is also visible to all other
> > > >    observers in the shareability domain
> > > >
> > > > This allows for using cheaper DMB instructions in the place of DSB
> > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to).
> > > >
> > > > Please refer to [1], [2] and [3] for more information.
> > > >
> > > > [1]
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > ---
> > > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > index 7b7099cdc..e406411bb 100644
> > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > @@ -19,11 +19,11 @@ extern "C" {
> > > >  #include <rte_compat.h>
> > > >  #include <rte_debug.h>
> > > >
> > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > > >
> > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > > >
> > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > > >
> > > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > > >
> > > > @@ -37,9 +37,9 @@ extern "C" {
> > > >
> > > >  #define rte_io_rmb() rte_rmb()
> > > >
> > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > +#define rte_cio_wmb() rte_wmb()
> > > >
> > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > +#define rte_cio_rmb() rte_rmb()
> > > >
> > > >  /*------------------------ 128 bit atomic operations
> > > > -------------------------*/
> > > >
> > > > --
> > > > 2.17.1
> > >
> > > This change showed about 7% performance gain in testpmd single core
> > NDR test.
> >
> > I am trying to understand this patch wrt DPDK current usage model?
> >
> > 1)  Is performance improvement due to the fact that the PMD that you
> > are using it for testing suppose to use existing rte_cio_* but it was
> > using rte_[rw]mb?
No, it is supposed to use rte_[rw]mb for x86.

> 
> This is part of the reason. There are also cases where rte_io_* was used and
> can be relaxed.
> Such as: http://patches.dpdk.org/patch/68162/
> 
> > 2) In my understanding :
> > a) CPU to CPU barrier requirements are addressed by rte_smp_*
> > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
> > c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> >
> > If (c) is true then we are violating the DPDK spec with change. Right?
No, we are not. Essentially, due to the other-multi-copy atomicity behavior of the architecture, we are saying 'DMB OSH*' is enough to achieve (c).
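
In litmus-style pseudo-C (a sketch, assuming the device sits in the outer
shareable domain):

        /* 'data' is in normal (hugepage) memory, 'doorbell' points into a
         * device MMIO mapping. */
        *data = 1;
        asm volatile("dmb oshst" : : : "memory");
        *doorbell = 1;
        /* Other-multi-copy atomicity: any observer - another CPU or the
         * device - that sees *doorbell == 1 is guaranteed to also see
         * *data == 1, because the store to *data becomes visible to all
         * observers at once. */
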

> 
> Developers are still required to use correct barrier APIs for different use cases.
> I think this change mitigates performance penalty when non optimal barrier is
> used.
> 
> > This change will not be required if fastpath (CPU to Device) is using
> rte_cio_*.
> > Right?
Yes. It is required when the fastpath uses rte_[rw]mb.

> 
> See 1). Correct usage of rte_cio_* is not the whole.
> For some other use cases, such as barrier between accesses of different
> memory types, we can also use lighter barrier 'dmb'.
> 
> >
> >
> >
> > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
  2020-05-12 21:44           ` Honnappa Nagarahalli
@ 2020-05-13 14:49             ` Jerin Jacob
  2020-05-14  1:02               ` Honnappa Nagarahalli
  0 siblings, 1 reply; 70+ messages in thread
From: Jerin Jacob @ 2020-05-13 14:49 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Ruifeng Wang, dev, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	igorch, thomas, viacheslavo, arybchenko, nd, Richardson, Bruce

On Wed, May 13, 2020 at 3:14 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> > > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > > > >
> > > > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > > > other-multi-copy atomicity memory model.
> > > > >
> > > > > Armv8-a memory model has been strengthened to require
> > > > > other-multi-copy atomicity. This property requires memory accesses
> > > > > from an observer to become visible to all other observers
> > > > > simultaneously [3]. This means
> > > > >
> > > > > a) A write arriving at an endpoint shared between multiple CPUs is
> > > > >    visible to all CPUs
> > > > > b) A write that is visible to all CPUs is also visible to all other
> > > > >    observers in the shareability domain
> > > > >
> > > > > This allows for using cheaper DMB instructions in the place of DSB
> > > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to).
> > > > >
> > > > > Please refer to [1], [2] and [3] for more information.
> > > > >
> > > > > [1]
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > > > >
> > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > ---
> > > > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > index 7b7099cdc..e406411bb 100644
> > > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > @@ -19,11 +19,11 @@ extern "C" {
> > > > >  #include <rte_compat.h>
> > > > >  #include <rte_debug.h>
> > > > >
> > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > > > >
> > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > >
> > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > >
> > > > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > > > >
> > > > > @@ -37,9 +37,9 @@ extern "C" {
> > > > >
> > > > >  #define rte_io_rmb() rte_rmb()
> > > > >
> > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > > +#define rte_cio_wmb() rte_wmb()
> > > > >
> > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > > +#define rte_cio_rmb() rte_rmb()
> > > > >
> > > > >  /*------------------------ 128 bit atomic operations
> > > > > -------------------------*/
> > > > >
> > > > > --
> > > > > 2.17.1
> > > >
> > > > This change showed about 7% performance gain in testpmd single core
> > > NDR test.
> > >
> > > I am trying to understand this patch wrt DPDK current usage model?
> > >
> > > 1)  Is performance improvement due to the fact that the PMD that you
> > > are using it for testing suppose to use existing rte_cio_* but it was
> > > using rte_[rw]mb?
> No, it is supposed to use rte_[rw]mb for x86.

Why are drivers using rte_[rw]mb in the fastpath? IMO, rte_io_[rw]mb and
rte_cio_[rw]mb were created for this purpose.

But I understand, on x86 it is mapped to rte_compiler_barrier(). Is that
correct from the x86 PoV?
@Ananyev, Konstantin @Richardson, Bruce ?

For x86:
#define rte_io_wmb() rte_compiler_barrier()
#define rte_io_rmb() rte_compiler_barrier()
#define rte_cio_wmb() rte_compiler_barrier()
#define rte_cio_rmb() rte_compiler_barrier()


>
> >
> > This is part of the reason. There are also cases where rte_io_* was used and
> > can be relaxed.
> > Such as: http://patches.dpdk.org/patch/68162/
> >
> > > 2) In my understanding :
> > > a) CPU to CPU barrier requirements are addressed by rte_smp_*
> > > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
> > > c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> > >
> > > If (c) is true then we are violating the DPDK spec with change. Right?
> No, we are not. Essentially, due to the other-multi-copy atomicity behavior of the architecture, we are saying 'DMB OSH*' is enough to achieve (c).

Yeah. Probably from the userspace PoV it should be OK to use "DMB OSH*" as
the barrier between the four of them?

1) Global memory (BSS and Data sections), Not mapped as a hugepage.
2) Hugepage memory
3) IOVA memory
4) PCI register read/write

Do we need to worry about anything else which is specific to DSB?
For example, TLB-related flushes etc.

If we are taking this path, then rte_cio_[rw]mb() has no meaning in
DPDK as an abstraction, as it was created for arm64 for this specific
purpose.
If we can meet all DPDK use cases with DMB OSH, then probably we can
deprecate rte_cio_wmb to avoid confusion.

>
> >
> > Developers are still required to use correct barrier APIs for different use cases.
> > I think this change mitigates performance penalty when non optimal barrier is
> > used.
> >
> > > This change will not be required if fastpath (CPU to Device) is using
> > rte_cio_*.
> > > Right?
> Yes. It is required when the fastpath uses rte_[rw]mb.
>
> >
> > See 1). Correct usage of rte_cio_* is not the whole.
> > For some other use cases, such as barrier between accesses of different
> > memory types, we can also use lighter barrier 'dmb'.
> >
> > >
> > >
> > >
> > > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > >
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a
  2020-05-13 14:49             ` Jerin Jacob
@ 2020-05-14  1:02               ` Honnappa Nagarahalli
  0 siblings, 0 replies; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-14  1:02 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Ruifeng Wang, dev, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	igorch, thomas, viacheslavo, arybchenko, nd, Richardson, Bruce,
	Honnappa Nagarahalli, nd

<snip>
> >
> > > > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a
> > > > > >
> > > > > > Change the barrier APIs for IO to reflect that Armv8-a is
> > > > > > other-multi-copy atomicity memory model.
> > > > > >
> > > > > > Armv8-a memory model has been strengthened to require
> > > > > > other-multi-copy atomicity. This property requires memory
> > > > > > accesses from an observer to become visible to all other
> > > > > > observers simultaneously [3]. This means
> > > > > >
> > > > > > a) A write arriving at an endpoint shared between multiple CPUs is
> > > > > >    visible to all CPUs
> > > > > > b) A write that is visible to all CPUs is also visible to all other
> > > > > >    observers in the shareability domain
> > > > > >
> > > > > > This allows for using cheaper DMB instructions in the place of
> > > > > > DSB for devices that are visible to all CPUs (i.e. devices that DPDK
> caters to).
> > > > > >
> > > > > > Please refer to [1], [2] and [3] for more information.
> > > > > >
> > > > > > [1]
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > > > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> > > > > >
> > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > <honnappa.nagarahalli@arm.com>
> > > > > > ---
> > > > > >  lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> > > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > index 7b7099cdc..e406411bb 100644
> > > > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > > > > > @@ -19,11 +19,11 @@ extern "C" {  #include <rte_compat.h>
> > > > > > #include <rte_debug.h>
> > > > > >
> > > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > > > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> > > > > >
> > > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > > > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> > > > > >
> > > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > > > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> > > > > >
> > > > > >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> > > > > >
> > > > > > @@ -37,9 +37,9 @@ extern "C" {
> > > > > >
> > > > > >  #define rte_io_rmb() rte_rmb()
> > > > > >
> > > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : :
> > > > > > "memory")
> > > > > > +#define rte_cio_wmb() rte_wmb()
> > > > > >
> > > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : :
> > > > > > "memory")
> > > > > > +#define rte_cio_rmb() rte_rmb()
> > > > > >
> > > > > >  /*------------------------ 128 bit atomic operations
> > > > > > -------------------------*/
> > > > > >
> > > > > > --
> > > > > > 2.17.1
> > > > >
> > > > > This change showed about 7% performance gain in testpmd single
> > > > > core
> > > > NDR test.
> > > >
> > > > I am trying to understand this patch wrt DPDK current usage model?
> > > >
> > > > 1)  Is performance improvement due to the fact that the PMD that
> > > > you are using it for testing suppose to use existing rte_cio_* but
> > > > it was using rte_[rw]mb?
> > No, it is supposed to use rte_[rw]mb for x86.
> 
> Why drivers using rte_[rw]in fastpath, IMO, rte_io_[rw]b and rte_cio_[rw]b
> created for this pupose.
> 
> But I understand, in x86 it is mapped to rte_compiler_barrier(). Is it correct
> from x86 PoV?
> @Ananyev, Konstantin @Richardson, Bruce ?
> 
> For x86:
> #define rte_io_wmb() rte_compiler_barrier() #define rte_io_rmb()
> rte_compiler_barrier() #define rte_cio_wmb() rte_compiler_barrier() #define
> rte_cio_rmb() rte_compiler_barrier()
We need a barrier API with 'DMB OSH*' for Arm and '*fence' for x86. My understanding is, '*fence' is required when WC memory is used on x86.

Also, from the Arm architecture perspective, effectively we are saying that 'DSB' is not required for portable drivers.
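
In macro form, the intended per-arch mappings would look something like
this (a sketch, reusing the rte_dma_* name proposed earlier in this
thread; not merged code, and each pair would live in its arch header):

        /* aarch64: DMB limited to the outer shareable domain. */
        #define rte_dma_wmb() asm volatile("dmb oshst" : : : "memory")
        #define rte_dma_rmb() asm volatile("dmb oshld" : : : "memory")

        /* x86: real fences, needed once write-combining (WC) mappings
         * are in use. */
        #define rte_dma_wmb() asm volatile("sfence" : : : "memory")
        #define rte_dma_rmb() asm volatile("lfence" : : : "memory")
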

> 
> 
> >
> > >
> > > This is part of the reason. There are also cases where rte_io_* was
> > > used and can be relaxed.
> > > Such as: http://patches.dpdk.org/patch/68162/
> > >
> > > > 2) In my understanding :
> > > > a) CPU to CPU barrier requirements are addressed by rte_smp_*
> > > > b) CPU to DMA/Device barrier requirements are addressed by
> > > > rte_cio_*
> > > > c) CPU to ANY(CPU or Device) are addressed by  rte_[rw]mb
> > > >
> > > > If (c) is true then we are violating the DPDK spec with change. Right?
> > No, we are not. Essentially, due to the other-multi-copy atomicity behavior
> of the architecture, we are saying 'DMB OSH*' is enough to achieve (c).
> 
> Yeah. Probably from userspace POV it should be OK to use "DMB OSH*" to
> have the barrier between 4 of them?
> 
> 1) Global memory (BSS and Data sections), Not mapped as a hugepage.
> 2) Hugepage memory
> 3) IOVA memory
> 4) PCI register read/write
> 
> Do we need to worry about anything else which is specific to DSB?
> example, TLB related flush etc.
Yes, things like TLB flushes or self-modifying code still need DSB. But my understanding is we do not have such code in DPDK, and such code would be platform-specific.
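
For example, EL1 TLB maintenance (privileged code, shown only to
illustrate where DSB remains necessary) still has to complete with a DSB:

        asm volatile("tlbi vmalle1is\n\t" /* invalidate EL1 TLB entries */
                     "dsb ish\n\t"        /* wait for completion */
                     "isb"                /* resync instruction stream */
                     : : : "memory");
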

> 
> If we are talking this path then rte_cio_[rw]mb() has no meaning in DPDK as
> an abstraction as it was created for arm64 for this specific purpose.
> If we can meet all DPDK usecse with DMB OSH then probably we can
> deprecate rte_cio_wmb to avoid confusion.
Agreed, rte_cio_*mb is confusing to me. We could deprecate those.

I see the Octeon TX/TX2 drivers using rte_*mb. Do you see any issues with this change in those drivers?

This is a very fundamental change; we need more feedback from others working with Arm platforms.

> 
> >
> > >
> > > Developers are still required to use correct barrier APIs for different use
> cases.
> > > I think this change mitigates performance penalty when non optimal
> > > barrier is used.
> > >
> > > > This change will not be required if fastpath (CPU to Device) is
> > > > using
> > > rte_cio_*.
> > > > Right?
> > Yes. It is required when the fastpath uses rte_[rw]mb.
> >
> > >
> > > See 1). Correct usage of rte_cio_* is not the whole.
> > > For some other use cases, such as barrier between accesses of
> > > different memory types, we can also use lighter barrier 'dmb'.
> > >
> > > >
> > > >
> > > >
> > > > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > >
> >

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 6/7] net/mlx5: relax ordering for multi-packet RQ buffer refcnt Gavin Hu
@ 2020-06-23  8:26   ` Phil Yang
  2020-07-13  3:02     ` Phil Yang
  2020-09-03  2:53     ` [dpdk-dev] [PATCH v4] " Phil Yang
  0 siblings, 2 replies; 70+ messages in thread
From: Phil Yang @ 2020-06-23  8:26 UTC (permalink / raw)
  To: dev; +Cc: matan, shahafs, viacheslavo, Honnappa.Nagarahalli, drc, nd

Use C11 atomics with explicit ordering instead of the rte_atomic ops,
which enforce unnecessary barriers on aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
---
v3:
Split from the patchset:
http://patchwork.dpdk.org/cover/68159/

 drivers/net/mlx5/mlx5_rxq.c  |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
 drivers/net/mlx5/mlx5_rxtx.h |  2 +-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index dda0073..7f487f1 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,
 
 	memset(_m, 0, sizeof(*buf));
 	buf->mp = mp;
-	rte_atomic16_set(&buf->refcnt, 1);
+	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 	for (j = 0; j != strd_n; ++j) {
 		shinfo = &buf->shinfos[j];
 		shinfo->free_cb = mlx5_mprq_buf_free_cb;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index e4106bf..f0eda88 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
 {
 	struct mlx5_mprq_buf *buf = opaque;
 
-	if (rte_atomic16_read(&buf->refcnt) == 1) {
+	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
 		rte_mempool_put(buf->mp, buf);
-	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
-		rte_atomic16_set(&buf->refcnt, 1);
+	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
+					       __ATOMIC_RELAXED) == 0)) {
+		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 		rte_mempool_put(buf->mp, buf);
 	}
 }
@@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 
 		if (consumed_strd == strd_n) {
 			/* Replace WQE only if the buffer is still in use. */
-			if (rte_atomic16_read(&buf->refcnt) > 1) {
+			if (__atomic_load_n(&buf->refcnt,
+					    __ATOMIC_RELAXED) > 1) {
 				mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
 				/* Release the old buffer. */
 				mlx5_mprq_buf_free(buf);
@@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			void *buf_addr;
 
 			/* Increment the refcnt of the whole chunk. */
-			rte_atomic16_add_return(&buf->refcnt, 1);
-			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
-				    strd_n + 1);
+			__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
+			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
+				    __ATOMIC_RELAXED) <= strd_n + 1);
 			buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM);
 			/*
 			 * MLX5 device doesn't use iova but it is necessary in a
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 26621ff..0fc15f3 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -78,7 +78,7 @@ struct rxq_zip {
 /* Multi-Packet RQ buffer header. */
 struct mlx5_mprq_buf {
 	struct rte_mempool *mp;
-	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
+	uint16_t refcnt; /* Atomically accessed refcnt. */
 	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
 	struct rte_mbuf_ext_shared_info shinfos[];
 	/*
-- 
2.7.4
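
For reference, the gain comes from the ordering semantics, not the
atomicity (a sketch; the exact codegen depends on the compiler and
-march):

        /* Before: rte_atomic16_t refcnt; rte_atomic16_add_return() is
         * built on __sync builtins, i.e. a fully ordered RMW. */
        rte_atomic16_add_return(&buf->refcnt, 1);

        /* After: uint16_t refcnt; the C11 builtin performs the same
         * atomic RMW but only with the ordering explicitly requested
         * (the patch uses relaxed or acquire as needed). */
        __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
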


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2] eal: adjust barriers for IO on Armv8-a
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD Gavin Hu
  2020-04-10 17:20   ` Andrew Rybchenko
  2020-05-11 18:06   ` [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a Honnappa Nagarahalli
@ 2020-06-27 19:12   ` Honnappa Nagarahalli
  2020-06-27 19:25     ` Honnappa Nagarahalli
  2020-07-03 18:57   ` [dpdk-dev] [PATCH v3 1/3] " Honnappa Nagarahalli
  2020-07-06 23:43   ` [dpdk-dev] [PATCH v4 1/3] eal: adjust barriers for IO on Armv8-a Honnappa Nagarahalli
  4 siblings, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-06-27 19:12 UTC (permalink / raw)
  To: dev, honnappa.nagarahalli, ruifeng.wang, jerinj, hemant.agrawal,
	ajit.khaparde, igorch, thomas, viacheslavo, arybchenko,
	bruce.richardson
  Cc: nd

Change the barrier APIs for IO to reflect that Armv8-a has an other-multi-copy
atomicity memory model.

Armv8-a memory model has been strengthened to require
other-multi-copy atomicity. This property requires memory accesses
from an observer to become visible to all other observers
simultaneously [3]. This means

a) A write arriving at an endpoint shared between multiple CPUs is
   visible to all CPUs
b) A write that is visible to all CPUs is also visible to all other
   observers in the shareability domain

This allows for using cheaper DMB instructions in the place of DSB
for devices that are visible to all CPUs (i.e. devices that DPDK
caters to).

Please refer to [1], [2] and [3] for more information.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
[2] https://www.youtube.com/watch?v=i6DayghhA8Q
[3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h
index 7b7099cdc..e42f69edc 100644
--- a/lib/librte_eal/arm/include/rte_atomic_64.h
+++ b/lib/librte_eal/arm/include/rte_atomic_64.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2015 Cavium, Inc
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2020 Arm Limited
  */
 
 #ifndef _RTE_ATOMIC_ARM64_H_
@@ -19,11 +19,11 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_debug.h>
 
-#define rte_mb() asm volatile("dsb sy" : : : "memory")
+#define rte_mb() asm volatile("dmb osh" : : : "memory")
 
-#define rte_wmb() asm volatile("dsb st" : : : "memory")
+#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
 
-#define rte_rmb() asm volatile("dsb ld" : : : "memory")
+#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
 
 #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
 
@@ -37,9 +37,9 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
-#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
+#define rte_cio_wmb() rte_wmb()
 
-#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
+#define rte_cio_rmb() rte_rmb()
 
 /*------------------------ 128 bit atomic operations -------------------------*/
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v2] eal: adjust barriers for IO on Armv8-a
  2020-06-27 19:12   ` [dpdk-dev] [PATCH v2] " Honnappa Nagarahalli
@ 2020-06-27 19:25     ` Honnappa Nagarahalli
  2020-06-30  5:13       ` Jerin Jacob
  0 siblings, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-06-27 19:25 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, Ruifeng Wang, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	igorch, thomas, viacheslavo, arybchenko, bruce.richardson
  Cc: nd, Honnappa Nagarahalli, nd

Hi Jerin,
	You had a comment earlier about deprecating rte_cio_[rw]mb. Let me know if you are ok with this patch and I can add those changes (replace references to rte_cio_[rw]mb with rte_io_[rw]mb and a deprecation notice).

Thanks,
Honnappa

> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Saturday, June 27, 2020 2:12 PM
> To: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Ajit Khaparde (ajit.khaparde@broadcom.com)
> <ajit.khaparde@broadcom.com>; igorch@amazon.com;
> thomas@monjalon.net; viacheslavo@mellanox.com;
> arybchenko@solarflare.com; bruce.richardson@intel.com
> Cc: nd <nd@arm.com>
> Subject: [PATCH v2] eal: adjust barriers for IO on Armv8-a
> 
> Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy
> atomicity memory model.
> 
> Armv8-a memory model has been strengthened to require other-multi-copy
> atomicity. This property requires memory accesses from an observer to
> become visible to all other observers simultaneously [3]. This means
> 
> a) A write arriving at an endpoint shared between multiple CPUs is
>    visible to all CPUs
> b) A write that is visible to all CPUs is also visible to all other
>    observers in the shareability domain
> 
> This allows for using cheaper DMB instructions in the place of DSB for devices
> that are visible to all CPUs (i.e. devices that DPDK caters to).
> 
> Please refer to [1], [2] and [3] for more information.
> 
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id
> =22ec71615d824f4f11d38d0e55a88d8956b7e45f
> [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> b/lib/librte_eal/arm/include/rte_atomic_64.h
> index 7b7099cdc..e42f69edc 100644
> --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> @@ -1,6 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   * Copyright(c) 2015 Cavium, Inc
> - * Copyright(c) 2019 Arm Limited
> + * Copyright(c) 2020 Arm Limited
>   */
> 
>  #ifndef _RTE_ATOMIC_ARM64_H_
> @@ -19,11 +19,11 @@ extern "C" {
>  #include <rte_compat.h>
>  #include <rte_debug.h>
> 
> -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> 
> -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> 
> -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> 
>  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> 
> @@ -37,9 +37,9 @@ extern "C" {
> 
>  #define rte_io_rmb() rte_rmb()
> 
> -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> +#define rte_cio_wmb() rte_wmb()
> 
> -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> +#define rte_cio_rmb() rte_rmb()
> 
>  /*------------------------ 128 bit atomic operations -------------------------*/
> 
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v2] eal: adjust barriers for IO on Armv8-a
  2020-06-27 19:25     ` Honnappa Nagarahalli
@ 2020-06-30  5:13       ` Jerin Jacob
  0 siblings, 0 replies; 70+ messages in thread
From: Jerin Jacob @ 2020-06-30  5:13 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: dev, Ruifeng Wang, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	igorch, thomas, viacheslavo, arybchenko, bruce.richardson, nd

On Sun, Jun 28, 2020 at 12:55 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> Hi Jerin,
>         You had a comment earlier about deprecating rte_cio_[rw]mb. Let me know if you are ok with this patch and I can add those changes (replace references to rte_cio_[rw]mb with rte_io_[rw]mb and a deprecation notice).

Acked-by: Jerin Jacob <jerinj@marvell.com> for this patch
Please send the deprecation notice for overlapping rte_cio_* for 20.11


>
> Thanks,
> Honnappa
>
> > -----Original Message-----
> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Sent: Saturday, June 27, 2020 2:12 PM
> > To: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > Ruifeng Wang <Ruifeng.Wang@arm.com>; jerinj@marvell.com;
> > hemant.agrawal@nxp.com; Ajit Khaparde (ajit.khaparde@broadcom.com)
> > <ajit.khaparde@broadcom.com>; igorch@amazon.com;
> > thomas@monjalon.net; viacheslavo@mellanox.com;
> > arybchenko@solarflare.com; bruce.richardson@intel.com
> > Cc: nd <nd@arm.com>
> > Subject: [PATCH v2] eal: adjust barriers for IO on Armv8-a
> >
> > Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy
> > atomicity memory model.
> >
> > Armv8-a memory model has been strengthened to require other-multi-copy
> > atomicity. This property requires memory accesses from an observer to
> > become visible to all other observers simultaneously [3]. This means
> >
> > a) A write arriving at an endpoint shared between multiple CPUs is
> >    visible to all CPUs
> > b) A write that is visible to all CPUs is also visible to all other
> >    observers in the shareability domain
> >
> > This allows for using cheaper DMB instructions in the place of DSB for devices
> > that are visible to all CPUs (i.e. devices that DPDK caters to).
> >
> > Please refer to [1], [2] and [3] for more information.
> >
> > [1]
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id
> > =22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> > b/lib/librte_eal/arm/include/rte_atomic_64.h
> > index 7b7099cdc..e42f69edc 100644
> > --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> > @@ -1,6 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   * Copyright(c) 2015 Cavium, Inc
> > - * Copyright(c) 2019 Arm Limited
> > + * Copyright(c) 2020 Arm Limited
> >   */
> >
> >  #ifndef _RTE_ATOMIC_ARM64_H_
> > @@ -19,11 +19,11 @@ extern "C" {
> >  #include <rte_compat.h>
> >  #include <rte_debug.h>
> >
> > -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> > +#define rte_mb() asm volatile("dmb osh" : : : "memory")
> >
> > -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
> >
> > -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
> >
> >  #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
> >
> > @@ -37,9 +37,9 @@ extern "C" {
> >
> >  #define rte_io_rmb() rte_rmb()
> >
> > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> > +#define rte_cio_wmb() rte_wmb()
> >
> > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> > +#define rte_cio_rmb() rte_rmb()
> >
> >  /*------------------------ 128 bit atomic operations -------------------------*/
> >
> > --
> > 2.17.1
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 1/3] eal: adjust barriers for IO on Armv8-a
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD Gavin Hu
                     ` (2 preceding siblings ...)
  2020-06-27 19:12   ` [dpdk-dev] [PATCH v2] " Honnappa Nagarahalli
@ 2020-07-03 18:57   ` Honnappa Nagarahalli
  2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 2/3] doc: update armv8-a IO barrier changes Honnappa Nagarahalli
  2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
  2020-07-06 23:43   ` [dpdk-dev] [PATCH v4 1/3] eal: adjust barriers for IO on Armv8-a Honnappa Nagarahalli
  4 siblings, 2 replies; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-03 18:57 UTC (permalink / raw)
  To: dev, honnappa.nagarahalli, ruifeng.wang, jerinj, hemant.agrawal,
	ajit.khaparde, igorch, thomas, viacheslavo, arybchenko,
	bruce.richardson
  Cc: nd

Change the barrier APIs for IO to reflect that the Armv8-a memory model
provides other-multi-copy atomicity.

The Armv8-a memory model has been strengthened to require
other-multi-copy atomicity. This property requires that a memory
access from one observer becomes visible to all other observers
simultaneously [3]. This means:

a) A write arriving at an endpoint shared between multiple CPUs is
   visible to all CPUs
b) A write that is visible to all CPUs is also visible to all other
   observers in the shareability domain

This allows the cheaper DMB instruction to be used in place of DSB
for devices that are visible to all CPUs (i.e. the devices that DPDK
caters to).
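
As an illustration, the pattern these IO barriers order is the usual
doorbell write that follows a descriptor update. A minimal sketch only,
not part of this patch ('desc', 'db_reg' and 'tail' are hypothetical
driver names):

	/* Fill a DMA descriptor in coherent memory. */
	desc->addr = rte_mbuf_data_iova(mbuf);
	desc->len  = rte_pktmbuf_data_len(mbuf);
	/* The descriptor must be observable by the device before the
	 * doorbell rings; with this patch rte_wmb() is DMB OSHST
	 * instead of DSB ST.
	 */
	rte_wmb();
	*(volatile uint32_t *)db_reg = tail;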

Please refer to [1], [2] and [3] for more information.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
[2] https://www.youtube.com/watch?v=i6DayghhA8Q
[3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h
index 7b7099cdc..e42f69edc 100644
--- a/lib/librte_eal/arm/include/rte_atomic_64.h
+++ b/lib/librte_eal/arm/include/rte_atomic_64.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2015 Cavium, Inc
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2020 Arm Limited
  */
 
 #ifndef _RTE_ATOMIC_ARM64_H_
@@ -19,11 +19,11 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_debug.h>
 
-#define rte_mb() asm volatile("dsb sy" : : : "memory")
+#define rte_mb() asm volatile("dmb osh" : : : "memory")
 
-#define rte_wmb() asm volatile("dsb st" : : : "memory")
+#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
 
-#define rte_rmb() asm volatile("dsb ld" : : : "memory")
+#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
 
 #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
 
@@ -37,9 +37,9 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
-#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
+#define rte_cio_wmb() rte_wmb()
 
-#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
+#define rte_cio_rmb() rte_rmb()
 
 /*------------------------ 128 bit atomic operations -------------------------*/
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 2/3] doc: update armv8-a IO barrier changes
  2020-07-03 18:57   ` [dpdk-dev] [PATCH v3 1/3] " Honnappa Nagarahalli
@ 2020-07-03 18:57     ` Honnappa Nagarahalli
  2020-07-05  0:57       ` Jerin Jacob
  2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
  1 sibling, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-03 18:57 UTC (permalink / raw)
  To: dev, honnappa.nagarahalli, ruifeng.wang, jerinj, hemant.agrawal,
	ajit.khaparde, igorch, thomas, viacheslavo, arybchenko,
	bruce.richardson
  Cc: nd

Updated the use of DMB instruction in rte_*mb APIs for Armv8-a.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/rel_notes/release_20_08.rst | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
index 5cbc4ce14..15c21996d 100644
--- a/doc/guides/rel_notes/release_20_08.rst
+++ b/doc/guides/rel_notes/release_20_08.rst
@@ -56,6 +56,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =========================================================
 
+* **rte_*mb APIs are updated to use DMB instruction.**
+
+  Armv8-a memory model has been strengthened to require other-multi-copy
+  atomicity. This allows for using DMB instruction instead of DSB for IO
+  barriers. rte_*mb APIs, for Armv8-a platforms, are changed to use DMB
+  instruction to reflect this.
+
 * **Updated PCAP driver.**
 
   Updated PCAP driver with new features and improvements, including:
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs
  2020-07-03 18:57   ` [dpdk-dev] [PATCH v3 1/3] " Honnappa Nagarahalli
  2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 2/3] doc: update armv8-a IO barrier changes Honnappa Nagarahalli
@ 2020-07-03 18:57     ` Honnappa Nagarahalli
  2020-07-05  0:57       ` Jerin Jacob
                         ` (2 more replies)
  1 sibling, 3 replies; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-03 18:57 UTC (permalink / raw)
  To: dev, honnappa.nagarahalli, ruifeng.wang, jerinj, hemant.agrawal,
	ajit.khaparde, igorch, thomas, viacheslavo, arybchenko,
	bruce.richardson
  Cc: nd

rte_cio_*mb APIs will be deprecated in 20.11 release.
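
For driver code the switch is a mechanical rename; a hypothetical hunk,
shown only for illustration:

-	rte_cio_wmb();
+	rte_io_wmb();

Both APIs now expand to the same barriers on all platforms, so the
rename does not change the ordering a driver gets.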

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/rel_notes/deprecation.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index d1034f60f..59656da3d 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -40,6 +40,12 @@ Deprecation Notices
   These wrappers must be used for patches that need to be merged in 20.08
   onwards. This change will not introduce any performance degradation.
 
+* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
+  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
+  APIs(taking all platforms into consideration). rte_io_*mb APIs should be used
+  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
+  20.11 release.
+
 * igb_uio: In the view of reducing the kernel dependency from the main tree,
   as a first step, the Technical Board decided to move ``igb_uio``
   kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/3] doc: update armv8-a IO barrier changes
  2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 2/3] doc: update armv8-a IO barrier changes Honnappa Nagarahalli
@ 2020-07-05  0:57       ` Jerin Jacob
  0 siblings, 0 replies; 70+ messages in thread
From: Jerin Jacob @ 2020-07-05  0:57 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: dpdk-dev, Ruifeng Wang (Arm Technology China),
	Jerin Jacob, Hemant Agrawal, Ajit Khaparde, Igor Chauskin,
	Thomas Monjalon, Slava Ovsiienko, Andrew Rybchenko, Richardson,
	Bruce, nd

On Sat, Jul 4, 2020 at 12:28 AM Honnappa Nagarahalli
<honnappa.nagarahalli@arm.com> wrote:
>
> Updated the use of DMB instruction in rte_*mb APIs for Armv8-a.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  doc/guides/rel_notes/release_20_08.rst | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
> index 5cbc4ce14..15c21996d 100644
> --- a/doc/guides/rel_notes/release_20_08.rst
> +++ b/doc/guides/rel_notes/release_20_08.rst
> @@ -56,6 +56,13 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =========================================================
>
> +* **rte_*mb APIs are updated to use DMB instruction.**

IMO, it is better to change it to the following, as the end user can then
skip the description if not interested in arm64.

rte_*mb APIs are updated to use DMB instruction for Armv8-a

With  above change:
Acked-by: Jerin Jacob <jerinj@marvell.com>

> +
> +  Armv8-a memory model has been strengthened to require other-multi-copy
> +  atomicity. This allows for using DMB instruction instead of DSB for IO
> +  barriers. rte_*mb APIs, for Armv8-a platforms, are changed to use DMB
> +  instruction to reflect this.
> +
>  * **Updated PCAP driver.**
>
>    Updated PCAP driver with new features and improvements, including:
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs
  2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
@ 2020-07-05  0:57       ` Jerin Jacob
  2020-07-07 20:19       ` Ajit Khaparde
  2020-07-08 11:05       ` Ananyev, Konstantin
  2 siblings, 0 replies; 70+ messages in thread
From: Jerin Jacob @ 2020-07-05  0:57 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: dpdk-dev, Ruifeng Wang (Arm Technology China),
	Jerin Jacob, Hemant Agrawal, Ajit Khaparde, Igor Chauskin,
	Thomas Monjalon, Slava Ovsiienko, Andrew Rybchenko, Richardson,
	Bruce, nd

On Sat, Jul 4, 2020 at 12:28 AM Honnappa Nagarahalli
<honnappa.nagarahalli@arm.com> wrote:
>
> rte_cio_*mb APIs will be deprecated in 20.11 release.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Jerin Jacob <jerinj@marvell.com>


> ---
>  doc/guides/rel_notes/deprecation.rst | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index d1034f60f..59656da3d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -40,6 +40,12 @@ Deprecation Notices
>    These wrappers must be used for patches that need to be merged in 20.08
>    onwards. This change will not introduce any performance degradation.
>
> +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
> +  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
> +  APIs(taking all platforms into consideration). rte_io_*mb APIs should be used
> +  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
> +  20.11 release.
> +
>  * igb_uio: In the view of reducing the kernel dependency from the main tree,
>    as a first step, the Technical Board decided to move ``igb_uio``
>    kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 1/3] eal: adjust barriers for IO on Armv8-a
  2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD Gavin Hu
                     ` (3 preceding siblings ...)
  2020-07-03 18:57   ` [dpdk-dev] [PATCH v3 1/3] " Honnappa Nagarahalli
@ 2020-07-06 23:43   ` Honnappa Nagarahalli
  2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 2/3] doc: update armv8-a IO barrier changes Honnappa Nagarahalli
  2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
  4 siblings, 2 replies; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-06 23:43 UTC (permalink / raw)
  To: dev, honnappa.nagarahalli, ruifeng.wang, jerinj, hemant.agrawal,
	ajit.khaparde, igorch, thomas, viacheslavo, arybchenko,
	bruce.richardson
  Cc: nd

Change the barrier APIs for IO to reflect that the Armv8-a memory model
provides other-multi-copy atomicity.

The Armv8-a memory model has been strengthened to require
other-multi-copy atomicity. This property requires that a memory
access from one observer becomes visible to all other observers
simultaneously [3]. This means:

a) A write arriving at an endpoint shared between multiple CPUs is
   visible to all CPUs
b) A write that is visible to all CPUs is also visible to all other
   observers in the shareability domain

This allows the cheaper DMB instruction to be used in place of DSB
for devices that are visible to all CPUs (i.e. the devices that DPDK
caters to).

Please refer to [1], [2] and [3] for more information.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
[2] https://www.youtube.com/watch?v=i6DayghhA8Q
[3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h
index 7b7099cdc..e42f69edc 100644
--- a/lib/librte_eal/arm/include/rte_atomic_64.h
+++ b/lib/librte_eal/arm/include/rte_atomic_64.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2015 Cavium, Inc
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2020 Arm Limited
  */
 
 #ifndef _RTE_ATOMIC_ARM64_H_
@@ -19,11 +19,11 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_debug.h>
 
-#define rte_mb() asm volatile("dsb sy" : : : "memory")
+#define rte_mb() asm volatile("dmb osh" : : : "memory")
 
-#define rte_wmb() asm volatile("dsb st" : : : "memory")
+#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
 
-#define rte_rmb() asm volatile("dsb ld" : : : "memory")
+#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
 
 #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
 
@@ -37,9 +37,9 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
-#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
+#define rte_cio_wmb() rte_wmb()
 
-#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
+#define rte_cio_rmb() rte_rmb()
 
 /*------------------------ 128 bit atomic operations -------------------------*/
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 2/3] doc: update armv8-a IO barrier changes
  2020-07-06 23:43   ` [dpdk-dev] [PATCH v4 1/3] eal: adjust barriers for IO on Armv8-a Honnappa Nagarahalli
@ 2020-07-06 23:43     ` Honnappa Nagarahalli
  2020-07-07  8:36       ` David Marchand
  2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
  1 sibling, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-06 23:43 UTC (permalink / raw)
  To: dev, honnappa.nagarahalli, ruifeng.wang, jerinj, hemant.agrawal,
	ajit.khaparde, igorch, thomas, viacheslavo, arybchenko,
	bruce.richardson
  Cc: nd

Updated the use of DMB instruction in rte_*mb APIs for Armv8-a.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 doc/guides/rel_notes/release_20_08.rst | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
index 5cbc4ce14..567ae6b2a 100644
--- a/doc/guides/rel_notes/release_20_08.rst
+++ b/doc/guides/rel_notes/release_20_08.rst
@@ -56,6 +56,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =========================================================
 
+* **rte_*mb APIs are updated to use DMB instruction for Armv8-a.**
+
+  Armv8-a memory model has been strengthened to require other-multi-copy
+  atomicity. This allows for using DMB instruction instead of DSB for IO
+  barriers. rte_*mb APIs, for Armv8-a platforms, are changed to use DMB
+  instruction to reflect this.
+
 * **Updated PCAP driver.**
 
   Updated PCAP driver with new features and improvements, including:
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs
  2020-07-06 23:43   ` [dpdk-dev] [PATCH v4 1/3] eal: adjust barriers for IO on Armv8-a Honnappa Nagarahalli
  2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 2/3] doc: update armv8-a IO barrier changes Honnappa Nagarahalli
@ 2020-07-06 23:43     ` Honnappa Nagarahalli
  2020-07-07  8:39       ` David Marchand
                         ` (2 more replies)
  1 sibling, 3 replies; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-06 23:43 UTC (permalink / raw)
  To: dev, honnappa.nagarahalli, ruifeng.wang, jerinj, hemant.agrawal,
	ajit.khaparde, igorch, thomas, viacheslavo, arybchenko,
	bruce.richardson
  Cc: nd

rte_cio_*mb APIs will be deprecated in 20.11 release.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 doc/guides/rel_notes/deprecation.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index d1034f60f..59656da3d 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -40,6 +40,12 @@ Deprecation Notices
   These wrappers must be used for patches that need to be merged in 20.08
   onwards. This change will not introduce any performance degradation.
 
+* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
+  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
+  APIs(taking all platforms into consideration). rte_io_*mb APIs should be used
+  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
+  20.11 release.
+
 * igb_uio: In the view of reducing the kernel dependency from the main tree,
   as a first step, the Technical Board decided to move ``igb_uio``
   kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/3] doc: update armv8-a IO barrier changes
  2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 2/3] doc: update armv8-a IO barrier changes Honnappa Nagarahalli
@ 2020-07-07  8:36       ` David Marchand
  2020-07-07 18:37         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 70+ messages in thread
From: David Marchand @ 2020-07-07  8:36 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: dev, Ruifeng Wang (Arm Technology China),
	Jerin Jacob Kollanukkaran, Hemant Agrawal, Ajit Khaparde,
	Igor Chauskin, Thomas Monjalon, Viacheslav Ovsiienko,
	Andrew Rybchenko, Bruce Richardson, nd

On Tue, Jul 7, 2020 at 1:44 AM Honnappa Nagarahalli
<honnappa.nagarahalli@arm.com> wrote:
>
> Updated the use of DMB instruction in rte_*mb APIs for Armv8-a.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
>  doc/guides/rel_notes/release_20_08.rst | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
> index 5cbc4ce14..567ae6b2a 100644
> --- a/doc/guides/rel_notes/release_20_08.rst
> +++ b/doc/guides/rel_notes/release_20_08.rst
> @@ -56,6 +56,13 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =========================================================
>

This release note update will be squashed with the change itself in patch 1.

> +* **rte_*mb APIs are updated to use DMB instruction for Armv8-a.**

We use "ARMv8" in the release notes, any objection if I update this
when applying?


> +
> +  Armv8-a memory model has been strengthened to require other-multi-copy
> +  atomicity. This allows for using DMB instruction instead of DSB for IO
> +  barriers. rte_*mb APIs, for Armv8-a platforms, are changed to use DMB
> +  instruction to reflect this.
> +
>  * **Updated PCAP driver.**
>
>    Updated PCAP driver with new features and improvements, including:
> --
> 2.17.1
>


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs
  2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
@ 2020-07-07  8:39       ` David Marchand
  2020-07-07 20:14       ` David Christensen
  2020-07-08 11:49       ` David Marchand
  2 siblings, 0 replies; 70+ messages in thread
From: David Marchand @ 2020-07-07  8:39 UTC (permalink / raw)
  To: dev
  Cc: Ruifeng Wang (Arm Technology China),
	Jerin Jacob Kollanukkaran, Hemant Agrawal, Ajit Khaparde,
	Igor Chauskin, Thomas Monjalon, Viacheslav Ovsiienko,
	Andrew Rybchenko, Bruce Richardson, nd, Honnappa Nagarahalli,
	Ananyev, Konstantin, David Christensen

On Tue, Jul 7, 2020 at 1:44 AM Honnappa Nagarahalli
<honnappa.nagarahalli@arm.com> wrote:
>
> rte_cio_*mb APIs will be deprecated in 20.11 release.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
>  doc/guides/rel_notes/deprecation.rst | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index d1034f60f..59656da3d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -40,6 +40,12 @@ Deprecation Notices
>    These wrappers must be used for patches that need to be merged in 20.08
>    onwards. This change will not introduce any performance degradation.
>
> +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
> +  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
> +  APIs(taking all platforms into consideration). rte_io_*mb APIs should be used

Nit: missing space.


> +  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
> +  20.11 release.
> +
>  * igb_uio: In the view of reducing the kernel dependency from the main tree,
>    as a first step, the Technical Board decided to move ``igb_uio``
>    kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory
> --
> 2.17.1
>

LGTM.

We need 3 acks (ideally from different vendors/companies) for a
deprecation notice.
Please maintainers?


Thanks.

-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/3] doc: update armv8-a IO barrier changes
  2020-07-07  8:36       ` David Marchand
@ 2020-07-07 18:37         ` Honnappa Nagarahalli
  0 siblings, 0 replies; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-07 18:37 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Ruifeng Wang, jerinj, hemant.agrawal,
	Ajit Khaparde (ajit.khaparde@broadcom.com),
	Igor Chauskin, thomas, Viacheslav Ovsiienko, Andrew Rybchenko,
	Bruce Richardson, nd, Honnappa Nagarahalli, nd

<snip>

> Subject: Re: [dpdk-dev] [PATCH v4 2/3] doc: update armv8-a IO barrier
> changes
> 
> On Tue, Jul 7, 2020 at 1:44 AM Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com> wrote:
> >
> > Updated the use of DMB instruction in rte_*mb APIs for Armv8-a.
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Acked-by: Jerin Jacob <jerinj@marvell.com>
> > ---
> >  doc/guides/rel_notes/release_20_08.rst | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/doc/guides/rel_notes/release_20_08.rst
> > b/doc/guides/rel_notes/release_20_08.rst
> > index 5cbc4ce14..567ae6b2a 100644
> > --- a/doc/guides/rel_notes/release_20_08.rst
> > +++ b/doc/guides/rel_notes/release_20_08.rst
> > @@ -56,6 +56,13 @@ New Features
> >       Also, make sure to start the actual text at the margin.
> >       =========================================================
> >
> 
> This release note update will be squashed with the change itself in patch 1.
> 
> > +* **rte_*mb APIs are updated to use DMB instruction for Armv8-a.**
> 
> We use "ARMv8" in the release notes, any objection if I update this when
> applying?
No objections.

> 
> 
> > +
> > +  Armv8-a memory model has been strengthened to require
> > + other-multi-copy  atomicity. This allows for using DMB instruction
> > + instead of DSB for IO  barriers. rte_*mb APIs, for Armv8-a
> > + platforms, are changed to use DMB  instruction to reflect this.
> > +
> >  * **Updated PCAP driver.**
> >
> >    Updated PCAP driver with new features and improvements, including:
> > --
> > 2.17.1
> >
> 
> 
> --
> David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs
  2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
  2020-07-07  8:39       ` David Marchand
@ 2020-07-07 20:14       ` David Christensen
  2020-07-08 11:49       ` David Marchand
  2 siblings, 0 replies; 70+ messages in thread
From: David Christensen @ 2020-07-07 20:14 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, ruifeng.wang, jerinj, hemant.agrawal,
	ajit.khaparde, igorch, thomas, viacheslavo, arybchenko,
	bruce.richardson
  Cc: nd



On 7/6/20 4:43 PM, Honnappa Nagarahalli wrote:
> rte_cio_*mb APIs will be deprecated in 20.11 release.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
>   doc/guides/rel_notes/deprecation.rst | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index d1034f60f..59656da3d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -40,6 +40,12 @@ Deprecation Notices
>     These wrappers must be used for patches that need to be merged in 20.08
>     onwards. This change will not introduce any performance degradation.
> 
> +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
> +  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
> +  APIs(taking all platforms into consideration). rte_io_*mb APIs should be used
> +  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
> +  20.11 release.
> +

No difference between rte_cio_* and rte_io_* macros on PPC.

Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs
  2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
  2020-07-05  0:57       ` Jerin Jacob
@ 2020-07-07 20:19       ` Ajit Khaparde
  2020-07-08 11:05       ` Ananyev, Konstantin
  2 siblings, 0 replies; 70+ messages in thread
From: Ajit Khaparde @ 2020-07-07 20:19 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: dpdk-dev, ruifeng.wang, Jerin Jacob Kollanukkaran,
	Hemant Agrawal, igorch, Thomas Monjalon, Viacheslav Ovsiienko,
	Andrew Rybchenko, Bruce Richardson, nd

On Fri, Jul 3, 2020 at 11:58 AM Honnappa Nagarahalli <
honnappa.nagarahalli@arm.com> wrote:

> rte_cio_*mb APIs will be deprecated in 20.11 release.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

---
>  doc/guides/rel_notes/deprecation.rst | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index d1034f60f..59656da3d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -40,6 +40,12 @@ Deprecation Notices
>    These wrappers must be used for patches that need to be merged in 20.08
>    onwards. This change will not introduce any performance degradation.
>
> +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed
> from DSB
> +  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
> +  APIs(taking all platforms into consideration). rte_io_*mb APIs should
> be used
> +  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be
> deprecated in
> +  20.11 release.
> +
>  * igb_uio: In the view of reducing the kernel dependency from the main
> tree,
>    as a first step, the Technical Board decided to move ``igb_uio``
>    kernel module to the dpdk-kmods repository in the /linux/igb_uio/
> directory
> --
> 2.17.1
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs
  2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
  2020-07-05  0:57       ` Jerin Jacob
  2020-07-07 20:19       ` Ajit Khaparde
@ 2020-07-08 11:05       ` Ananyev, Konstantin
  2 siblings, 0 replies; 70+ messages in thread
From: Ananyev, Konstantin @ 2020-07-08 11:05 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, ruifeng.wang, jerinj, hemant.agrawal,
	ajit.khaparde, igorch, thomas, viacheslavo, arybchenko,
	Richardson, Bruce
  Cc: nd



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Honnappa Nagarahalli
> Sent: Friday, July 3, 2020 7:58 PM
> To: dev@dpdk.org; honnappa.nagarahalli@arm.com; ruifeng.wang@arm.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> ajit.khaparde@broadcom.com; igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com; arybchenko@solarflare.com;
> Richardson, Bruce <bruce.richardson@intel.com>
> Cc: nd@arm.com
> Subject: [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs
> 
> rte_cio_*mb APIs will be deprecated in 20.11 release.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  doc/guides/rel_notes/deprecation.rst | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index d1034f60f..59656da3d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -40,6 +40,12 @@ Deprecation Notices
>    These wrappers must be used for patches that need to be merged in 20.08
>    onwards. This change will not introduce any performance degradation.
> 
> +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
> +  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
> +  APIs(taking all platforms into consideration). rte_io_*mb APIs should be used
> +  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
> +  20.11 release.
> +
>  * igb_uio: In the view of reducing the kernel dependency from the main tree,
>    as a first step, the Technical Board decided to move ``igb_uio``
>    kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs
  2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
  2020-07-07  8:39       ` David Marchand
  2020-07-07 20:14       ` David Christensen
@ 2020-07-08 11:49       ` David Marchand
  2 siblings, 0 replies; 70+ messages in thread
From: David Marchand @ 2020-07-08 11:49 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: dev, Ruifeng Wang (Arm Technology China),
	Jerin Jacob Kollanukkaran, Hemant Agrawal, Ajit Khaparde,
	Igor Chauskin, Thomas Monjalon, Viacheslav Ovsiienko,
	Andrew Rybchenko, Bruce Richardson, nd

On Tue, Jul 7, 2020 at 1:44 AM Honnappa Nagarahalli
<honnappa.nagarahalli@arm.com> wrote:
>
> rte_cio_*mb APIs will be deprecated in 20.11 release.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

Series applied, thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-06-23  8:26   ` [dpdk-dev] [PATCH v3] net/mlx5: relaxed " Phil Yang
@ 2020-07-13  3:02     ` Phil Yang
  2020-07-20 23:21       ` Alexander Kozyrev
  2020-09-03  2:53     ` [dpdk-dev] [PATCH v4] " Phil Yang
  1 sibling, 1 reply; 70+ messages in thread
From: Phil Yang @ 2020-07-13  3:02 UTC (permalink / raw)
  To: matan, shahafs, viacheslavo
  Cc: Honnappa Nagarahalli, drc, nd, Phil Yang, dev, nd

Hi,

We are also doing the C11 atomics conversion for other components.
Your insight would be much appreciated.

Thanks,
Phil Yang

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> Sent: Tuesday, June 23, 2020 4:27 PM
> To: dev@dpdk.org
> Cc: matan@mellanox.com; shahafs@mellanox.com;
> viacheslavo@mellanox.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd
> <nd@arm.com>
> Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet
> RQ buffer refcnt
> 
> Use c11 atomics with explicit ordering instead of the rte_atomic ops
> which enforce unnecessary barriers on aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> ---
> v3:
> Split from the patchset:
> http://patchwork.dpdk.org/cover/68159/
> 
>  drivers/net/mlx5/mlx5_rxq.c  |  2 +-
>  drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
>  drivers/net/mlx5/mlx5_rxtx.h |  2 +-
>  3 files changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> index dda0073..7f487f1 100644
> --- a/drivers/net/mlx5/mlx5_rxq.c
> +++ b/drivers/net/mlx5/mlx5_rxq.c
> @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp,
> void *opaque_arg,
> 
>  	memset(_m, 0, sizeof(*buf));
>  	buf->mp = mp;
> -	rte_atomic16_set(&buf->refcnt, 1);
> +	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
>  	for (j = 0; j != strd_n; ++j) {
>  		shinfo = &buf->shinfos[j];
>  		shinfo->free_cb = mlx5_mprq_buf_free_cb;
> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> index e4106bf..f0eda88 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr
> __rte_unused, void *opaque)
>  {
>  	struct mlx5_mprq_buf *buf = opaque;
> 
> -	if (rte_atomic16_read(&buf->refcnt) == 1) {
> +	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
>  		rte_mempool_put(buf->mp, buf);
> -	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> -		rte_atomic16_set(&buf->refcnt, 1);
> +	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> +					       __ATOMIC_RELAXED) == 0)) {
> +		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
>  		rte_mempool_put(buf->mp, buf);
>  	}
>  }
> @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> rte_mbuf **pkts, uint16_t pkts_n)
> 
>  		if (consumed_strd == strd_n) {
>  			/* Replace WQE only if the buffer is still in use. */
> -			if (rte_atomic16_read(&buf->refcnt) > 1) {
> +			if (__atomic_load_n(&buf->refcnt,
> +					    __ATOMIC_RELAXED) > 1) {
>  				mprq_buf_replace(rxq, rq_ci & wq_mask,
> strd_n);
>  				/* Release the old buffer. */
>  				mlx5_mprq_buf_free(buf);
> @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> rte_mbuf **pkts, uint16_t pkts_n)
>  			void *buf_addr;
> 
>  			/* Increment the refcnt of the whole chunk. */
> -			rte_atomic16_add_return(&buf->refcnt, 1);
> -			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> >refcnt) <=
> -				    strd_n + 1);
> +			__atomic_add_fetch(&buf->refcnt, 1,
> __ATOMIC_ACQUIRE);
> +			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> +				    __ATOMIC_RELAXED) <= strd_n + 1);
>  			buf_addr = RTE_PTR_SUB(addr,
> RTE_PKTMBUF_HEADROOM);
>  			/*
>  			 * MLX5 device doesn't use iova but it is necessary in
> a
> diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> index 26621ff..0fc15f3 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.h
> +++ b/drivers/net/mlx5/mlx5_rxtx.h
> @@ -78,7 +78,7 @@ struct rxq_zip {
>  /* Multi-Packet RQ buffer header. */
>  struct mlx5_mprq_buf {
>  	struct rte_mempool *mp;
> -	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> +	uint16_t refcnt; /* Atomically accessed refcnt. */
>  	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first
> packet. */
>  	struct rte_mbuf_ext_shared_info shinfos[];
>  	/*
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-13  3:02     ` Phil Yang
@ 2020-07-20 23:21       ` Alexander Kozyrev
  2020-07-21  1:55         ` Phil Yang
  2020-07-23  4:47         ` Honnappa Nagarahalli
  0 siblings, 2 replies; 70+ messages in thread
From: Alexander Kozyrev @ 2020-07-20 23:21 UTC (permalink / raw)
  To: Phil Yang, Matan Azrad, Shahaf Shuler, Slava Ovsiienko
  Cc: Honnappa Nagarahalli, drc, nd, dev, nd

Hi Phil Yang, we noticed that this patch gives us a 10% performance degradation on ARM.
x86 seems to be unaffected though. Do you know what may be the reason for this behavior?

Regards,
Alex

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> Sent: Sunday, July 12, 2020 23:02
> To: Matan Azrad <matan@mellanox.com>; Shahaf Shuler
> <shahafs@mellanox.com>; Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> drc@linux.vnet.ibm.com; nd <nd@arm.com>; Phil Yang <Phil.Yang@arm.com>;
> dev@dpdk.org; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet
> RQ buffer refcnt
> 
> Hi,
> 
> We are also doing C11 atomics converting for other components.
> Your insight would be much appreciated.
> 
> Thanks,
> Phil Yang
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> > Sent: Tuesday, June 23, 2020 4:27 PM
> > To: dev@dpdk.org
> > Cc: matan@mellanox.com; shahafs@mellanox.com;
> > viacheslavo@mellanox.com; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd
> > <nd@arm.com>
> > Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > multi-packet RQ buffer refcnt
> >
> > Use c11 atomics with explicit ordering instead of the rte_atomic ops
> > which enforce unnecessary barriers on aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > ---
> > v3:
> > Split from the patchset:
> > http://patchwork.dpdk.org/cover/68159/
> >
> >  drivers/net/mlx5/mlx5_rxq.c  |  2 +-
> >  drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> > drivers/net/mlx5/mlx5_rxtx.h |  2 +-
> >  3 files changed, 11 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> > index dda0073..7f487f1 100644
> > --- a/drivers/net/mlx5/mlx5_rxq.c
> > +++ b/drivers/net/mlx5/mlx5_rxq.c
> > @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void
> > *opaque_arg,
> >
> >  	memset(_m, 0, sizeof(*buf));
> >  	buf->mp = mp;
> > -	rte_atomic16_set(&buf->refcnt, 1);
> > +	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> >  	for (j = 0; j != strd_n; ++j) {
> >  		shinfo = &buf->shinfos[j];
> >  		shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git
> > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index
> > e4106bf..f0eda88 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx.c
> > +++ b/drivers/net/mlx5/mlx5_rxtx.c
> > @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr
> __rte_unused,
> > void *opaque)  {
> >  	struct mlx5_mprq_buf *buf = opaque;
> >
> > -	if (rte_atomic16_read(&buf->refcnt) == 1) {
> > +	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> >  		rte_mempool_put(buf->mp, buf);
> > -	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> > -		rte_atomic16_set(&buf->refcnt, 1);
> > +	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> > +					       __ATOMIC_RELAXED) == 0)) {
> > +		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> >  		rte_mempool_put(buf->mp, buf);
> >  	}
> >  }
> > @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > rte_mbuf **pkts, uint16_t pkts_n)
> >
> >  		if (consumed_strd == strd_n) {
> >  			/* Replace WQE only if the buffer is still in use. */
> > -			if (rte_atomic16_read(&buf->refcnt) > 1) {
> > +			if (__atomic_load_n(&buf->refcnt,
> > +					    __ATOMIC_RELAXED) > 1) {
> >  				mprq_buf_replace(rxq, rq_ci & wq_mask,
> strd_n);
> >  				/* Release the old buffer. */
> >  				mlx5_mprq_buf_free(buf);
> > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > rte_mbuf **pkts, uint16_t pkts_n)
> >  			void *buf_addr;
> >
> >  			/* Increment the refcnt of the whole chunk. */
> > -			rte_atomic16_add_return(&buf->refcnt, 1);
> > -			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > >refcnt) <=
> > -				    strd_n + 1);
> > +			__atomic_add_fetch(&buf->refcnt, 1,
> > __ATOMIC_ACQUIRE);
> > +			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > +				    __ATOMIC_RELAXED) <= strd_n + 1);
> >  			buf_addr = RTE_PTR_SUB(addr,
> > RTE_PKTMBUF_HEADROOM);
> >  			/*
> >  			 * MLX5 device doesn't use iova but it is necessary in a
> diff
> > --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > index 26621ff..0fc15f3 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > @@ -78,7 +78,7 @@ struct rxq_zip {
> >  /* Multi-Packet RQ buffer header. */
> >  struct mlx5_mprq_buf {
> >  	struct rte_mempool *mp;
> > -	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> > +	uint16_t refcnt; /* Atomically accessed refcnt. */
> >  	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first
> packet.
> > */
> >  	struct rte_mbuf_ext_shared_info shinfos[];
> >  	/*
> > --
> > 2.7.4


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-20 23:21       ` Alexander Kozyrev
@ 2020-07-21  1:55         ` Phil Yang
  2020-07-21  3:58           ` Alexander Kozyrev
  2020-07-23  4:47         ` Honnappa Nagarahalli
  1 sibling, 1 reply; 70+ messages in thread
From: Phil Yang @ 2020-07-21  1:55 UTC (permalink / raw)
  To: Alexander Kozyrev, Matan Azrad, Shahaf Shuler, Slava Ovsiienko
  Cc: Honnappa Nagarahalli, drc, nd, dev, nd, nd

Alexander Kozyrev <akozyrev@mellanox.com> writes:

<...>
> Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-
> packet RQ buffer refcnt
> 
> Hi Phil Yang, we noticed that this patch gives us 10% of performance
> degradation on ARM.
> x86 seems to be unaffected though. Do you know what may be the reason
> of this behavior?

Hi Alexander,

Thanks for your feedback. 
This patch removed some expensive memory barriers on aarch64, so it should give better performance.
I am not sure of the root cause of this degradation yet; I will start the investigation. We can profile this issue together.
Could you share your test case (including your testbed configuration) with us?

Thanks,
Phil
> 
> Regards,
> Alex
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> > Sent: Sunday, July 12, 2020 23:02
> > To: Matan Azrad <matan@mellanox.com>; Shahaf Shuler
> > <shahafs@mellanox.com>; Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > drc@linux.vnet.ibm.com; nd <nd@arm.com>; Phil Yang
> <Phil.Yang@arm.com>;
> > dev@dpdk.org; nd <nd@arm.com>
> > Subject: Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-
> packet
> > RQ buffer refcnt
> >
> > Hi,
> >
> > We are also doing C11 atomics converting for other components.
> > Your insight would be much appreciated.
> >
> > Thanks,
> > Phil Yang
> >
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> > > Sent: Tuesday, June 23, 2020 4:27 PM
> > > To: dev@dpdk.org
> > > Cc: matan@mellanox.com; shahafs@mellanox.com;
> > > viacheslavo@mellanox.com; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd
> > > <nd@arm.com>
> > > Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > multi-packet RQ buffer refcnt
> > >
> > > Use c11 atomics with explicit ordering instead of the rte_atomic ops
> > > which enforce unnecessary barriers on aarch64.
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > ---
> > > v3:
> > > Split from the patchset:
> > >
> > > http://patchwork.dpdk.org/cover/68159/
> > >
> > >  drivers/net/mlx5/mlx5_rxq.c  |  2 +-
> > >  drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> > > drivers/net/mlx5/mlx5_rxtx.h |  2 +-
> > >  3 files changed, 11 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> > > index dda0073..7f487f1 100644
> > > --- a/drivers/net/mlx5/mlx5_rxq.c
> > > +++ b/drivers/net/mlx5/mlx5_rxq.c
> > > @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp,
> void
> > > *opaque_arg,
> > >
> > >  	memset(_m, 0, sizeof(*buf));
> > >  	buf->mp = mp;
> > > -	rte_atomic16_set(&buf->refcnt, 1);
> > > +	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > >  	for (j = 0; j != strd_n; ++j) {
> > >  		shinfo = &buf->shinfos[j];
> > >  		shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git
> > > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index
> > > e4106bf..f0eda88 100644
> > > --- a/drivers/net/mlx5/mlx5_rxtx.c
> > > +++ b/drivers/net/mlx5/mlx5_rxtx.c
> > > @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr
> > __rte_unused,
> > > void *opaque)  {
> > >  	struct mlx5_mprq_buf *buf = opaque;
> > >
> > > -	if (rte_atomic16_read(&buf->refcnt) == 1) {
> > > +	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> > >  		rte_mempool_put(buf->mp, buf);
> > > -	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> > > -		rte_atomic16_set(&buf->refcnt, 1);
> > > +	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> > > +					       __ATOMIC_RELAXED) == 0)) {
> > > +		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > >  		rte_mempool_put(buf->mp, buf);
> > >  	}
> > >  }
> > > @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > > rte_mbuf **pkts, uint16_t pkts_n)
> > >
> > >  		if (consumed_strd == strd_n) {
> > >  			/* Replace WQE only if the buffer is still in use. */
> > > -			if (rte_atomic16_read(&buf->refcnt) > 1) {
> > > +			if (__atomic_load_n(&buf->refcnt,
> > > +					    __ATOMIC_RELAXED) > 1) {
> > >  				mprq_buf_replace(rxq, rq_ci & wq_mask,
> > strd_n);
> > >  				/* Release the old buffer. */
> > >  				mlx5_mprq_buf_free(buf);
> > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > > rte_mbuf **pkts, uint16_t pkts_n)
> > >  			void *buf_addr;
> > >
> > >  			/* Increment the refcnt of the whole chunk. */
> > > -			rte_atomic16_add_return(&buf->refcnt, 1);
> > > -			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > >refcnt) <=
> > > -				    strd_n + 1);
> > > +			__atomic_add_fetch(&buf->refcnt, 1,
> > > __ATOMIC_ACQUIRE);
> > > +			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > +				    __ATOMIC_RELAXED) <= strd_n + 1);
> > >  			buf_addr = RTE_PTR_SUB(addr,
> > > RTE_PKTMBUF_HEADROOM);
> > >  			/*
> > >  			 * MLX5 device doesn't use iova but it is necessary in
> a
> > diff
> > > --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > > index 26621ff..0fc15f3 100644
> > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > @@ -78,7 +78,7 @@ struct rxq_zip {
> > >  /* Multi-Packet RQ buffer header. */
> > >  struct mlx5_mprq_buf {
> > >  	struct rte_mempool *mp;
> > > -	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> > > +	uint16_t refcnt; /* Atomically accessed refcnt. */
> > >  	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first
> > packet.
> > > */
> > >  	struct rte_mbuf_ext_shared_info shinfos[];
> > >  	/*
> > > --
> > > 2.7.4


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-21  1:55         ` Phil Yang
@ 2020-07-21  3:58           ` Alexander Kozyrev
  2020-07-21  4:03             ` Honnappa Nagarahalli
  0 siblings, 1 reply; 70+ messages in thread
From: Alexander Kozyrev @ 2020-07-21  3:58 UTC (permalink / raw)
  To: Phil Yang, Matan Azrad, Shahaf Shuler, Slava Ovsiienko
  Cc: Honnappa Nagarahalli, drc, nd, dev, nd, nd

> Phil Yang <Phil.Yang@arm.com> writes:
> 
> <...>
> > Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > multi- packet RQ buffer refcnt
> >
> > Hi Phil Yang, we noticed that this patch gives us 10% of performance
> > degradation on ARM.
> > x86 seems to be unaffected though. Do you know what may be the reason
> > of this behavior?
> 
> Hi Alexander,
> 
> Thanks for your feedback.
> This patch removed some expensive memory barriers on aarch64, it should get
> better performance.
> I am not sure the root cause of this degradation now, I will start the
> investigation. We can profiling this issue together.
> Could you share your test case(including your testbed configuration) with us?
> 
> Thanks,
> Phil

I'm surprised too, Phil, but it looks like this patch is actually making things worse. I used a ConnectX-6 Dx on aarch64:
Linux dragon71-bf 5.4.31-mlnx.15.ge938819 #1 SMP PREEMPT Thu Jul 2 17:01:15 IDT 2020 aarch64 aarch64 aarch64 GNU/Linux
Traffic generator sends 60 bytes packets and DUT executes the following command:
arm64-bluefield-linuxapp-gcc/build/app/test-pmd/testpmd -n 4  -w 0000:03:00.1,mprq_en=1,rxqs_min_mprq=1  -w 0000:03:00.0,mprq_en=1,rxqs_min_mprq=1 -c 0xe  -- --burst=64 --mbcache=512 -i  --nb-cores=1  --rxq=1 --txq=1 --txd=256 --rxd=256  --auto-start --rss-udp
Without the patch I'm getting 3.2 Mpps, but only 2.9 Mpps when the patch is applied.

Regards,
Alex

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-21  3:58           ` Alexander Kozyrev
@ 2020-07-21  4:03             ` Honnappa Nagarahalli
  2020-07-21  4:11               ` Alexander Kozyrev
  0 siblings, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-21  4:03 UTC (permalink / raw)
  To: Alexander Kozyrev, Phil Yang, Matan Azrad, Shahaf Shuler,
	Slava Ovsiienko
  Cc: drc, nd, dev, Honnappa Nagarahalli, nd

<snip>

> > > Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > multi- packet RQ buffer refcnt
> > >
> > > Hi Phil Yang, we noticed that this patch gives us 10% of performance
> > > degradation on ARM.
> > > x86 seems to be unaffected though. Do you know what may be the
> > > reason of this behavior?
> >
> > Hi Alexander,
> >
> > Thanks for your feedback.
> > This patch removed some expensive memory barriers on aarch64, it
> > should get better performance.
> > I am not sure the root cause of this degradation now, I will start the
> > investigation. We can profiling this issue together.
> > Could you share your test case(including your testbed configuration) with
> us?
> >
> > Thanks,
> > Phil
> 
> I'm surprised too, Phil, but looks like it is actually making things worse. I used
> Connect-X 6DX on aarch64:
> Linux dragon71-bf 5.4.31-mlnx.15.ge938819 #1 SMP PREEMPT Thu Jul 2
> 17:01:15 IDT 2020 aarch64 aarch64 aarch64 GNU/Linux Traffic generator
> sends 60 bytes packets and DUT executes the following command:
> arm64-bluefield-linuxapp-gcc/build/app/test-pmd/testpmd -n 4  -w
> 0000:03:00.1,mprq_en=1,rxqs_min_mprq=1  -w
> 0000:03:00.0,mprq_en=1,rxqs_min_mprq=1 -c 0xe  -- --burst=64 --
> mbcache=512 -i  --nb-cores=1  --rxq=1 --txq=1 --txd=256 --rxd=256  --auto-
> start --rss-udp Without a patch I'm getting 3.2mpps, and only 2.9mpps when
> the patch is applied.
You are running on A72 cores, is that correct?

> 
> Regards,
> Alex

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-21  4:03             ` Honnappa Nagarahalli
@ 2020-07-21  4:11               ` Alexander Kozyrev
  2020-07-22 12:06                 ` Phil Yang
  0 siblings, 1 reply; 70+ messages in thread
From: Alexander Kozyrev @ 2020-07-21  4:11 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, Matan Azrad, Shahaf Shuler,
	Slava Ovsiienko
  Cc: drc, nd, dev, nd

<snip>
> 
> > > > Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > > multi- packet RQ buffer refcnt
> > > >
> > > > Hi Phil Yang, we noticed that this patch gives us 10% of
> > > > performance degradation on ARM.
> > > > x86 seems to be unaffected though. Do you know what may be the
> > > > reason of this behavior?
> > >
> > > Hi Alexander,
> > >
> > > Thanks for your feedback.
> > > This patch removed some expensive memory barriers on aarch64, it
> > > should get better performance.
> > > I am not sure the root cause of this degradation now, I will start
> > > the investigation. We can profiling this issue together.
> > > Could you share your test case(including your testbed configuration)
> > > with
> > us?
> > >
> > > Thanks,
> > > Phil
> >
> > I'm surprised too, Phil, but looks like it is actually making things
> > worse. I used Connect-X 6DX on aarch64:
> > Linux dragon71-bf 5.4.31-mlnx.15.ge938819 #1 SMP PREEMPT Thu Jul 2
> > 17:01:15 IDT 2020 aarch64 aarch64 aarch64 GNU/Linux Traffic generator
> > sends 60 bytes packets and DUT executes the following command:
> > arm64-bluefield-linuxapp-gcc/build/app/test-pmd/testpmd -n 4  -w
> > 0000:03:00.1,mprq_en=1,rxqs_min_mprq=1  -w
> > 0000:03:00.0,mprq_en=1,rxqs_min_mprq=1 -c 0xe  -- --burst=64 --
> > mbcache=512 -i  --nb-cores=1  --rxq=1 --txq=1 --txd=256 --rxd=256
> > --auto- start --rss-udp Without a patch I'm getting 3.2mpps, and only
> > 2.9mpps when the patch is applied.
> You are running on A72 cores, is that correct?

Correct, cat /proc/cpuinfo
processor       : 0
BogoMIPS        : 312.50
Features        : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-21  4:11               ` Alexander Kozyrev
@ 2020-07-22 12:06                 ` Phil Yang
  0 siblings, 0 replies; 70+ messages in thread
From: Phil Yang @ 2020-07-22 12:06 UTC (permalink / raw)
  To: Alexander Kozyrev, Honnappa Nagarahalli, Matan Azrad,
	Shahaf Shuler, Slava Ovsiienko
  Cc: drc, nd, dev, nd, nd

Alexander Kozyrev <akozyrev@mellanox.com> writes:

<snip>
> >
> > > > > Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > > > multi- packet RQ buffer refcnt
> > > > >
> > > > > Hi Phil Yang, we noticed that this patch gives us 10% of
> > > > > performance degradation on ARM.
> > > > > x86 seems to be unaffected though. Do you know what may be the
> > > > > reason of this behavior?
> > > >
> > > > Hi Alexander,
> > > >
> > > > Thanks for your feedback.
> > > > This patch removed some expensive memory barriers on aarch64, it
> > > > should get better performance.
> > > > I am not sure the root cause of this degradation now, I will start
> > > > the investigation. We can profiling this issue together.
> > > > Could you share your test case(including your testbed configuration)
> > > > with
> > > us?
<...>
> > >
> > > I'm surprised too, Phil, but looks like it is actually making things
> > > worse. I used Connect-X 6DX on aarch64:
> > > Linux dragon71-bf 5.4.31-mlnx.15.ge938819 #1 SMP PREEMPT Thu Jul 2
> > > 17:01:15 IDT 2020 aarch64 aarch64 aarch64 GNU/Linux Traffic generator
> > > sends 60 bytes packets and DUT executes the following command:
> > > arm64-bluefield-linuxapp-gcc/build/app/test-pmd/testpmd -n 4  -w
> > > 0000:03:00.1,mprq_en=1,rxqs_min_mprq=1  -w
> > > 0000:03:00.0,mprq_en=1,rxqs_min_mprq=1 -c 0xe  -- --burst=64 --
> > > mbcache=512 -i  --nb-cores=1  --rxq=1 --txq=1 --txd=256 --rxd=256
> > > --auto- start --rss-udp Without a patch I'm getting 3.2mpps, and only
> > > 2.9mpps when the patch is applied.
> > You are running on A72 cores, is that correct?
> 
> Correct, cat /proc/cpuinfo
> processor       : 0
> BogoMIPS        : 312.50
> Features        : fp asimd evtstrm crc32 cpuid
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant     : 0x0
> CPU part        : 0xd08
> CPU revision    : 3

Thanks a lot for your input, Alex.
With your test command line, I remeasured this patch on two different aarch64 machines, and both showed a performance improvement.

SOC#1. On ThunderX2 (with LSE support), I see a 7.6% throughput improvement.
NIC: ConnectX-6 / driver: mlx5_core / version: 5.0-1.0.0.0 / firmware-version: 20.27.1016 (MT_0000000224)

SOC#2. On N1SDP (I disabled LSE to generate A72-like instructions), I also see a slight (about 1%~2%) throughput improvement.
NIC: ConnectX-5 / driver: mlx5_core / version: 5.0-2.1.8 / firmware-version: 16.27.2008 (MT_0000000090)

Without LSE (i.e., the A72 and SOC#2 cases), atomics use the load/store-exclusive mechanism to achieve atomicity.
For example, the compiler generates the instructions below for __atomic_add_fetch:
__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
   70118:       f94037e3        ldr     x3, [sp, #104]
   7011c:       91002060        add     x0, x3, #0x8
   70120:       485ffc02        ldaxrh  w2, [x0]
   70124:       11000442        add     w2, w2, #0x1
   70128:       48057c02        stxrh   w5, w2, [x0]
   7012c:       35ffffa5        cbnz    w5, 70120 <mlx5_rx_burst_mprq+0xb48>
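
For comparison, a relaxed increment should drop the acquire semantics from
the exclusive load. Below is a hand-written sketch of the expected codegen
(for illustration only, not actual compiler output):
__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
        ldxrh   w2, [x0]        // load-exclusive, no acquire semantics
        add     w2, w2, #0x1
        stxrh   w5, w2, [x0]    // store-exclusive
        cbnz    w5, <retry>     // loop until the exclusive pair succeeds
With LSE, both orderings should instead map to a single atomic instruction
(e.g. ldaddah vs. ldaddh for this halfword case), which is the SOC#1 setup.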

In general, I think this patch will not lead to a sharp decline in performance. 
Maybe you can try other testbeds?

Thanks,
Phil





^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-20 23:21       ` Alexander Kozyrev
  2020-07-21  1:55         ` Phil Yang
@ 2020-07-23  4:47         ` Honnappa Nagarahalli
  2020-07-23  6:11           ` Phil Yang
  1 sibling, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-07-23  4:47 UTC (permalink / raw)
  To: Alexander Kozyrev, Phil Yang, Matan Azrad, Shahaf Shuler,
	Slava Ovsiienko
  Cc: drc, nd, dev, Honnappa Nagarahalli, nd

Hi Alexander,
	Thank you for testing this patch. A few comments below.

<snip>

> > Subject: Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > multi-packet RQ buffer refcnt
> >
> > Hi,
> >
> > We are also doing C11 atomics converting for other components.
> > Your insight would be much appreciated.
> >
> > Thanks,
> > Phil Yang
> >
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> > > Sent: Tuesday, June 23, 2020 4:27 PM
> > > To: dev@dpdk.org
> > > Cc: matan@mellanox.com; shahafs@mellanox.com;
> > > viacheslavo@mellanox.com; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd
> > > <nd@arm.com>
> > > Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > multi-packet RQ buffer refcnt
> > >
> > > Use c11 atomics with explicit ordering instead of the rte_atomic ops
> > > which enforce unnecessary barriers on aarch64.
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > ---
> > > v3:
> > > Split from the patchset:
> > > http://patchwork.dpdk.org/cover/68159/
> > >
> > >  drivers/net/mlx5/mlx5_rxq.c  |  2 +-
> > >  drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> > >  drivers/net/mlx5/mlx5_rxtx.h |  2 +-
> > >  3 files changed, 11 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_rxq.c
> > > b/drivers/net/mlx5/mlx5_rxq.c index dda0073..7f487f1 100644
> > > --- a/drivers/net/mlx5/mlx5_rxq.c
> > > +++ b/drivers/net/mlx5/mlx5_rxq.c
> > > @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp,
> > > void *opaque_arg,
> > >
> > >  	memset(_m, 0, sizeof(*buf));
> > >  	buf->mp = mp;
> > > -	rte_atomic16_set(&buf->refcnt, 1);
> > > +	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > >  	for (j = 0; j != strd_n; ++j) {
> > >  		shinfo = &buf->shinfos[j];
> > >  		shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git
> > > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index
> > > e4106bf..f0eda88 100644
> > > --- a/drivers/net/mlx5/mlx5_rxtx.c
> > > +++ b/drivers/net/mlx5/mlx5_rxtx.c
> > > @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr
> > __rte_unused,
> > > void *opaque)  {
> > >  	struct mlx5_mprq_buf *buf = opaque;
> > >
> > > -	if (rte_atomic16_read(&buf->refcnt) == 1) {
> > > +	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> > >  		rte_mempool_put(buf->mp, buf);
> > > -	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> > > -		rte_atomic16_set(&buf->refcnt, 1);
> > > +	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> > > +					       __ATOMIC_RELAXED) == 0)) {
> > > +		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > >  		rte_mempool_put(buf->mp, buf);
> > >  	}
> > >  }
> > > @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > > rte_mbuf **pkts, uint16_t pkts_n)
> > >
> > >  		if (consumed_strd == strd_n) {
> > >  			/* Replace WQE only if the buffer is still in use. */
> > > -			if (rte_atomic16_read(&buf->refcnt) > 1) {
> > > +			if (__atomic_load_n(&buf->refcnt,
> > > +					    __ATOMIC_RELAXED) > 1) {
> > >  				mprq_buf_replace(rxq, rq_ci & wq_mask,
> > strd_n);
> > >  				/* Release the old buffer. */
> > >  				mlx5_mprq_buf_free(buf);
> > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > > rte_mbuf **pkts, uint16_t pkts_n)
> > >  			void *buf_addr;
> > >
> > >  			/* Increment the refcnt of the whole chunk. */
> > > -			rte_atomic16_add_return(&buf->refcnt, 1);
rte_atomic16_add_return includes a full barrier along with the atomic operation. But is a full barrier required here? For example, __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED) will provide atomicity, but no barrier. Would that be enough?

> > > -			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > >refcnt) <=
> > > -				    strd_n + 1);
> > > +			__atomic_add_fetch(&buf->refcnt, 1,
> > > __ATOMIC_ACQUIRE);

Can you replace just the above line with the following lines and test it?

__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
__atomic_thread_fence(__ATOMIC_ACQ_REL);

This should make the generated code the same as before this patch. Let me know if you would prefer us to re-spin the patch instead (for testing).
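
In context, the increment site would then look roughly like this (a sketch
with the surrounding code abbreviated; MLX5_ASSERT() compiles away unless
MLX5_PMD_DEBUG is enabled):

	/* Increment the refcnt of the whole chunk. */
	__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
	/* Standalone fence to keep the full barrier that
	 * rte_atomic16_add_return() used to provide.
	 */
	__atomic_thread_fence(__ATOMIC_ACQ_REL);
	MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
		    __ATOMIC_RELAXED) <= strd_n + 1);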

> > > +			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > +				    __ATOMIC_RELAXED) <= strd_n + 1);
> > >  			buf_addr = RTE_PTR_SUB(addr,
> > > RTE_PKTMBUF_HEADROOM);
> > >  			/*
> > >  			 * MLX5 device doesn't use iova but it is necessary in a
> > diff
> > > --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > > index 26621ff..0fc15f3 100644
> > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > @@ -78,7 +78,7 @@ struct rxq_zip {
> > >  /* Multi-Packet RQ buffer header. */  struct mlx5_mprq_buf {
> > >  	struct rte_mempool *mp;
> > > -	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> > > +	uint16_t refcnt; /* Atomically accessed refcnt. */
> > >  	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first
> > packet.
> > > */
> > >  	struct rte_mbuf_ext_shared_info shinfos[];
> > >  	/*
> > > --
> > > 2.7.4


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-23  4:47         ` Honnappa Nagarahalli
@ 2020-07-23  6:11           ` Phil Yang
  2020-07-23 16:53             ` Alexander Kozyrev
  0 siblings, 1 reply; 70+ messages in thread
From: Phil Yang @ 2020-07-23  6:11 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Alexander Kozyrev, Matan Azrad,
	Shahaf Shuler, Slava Ovsiienko
  Cc: drc, nd, dev, nd, nd

<snip>
> 
> > > Subject: Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > multi-packet RQ buffer refcnt
> > >
> > > Hi,
> > >
> > > We are also doing C11 atomics converting for other components.
> > > Your insight would be much appreciated.
> > >
> > > Thanks,
> > > Phil Yang
> > >
> > > > -----Original Message-----
> > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> > > > Sent: Tuesday, June 23, 2020 4:27 PM
> > > > To: dev@dpdk.org
> > > > Cc: matan@mellanox.com; shahafs@mellanox.com;
> > > > viacheslavo@mellanox.com; Honnappa Nagarahalli
> > > > <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd
> > > > <nd@arm.com>
> > > > Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > > multi-packet RQ buffer refcnt
> > > >
> > > > Use c11 atomics with explicit ordering instead of the rte_atomic ops
> > > > which enforce unnecessary barriers on aarch64.
> > > >
> > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > ---
<...>
> > > >
> > > >  drivers/net/mlx5/mlx5_rxq.c  |  2 +-
> > > >  drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> > > >  drivers/net/mlx5/mlx5_rxtx.h |  2 +-
> > > >  3 files changed, 11 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/drivers/net/mlx5/mlx5_rxq.c
> > > > b/drivers/net/mlx5/mlx5_rxq.c index dda0073..7f487f1 100644
> > > > --- a/drivers/net/mlx5/mlx5_rxq.c
> > > > +++ b/drivers/net/mlx5/mlx5_rxq.c
> > > > @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool
> *mp,
> > > > void *opaque_arg,
> > > >
> > > >  memset(_m, 0, sizeof(*buf));
> > > >  buf->mp = mp;
> > > > -rte_atomic16_set(&buf->refcnt, 1);
> > > > +__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > >  for (j = 0; j != strd_n; ++j) {
> > > >  shinfo = &buf->shinfos[j];
> > > >  shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git
> > > > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index
> > > > e4106bf..f0eda88 100644
> > > > --- a/drivers/net/mlx5/mlx5_rxtx.c
> > > > +++ b/drivers/net/mlx5/mlx5_rxtx.c
> > > > @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr
> > > __rte_unused,
> > > > void *opaque)  {
> > > >  struct mlx5_mprq_buf *buf = opaque;
> > > >
> > > > -if (rte_atomic16_read(&buf->refcnt) == 1) {
> > > > +if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> > > >  rte_mempool_put(buf->mp, buf);
> > > > -} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> > > > -rte_atomic16_set(&buf->refcnt, 1);
> > > > +} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> > > > +       __ATOMIC_RELAXED) == 0)) {
> > > > +__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > >  rte_mempool_put(buf->mp, buf);
> > > >  }
> > > >  }
> > > > @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > > > rte_mbuf **pkts, uint16_t pkts_n)
> > > >
> > > >  if (consumed_strd == strd_n) {
> > > >  /* Replace WQE only if the buffer is still in use. */
> > > > -if (rte_atomic16_read(&buf->refcnt) > 1) {
> > > > +if (__atomic_load_n(&buf->refcnt,
> > > > +    __ATOMIC_RELAXED) > 1) {
> > > >  mprq_buf_replace(rxq, rq_ci & wq_mask,
> > > strd_n);
> > > >  /* Release the old buffer. */
> > > >  mlx5_mprq_buf_free(buf);
> > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > > > rte_mbuf **pkts, uint16_t pkts_n)
> > > >  void *buf_addr;
> > > >
> > > >  /* Increment the refcnt of the whole chunk. */
> > > > -rte_atomic16_add_return(&buf->refcnt, 1);
> rte_atomic16_add_return includes a full barrier along with atomic operation.
> But is full barrier required here? For ex: __atomic_add_fetch(&buf->refcnt, 1,
> __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would that be
> enough?
> 
> > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > > >refcnt) <=
> > > > -    strd_n + 1);
> > > > +__atomic_add_fetch(&buf->refcnt, 1,
> > > > __ATOMIC_ACQUIRE);

The atomic load in MLX5_ASSERT() accesses the same memory location as the previous __atomic_add_fetch() does.
They access this location in program order when MLX5_PMD_DEBUG is enabled, so the ACQUIRE barrier in __atomic_add_fetch() becomes unnecessary.

By changing it to RELAXED ordering, this patch got a 7.6% performance improvement on N1 (making it generate A72-like instructions).
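
That is, the increment site becomes (a sketch of exactly the change I
measured):

	/* Increment the refcnt of the whole chunk. */
	__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
	MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
		    __ATOMIC_RELAXED) <= strd_n + 1);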

Could you please also try it on your testbed, Alex?

> 
> Can you replace just the above line with the following lines and test it?
> 
> __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> __atomic_thread_fence(__ATOMIC_ACQ_REL);
> 
> This should make the generated code same as before this patch. Let me
> know if you would prefer us to re-spin the patch instead (for testing).
> 
> > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > +    __ATOMIC_RELAXED) <= strd_n + 1);
> > > >  buf_addr = RTE_PTR_SUB(addr,
> > > > RTE_PKTMBUF_HEADROOM);
> > > >  /*
> > > >   * MLX5 device doesn't use iova but it is necessary in a
> > > diff
> > > > --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > > > index 26621ff..0fc15f3 100644
> > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > > @@ -78,7 +78,7 @@ struct rxq_zip {
> > > >  /* Multi-Packet RQ buffer header. */  struct mlx5_mprq_buf {
> > > >  struct rte_mempool *mp;
> > > > -rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> > > > +uint16_t refcnt; /* Atomically accessed refcnt. */
> > > >  uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first
> > > packet.
> > > > */
> > > >  struct rte_mbuf_ext_shared_info shinfos[];
> > > >  /*
> > > > --
> > > > 2.7.4
> 


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-23  6:11           ` Phil Yang
@ 2020-07-23 16:53             ` Alexander Kozyrev
  2020-07-27 14:52               ` Phil Yang
  0 siblings, 1 reply; 70+ messages in thread
From: Alexander Kozyrev @ 2020-07-23 16:53 UTC (permalink / raw)
  To: Phil Yang, Honnappa Nagarahalli, Matan Azrad, Shahaf Shuler,
	Slava Ovsiienko
  Cc: drc, nd, dev, nd, nd

> <snip>
> >
> > > > Subject: Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > > multi-packet RQ buffer refcnt
> > > >
> > > > Hi,
> > > >
> > > > We are also doing C11 atomics converting for other components.
> > > > Your insight would be much appreciated.
> > > >
> > > > Thanks,
> > > > Phil Yang
> > > >
> > > > > -----Original Message-----
> > > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> > > > > Sent: Tuesday, June 23, 2020 4:27 PM
> > > > > To: dev@dpdk.org
> > > > > Cc: matan@mellanox.com; shahafs@mellanox.com;
> > > > > viacheslavo@mellanox.com; Honnappa Nagarahalli
> > > > > <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd
> > > > > <nd@arm.com>
> > > > > Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > > > multi-packet RQ buffer refcnt
> > > > >
> > > > > Use c11 atomics with explicit ordering instead of the rte_atomic
> > > > > ops which enforce unnecessary barriers on aarch64.
> > > > >
> > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > ---
> <...>
> > > > >
> > > > >  drivers/net/mlx5/mlx5_rxq.c  |  2 +-
> > > > >  drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> > > > >  drivers/net/mlx5/mlx5_rxtx.h |  2 +-
> > > > >  3 files changed, 11 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/mlx5/mlx5_rxq.c
> > > > > b/drivers/net/mlx5/mlx5_rxq.c index dda0073..7f487f1 100644
> > > > > --- a/drivers/net/mlx5/mlx5_rxq.c
> > > > > +++ b/drivers/net/mlx5/mlx5_rxq.c
> > > > > @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool
> > *mp,
> > > > > void *opaque_arg,
> > > > >
> > > > >  memset(_m, 0, sizeof(*buf));
> > > > >  buf->mp = mp;
> > > > > -rte_atomic16_set(&buf->refcnt, 1);
> > > > > +__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > >  for (j = 0; j != strd_n; ++j) {  shinfo = &buf->shinfos[j];
> > > > > shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git
> > > > > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> > > > > index
> > > > > e4106bf..f0eda88 100644
> > > > > --- a/drivers/net/mlx5/mlx5_rxtx.c
> > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.c
> > > > > @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr
> > > > __rte_unused,
> > > > > void *opaque)  {
> > > > >  struct mlx5_mprq_buf *buf = opaque;
> > > > >
> > > > > -if (rte_atomic16_read(&buf->refcnt) == 1) {
> > > > > +if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> > > > >  rte_mempool_put(buf->mp, buf);
> > > > > -} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> > > > > -rte_atomic16_set(&buf->refcnt, 1);
> > > > > +} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> > > > > +       __ATOMIC_RELAXED) == 0)) {
> > > > > +__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > >  rte_mempool_put(buf->mp, buf);
> > > > >  }
> > > > >  }
> > > > > @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq,
> struct
> > > > > rte_mbuf **pkts, uint16_t pkts_n)
> > > > >
> > > > >  if (consumed_strd == strd_n) {
> > > > >  /* Replace WQE only if the buffer is still in use. */ -if
> > > > > (rte_atomic16_read(&buf->refcnt) > 1) {
> > > > > +if (__atomic_load_n(&buf->refcnt,
> > > > > +    __ATOMIC_RELAXED) > 1) {
> > > > >  mprq_buf_replace(rxq, rq_ci & wq_mask,
> > > > strd_n);
> > > > >  /* Release the old buffer. */
> > > > >  mlx5_mprq_buf_free(buf);
> > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq,
> struct
> > > > > rte_mbuf **pkts, uint16_t pkts_n)  void *buf_addr;
> > > > >
> > > > >  /* Increment the refcnt of the whole chunk. */
> > > > > -rte_atomic16_add_return(&buf->refcnt, 1);
> > rte_atomic16_add_return includes a full barrier along with atomic
> operation.
> > But is full barrier required here? For ex:
> > __atomic_add_fetch(&buf->refcnt, 1,
> > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would that be
> > enough?
> >
> > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > > > >refcnt) <=
> > > > > -    strd_n + 1);
> > > > > +__atomic_add_fetch(&buf->refcnt, 1,
> > > > > __ATOMIC_ACQUIRE);
> 
> The atomic load in MLX5_ASSERT() accesses the same memory space as the
> previous __atomic_add_fetch() does.
> They will access this memory space in the program order when we enabled
> MLX5_PMD_DEBUG. So the ACQUIRE barrier in __atomic_add_fetch()
> becomes unnecessary.
> 
> By changing it to RELAXED ordering, this patch got 7.6% performance
> improvement on N1 (making it generate A72 alike instructions).
> 
> Could you please also try it on your testbed, Alex?

The situation got better with this modification; here are the results:
 - no patch:       3.0 Mpps  CPU cycles/packet=51.52
 - original patch: 2.1 Mpps  CPU cycles/packet=71.05
 - modified patch: 2.9 Mpps  CPU cycles/packet=52.79
Also, I found that the degradation is there only when I enable burst stats.
Could you please turn on the following config options and see if you can reproduce this as well?
CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y
CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y

> >
> > Can you replace just the above line with the following lines and test it?
> >
> > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > __atomic_thread_fence(__ATOMIC_ACQ_REL);
> >
> > This should make the generated code same as before this patch. Let me
> > know if you would prefer us to re-spin the patch instead (for testing).
> >
> > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > > +    __ATOMIC_RELAXED) <= strd_n + 1);
> > > > >  buf_addr = RTE_PTR_SUB(addr,
> > > > > RTE_PKTMBUF_HEADROOM);
> > > > >  /*
> > > > >   * MLX5 device doesn't use iova but it is necessary in a
> > > > diff
> > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3 100644
> > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > > > @@ -78,7 +78,7 @@ struct rxq_zip {
> > > > >  /* Multi-Packet RQ buffer header. */  struct mlx5_mprq_buf {
> > > > > struct rte_mempool *mp; -rte_atomic16_t refcnt; /* Atomically
> > > > > accessed refcnt. */
> > > > > +uint16_t refcnt; /* Atomically accessed refcnt. */
> > > > >  uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first
> > > > packet.
> > > > > */
> > > > >  struct rte_mbuf_ext_shared_info shinfos[];
> > > > >  /*
> > > > > --
> > > > > 2.7.4
> >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-23 16:53             ` Alexander Kozyrev
@ 2020-07-27 14:52               ` Phil Yang
  2020-08-06  2:43                 ` Alexander Kozyrev
  0 siblings, 1 reply; 70+ messages in thread
From: Phil Yang @ 2020-07-27 14:52 UTC (permalink / raw)
  To: Alexander Kozyrev, Honnappa Nagarahalli, Matan Azrad,
	Shahaf Shuler, Slava Ovsiienko
  Cc: drc, nd, dev, nd

Alexander Kozyrev <akozyrev@mellanox.com> writes:

<snip>

> > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq,
> > struct
> > > > > > rte_mbuf **pkts, uint16_t pkts_n)  void *buf_addr;
> > > > > >
> > > > > >  /* Increment the refcnt of the whole chunk. */
> > > > > > -rte_atomic16_add_return(&buf->refcnt, 1);
> > > rte_atomic16_add_return includes a full barrier along with atomic
> > operation.
> > > But is full barrier required here? For ex:
> > > __atomic_add_fetch(&buf->refcnt, 1,
> > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would that be
> > > enough?
> > >
> > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > > > > >refcnt) <=
> > > > > > -    strd_n + 1);
> > > > > > +__atomic_add_fetch(&buf->refcnt, 1,
> > > > > > __ATOMIC_ACQUIRE);
> >
> > The atomic load in MLX5_ASSERT() accesses the same memory space as the
> > previous __atomic_add_fetch() does.
> > They will access this memory space in the program order when we enabled
> > MLX5_PMD_DEBUG. So the ACQUIRE barrier in __atomic_add_fetch()
> > becomes unnecessary.
> >
> > By changing it to RELAXED ordering, this patch got 7.6% performance
> > improvement on N1 (making it generate A72 alike instructions).
> >
> > Could you please also try it on your testbed, Alex?
> 
> Situation got better with this modification, here are the results:
>  - no patch:             3.0 Mpps CPU cycles/packet=51.52
>  - original patch:    2.1 Mpps CPU cycles/packet=71.05
>  - modified patch: 2.9 Mpps CPU cycles/packet=52.79
> Also, I found that the degradation is there only in case I enable bursts stats.


Great! So this patch will not hurt the normal datapath performance.


> Could you please turn on the following config options and see if you can
> reproduce this as well?
> CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y
> CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y

Thanks, Alex. Some updates.

A slight (about 1%) throughput degradation was detected after we enabled these two config options on the N1 SoC.

If we look inside the perf stats results, with this patch both mlx5_rx_burst and mlx5_tx_burst consume fewer CPU cycles than the original code.
However, __memcpy_generic takes more cycles. I think that might be the reason for the increase in CPU cycles per packet after applying this patch.

Original code:
98.07%--pkt_burst_io_forward
        |
        |--44.53%--__memcpy_generic
        |
        |--35.85%--mlx5_rx_burst_mprq
        |
        |--15.94%--mlx5_tx_burst_none_empw
        |          |
        |          |--7.32%--mlx5_tx_handle_completion.isra.0
        |          |
        |           --0.50%--__memcpy_generic
        |
         --1.14%--memcpy@plt

Use C11 with RELAXED ordering:
99.36%--pkt_burst_io_forward
        |
        |--47.40%--__memcpy_generic
        |
        |--34.62%--mlx5_rx_burst_mprq
        |
        |--15.55%--mlx5_tx_burst_none_empw
        |          |
        |           --7.08%--mlx5_tx_handle_completion.isra.0
        |
         --1.17%--memcpy@plt

BTW, none of the atomic operations in this patch show up as a hotspot.
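
For reference, this kind of call-graph profile can be collected with Linux
perf along these lines (a sketch; the exact options we used may differ):

    perf record -g -- ./testpmd <args as in your command line above>
    perf report --stdio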

> 
> > >
> > > Can you replace just the above line with the following lines and test it?
> > >
> > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > __atomic_thread_fence(__ATOMIC_ACQ_REL);
> > >
> > > This should make the generated code same as before this patch. Let me
> > > know if you would prefer us to re-spin the patch instead (for testing).
> > >
> > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > > > +    __ATOMIC_RELAXED) <= strd_n + 1);
> > > > > >  buf_addr = RTE_PTR_SUB(addr,
> > > > > > RTE_PKTMBUF_HEADROOM);
> > > > > >  /*
> > > > > >   * MLX5 device doesn't use iova but it is necessary in a
> > > > > diff
> > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3 100644
> > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
<snip>
> > >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-07-27 14:52               ` Phil Yang
@ 2020-08-06  2:43                 ` Alexander Kozyrev
  2020-08-11  5:20                   ` Honnappa Nagarahalli
  0 siblings, 1 reply; 70+ messages in thread
From: Alexander Kozyrev @ 2020-08-06  2:43 UTC (permalink / raw)
  To: Phil Yang, Honnappa Nagarahalli, Matan Azrad, Shahaf Shuler,
	Slava Ovsiienko
  Cc: drc, nd, dev, nd

> Phil Yang <Phil.Yang@arm.com> writes:
> 
> <snip>
> 
> > > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq,
> > > struct
> > > > > > > rte_mbuf **pkts, uint16_t pkts_n)  void *buf_addr;
> > > > > > >
> > > > > > >  /* Increment the refcnt of the whole chunk. */
> > > > > > > -rte_atomic16_add_return(&buf->refcnt, 1);
> > > > rte_atomic16_add_return includes a full barrier along with atomic
> > > operation.
> > > > But is full barrier required here? For ex:
> > > > __atomic_add_fetch(&buf->refcnt, 1,
> > > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would that
> > > > be enough?
> > > >
> > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > > > > > >refcnt) <=
> > > > > > > -    strd_n + 1);
> > > > > > > +__atomic_add_fetch(&buf->refcnt, 1,
> > > > > > > __ATOMIC_ACQUIRE);
> > >
> > > The atomic load in MLX5_ASSERT() accesses the same memory space as
> > > the previous __atomic_add_fetch() does.
> > > They will access this memory space in the program order when we
> > > enabled MLX5_PMD_DEBUG. So the ACQUIRE barrier in
> > > __atomic_add_fetch() becomes unnecessary.
> > >
> > > By changing it to RELAXED ordering, this patch got 7.6% performance
> > > improvement on N1 (making it generate A72 alike instructions).
> > >
> > > Could you please also try it on your testbed, Alex?
> >
> > Situation got better with this modification, here are the results:
> >  - no patch:             3.0 Mpps CPU cycles/packet=51.52
> >  - original patch:    2.1 Mpps CPU cycles/packet=71.05
> >  - modified patch: 2.9 Mpps CPU cycles/packet=52.79 Also, I found that
> > the degradation is there only in case I enable bursts stats.
> 
> 
> Great! So this patch will not hurt the normal datapath performance.
> 
> 
> > Could you please turn on the following config options and see if you
> > can reproduce this as well?
> > CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y
> > CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y
> 
> Thanks, Alex. Some updates.
> 
> Slightly (about 1%) throughput degradation was detected after we enabled
> these two config options on N1 SoC.
> 
> If we look insight the perf stats results, with this patch, both mlx5_rx_burst
> and mlx5_tx_burst consume fewer CPU cycles than the original code.
> However, __memcpy_generic takes more cycles. I think that might be the
> reason for CPU cycles per packet increment after applying this patch.
> 
> Original code:
> 98.07%--pkt_burst_io_forward
>         |
>         |--44.53%--__memcpy_generic
>         |
>         |--35.85%--mlx5_rx_burst_mprq
>         |
>         |--15.94%--mlx5_tx_burst_none_empw
>         |          |
>         |          |--7.32%--mlx5_tx_handle_completion.isra.0
>         |          |
>         |           --0.50%--__memcpy_generic
>         |
>          --1.14%--memcpy@plt
> 
> Use C11 with RELAXED ordering:
> 99.36%--pkt_burst_io_forward
>         |
>         |--47.40%--__memcpy_generic
>         |
>         |--34.62%--mlx5_rx_burst_mprq
>         |
>         |--15.55%--mlx5_tx_burst_none_empw
>         |          |
>         |           --7.08%--mlx5_tx_handle_completion.isra.0
>         |
>          --1.17%--memcpy@plt
> 
> BTW, all the atomic operations in this patch are not the hotspot.

Phil, we are seeing much worse degradation on our ARM platform unfortunately.
I don't think the discrepancy in memcpy can explain this behavior.
Your patch does not touch this area of the code. Let me collect some perf stats on our side.

> 
> >
> > > >
> > > > Can you replace just the above line with the following lines and test it?
> > > >
> > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > __atomic_thread_fence(__ATOMIC_ACQ_REL);
> > > >
> > > > This should make the generated code same as before this patch. Let
> > > > me know if you would prefer us to re-spin the patch instead (for
> testing).
> > > >
> > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > > > > +    __ATOMIC_RELAXED) <= strd_n + 1);
> > > > > > >  buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM);
> > > > > > >  /*
> > > > > > >   * MLX5 device doesn't use iova but it is necessary in a
> > > > > > diff
> > > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3 100644
> > > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> <snip>
> > > >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-08-06  2:43                 ` Alexander Kozyrev
@ 2020-08-11  5:20                   ` Honnappa Nagarahalli
  2020-09-02 21:52                     ` Alexander Kozyrev
  0 siblings, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-08-11  5:20 UTC (permalink / raw)
  To: Alexander Kozyrev, Phil Yang, Matan Azrad, Shahaf Shuler,
	Slava Ovsiienko
  Cc: drc, nd, dev, Honnappa Nagarahalli, nd

<snip>

> >
> > > > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq,
> > > > struct
> > > > > > > > rte_mbuf **pkts, uint16_t pkts_n)  void *buf_addr;
> > > > > > > >
> > > > > > > >  /* Increment the refcnt of the whole chunk. */
> > > > > > > > -rte_atomic16_add_return(&buf->refcnt, 1);
> > > > > rte_atomic16_add_return includes a full barrier along with
> > > > > atomic
> > > > operation.
> > > > > But is full barrier required here? For ex:
> > > > > __atomic_add_fetch(&buf->refcnt, 1,
> > > > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would
> > > > > that be enough?
> > > > >
> > > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > > > > > > >refcnt) <=
> > > > > > > > -    strd_n + 1);
> > > > > > > > +__atomic_add_fetch(&buf->refcnt, 1,
> > > > > > > > __ATOMIC_ACQUIRE);
> > > >
> > > > The atomic load in MLX5_ASSERT() accesses the same memory space as
> > > > the previous __atomic_add_fetch() does.
> > > > They will access this memory space in the program order when we
> > > > enabled MLX5_PMD_DEBUG. So the ACQUIRE barrier in
> > > > __atomic_add_fetch() becomes unnecessary.
> > > >
> > > > By changing it to RELAXED ordering, this patch got 7.6%
> > > > performance improvement on N1 (making it generate A72 alike
> instructions).
> > > >
> > > > Could you please also try it on your testbed, Alex?
> > >
> > > Situation got better with this modification, here are the results:
> > >  - no patch:             3.0 Mpps CPU cycles/packet=51.52
> > >  - original patch:    2.1 Mpps CPU cycles/packet=71.05
> > >  - modified patch: 2.9 Mpps CPU cycles/packet=52.79 Also, I found
> > > that the degradation is there only in case I enable bursts stats.
> >
> >
> > Great! So this patch will not hurt the normal datapath performance.
> >
> >
> > > Could you please turn on the following config options and see if you
> > > can reproduce this as well?
> > > CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y
> > > CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y
> >
> > Thanks, Alex. Some updates.
> >
> > Slightly (about 1%) throughput degradation was detected after we
> > enabled these two config options on N1 SoC.
> >
> > If we look insight the perf stats results, with this patch, both
> > mlx5_rx_burst and mlx5_tx_burst consume fewer CPU cycles than the
> original code.
> > However, __memcpy_generic takes more cycles. I think that might be the
> > reason for CPU cycles per packet increment after applying this patch.
> >
> > Original code:
> > 98.07%--pkt_burst_io_forward
> >         |
> >         |--44.53%--__memcpy_generic
> >         |
> >         |--35.85%--mlx5_rx_burst_mprq
> >         |
> >         |--15.94%--mlx5_tx_burst_none_empw
> >         |          |
> >         |          |--7.32%--mlx5_tx_handle_completion.isra.0
> >         |          |
> >         |           --0.50%--__memcpy_generic
> >         |
> >          --1.14%--memcpy@plt
> >
> > Use C11 with RELAXED ordering:
> > 99.36%--pkt_burst_io_forward
> >         |
> >         |--47.40%--__memcpy_generic
> >         |
> >         |--34.62%--mlx5_rx_burst_mprq
> >         |
> >         |--15.55%--mlx5_tx_burst_none_empw
> >         |          |
> >         |           --7.08%--mlx5_tx_handle_completion.isra.0
> >         |
> >          --1.17%--memcpy@plt
> >
> > BTW, all the atomic operations in this patch are not the hotspot.
> 
> Phil, we are seeing much worse degradation on our ARM platform
> unfortunately.
> I don't think that discrepancy in memcpy can explain this behavior.
> Your patch is not touching this area of code. Let me collect some perf stat on
> our side.
Are you testing the patch as is or have you made the changes that were discussed in the thread?

> 
> >
> > >
> > > > >
> > > > > Can you replace just the above line with the following lines and test it?
> > > > >
> > > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > > __atomic_thread_fence(__ATOMIC_ACQ_REL);
> > > > >
> > > > > This should make the generated code same as before this patch.
> > > > > Let me know if you would prefer us to re-spin the patch instead
> > > > > (for
> > testing).
> > > > >
> > > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > > > > > +    __ATOMIC_RELAXED) <= strd_n + 1);
> > > > > > > >  buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM);
> > > > > > > >  /*
> > > > > > > >   * MLX5 device doesn't use iova but it is necessary in a
> > > > > > > diff
> > > > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3
> > > > > > > > 100644
> > > > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > <snip>
> > > > >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-08-11  5:20                   ` Honnappa Nagarahalli
@ 2020-09-02 21:52                     ` Alexander Kozyrev
  2020-09-03  2:55                       ` Phil Yang
  0 siblings, 1 reply; 70+ messages in thread
From: Alexander Kozyrev @ 2020-09-02 21:52 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Alexander Kozyrev, Phil Yang, Matan Azrad,
	Shahaf Shuler, Slava Ovsiienko
  Cc: drc, nd, dev, nd

> <snip>
> 
> > >
> > > > > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void
> *dpdk_rxq,
> > > > > struct
> > > > > > > > > rte_mbuf **pkts, uint16_t pkts_n)  void *buf_addr;
> > > > > > > > >
> > > > > > > > >  /* Increment the refcnt of the whole chunk. */
> > > > > > > > > -rte_atomic16_add_return(&buf->refcnt, 1);
> > > > > > rte_atomic16_add_return includes a full barrier along with
> > > > > > atomic
> > > > > operation.
> > > > > > But is full barrier required here? For ex:
> > > > > > __atomic_add_fetch(&buf->refcnt, 1,
> > > > > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would
> > > > > > that be enough?
> > > > > >
> > > > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > > > > > > > >refcnt) <=
> > > > > > > > > -    strd_n + 1);
> > > > > > > > > +__atomic_add_fetch(&buf->refcnt, 1,
> > > > > > > > > __ATOMIC_ACQUIRE);
> > > > >
> > > > > The atomic load in MLX5_ASSERT() accesses the same memory space
> > > > > as the previous __atomic_add_fetch() does.
> > > > > They will access this memory space in the program order when we
> > > > > enabled MLX5_PMD_DEBUG. So the ACQUIRE barrier in
> > > > > __atomic_add_fetch() becomes unnecessary.
> > > > >
> > > > > By changing it to RELAXED ordering, this patch got 7.6%
> > > > > performance improvement on N1 (making it generate A72 alike
> > instructions).
> > > > >
> > > > > Could you please also try it on your testbed, Alex?
> > > >
> > > > Situation got better with this modification, here are the results:
> > > >  - no patch:             3.0 Mpps CPU cycles/packet=51.52
> > > >  - original patch:    2.1 Mpps CPU cycles/packet=71.05
> > > >  - modified patch: 2.9 Mpps CPU cycles/packet=52.79 Also, I found
> > > > that the degradation is there only in case I enable bursts stats.
> > >
> > >
> > > Great! So this patch will not hurt the normal datapath performance.
> > >
> > >
> > > > Could you please turn on the following config options and see if
> > > > you can reproduce this as well?
> > > > CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y
> > > > CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y
> > >
> > > Thanks, Alex. Some updates.
> > >
> > > Slightly (about 1%) throughput degradation was detected after we
> > > enabled these two config options on N1 SoC.
> > >
> > > If we look insight the perf stats results, with this patch, both
> > > mlx5_rx_burst and mlx5_tx_burst consume fewer CPU cycles than the
> > original code.
> > > However, __memcpy_generic takes more cycles. I think that might be
> > > the reason for CPU cycles per packet increment after applying this patch.
> > >
> > > Original code:
> > > 98.07%--pkt_burst_io_forward
> > >         |
> > >         |--44.53%--__memcpy_generic
> > >         |
> > >         |--35.85%--mlx5_rx_burst_mprq
> > >         |
> > >         |--15.94%--mlx5_tx_burst_none_empw
> > >         |          |
> > >         |          |--7.32%--mlx5_tx_handle_completion.isra.0
> > >         |          |
> > >         |           --0.50%--__memcpy_generic
> > >         |
> > >          --1.14%--memcpy@plt
> > >
> > > Use C11 with RELAXED ordering:
> > > 99.36%--pkt_burst_io_forward
> > >         |
> > >         |--47.40%--__memcpy_generic
> > >         |
> > >         |--34.62%--mlx5_rx_burst_mprq
> > >         |
> > >         |--15.55%--mlx5_tx_burst_none_empw
> > >         |          |
> > >         |           --7.08%--mlx5_tx_handle_completion.isra.0
> > >         |
> > >          --1.17%--memcpy@plt
> > >
> > > BTW, all the atomic operations in this patch are not the hotspot.
> >
> > Phil, we are seeing much worse degradation on our ARM platform
> > unfortunately.
> > I don't think that discrepancy in memcpy can explain this behavior.
> > Your patch is not touching this area of code. Let me collect some perf
> > stat on our side.
> Are you testing the patch as is or have you made the changes that were
> discussed in the thread?
> 

Yes, I made the changes you suggested, and it really does get better with them.
Could you please re-spin the patch so I can make sure I got it right in my environment?

> >
> > >
> > > >
> > > > > >
> > > > > > Can you replace just the above line with the following lines and test
> it?
> > > > > >
> > > > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > > > __atomic_thread_fence(__ATOMIC_ACQ_REL);
> > > > > >
> > > > > > This should make the generated code same as before this patch.
> > > > > > Let me know if you would prefer us to re-spin the patch
> > > > > > instead (for
> > > testing).
> > > > > >
> > > > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > > > > > > +    __ATOMIC_RELAXED) <= strd_n + 1);
> > > > > > > > >  buf_addr = RTE_PTR_SUB(addr,
> RTE_PKTMBUF_HEADROOM);
> > > > > > > > >  /*
> > > > > > > > >   * MLX5 device doesn't use iova but it is necessary in
> > > > > > > > > a
> > > > > > > > diff
> > > > > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3
> > > > > > > > > 100644
> > > > > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > <snip>
> > > > > >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-06-23  8:26   ` [dpdk-dev] [PATCH v3] net/mlx5: relaxed " Phil Yang
  2020-07-13  3:02     ` Phil Yang
@ 2020-09-03  2:53     ` Phil Yang
  2020-09-10  1:30       ` Honnappa Nagarahalli
  2020-09-30 13:57       ` Raslan Darawsheh
  1 sibling, 2 replies; 70+ messages in thread
From: Phil Yang @ 2020-09-03  2:53 UTC (permalink / raw)
  To: akozyrev, rasland, dev
  Cc: Honnappa.Nagarahalli, Phil.Yang, matan, shahafs, viacheslavo, nd

Use c11 atomics with RELAXED ordering instead of the rte_atomic ops
which enforce unnecessary barriers on aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
---
v4:
Remove the unnecessary ACQUIRE barrier in rx burst path. (Honnappa)

v3:
Split from the patchset:
http://patchwork.dpdk.org/cover/68159/

 drivers/net/mlx5/mlx5_rxq.c  |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
 drivers/net/mlx5/mlx5_rxtx.h |  2 +-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 79eb8f8..40e0239 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -2012,7 +2012,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,
 
 	memset(_m, 0, sizeof(*buf));
 	buf->mp = mp;
-	rte_atomic16_set(&buf->refcnt, 1);
+	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 	for (j = 0; j != strd_n; ++j) {
 		shinfo = &buf->shinfos[j];
 		shinfo->free_cb = mlx5_mprq_buf_free_cb;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 1b71e94..549477b 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1626,10 +1626,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
 {
 	struct mlx5_mprq_buf *buf = opaque;
 
-	if (rte_atomic16_read(&buf->refcnt) == 1) {
+	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
 		rte_mempool_put(buf->mp, buf);
-	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
-		rte_atomic16_set(&buf->refcnt, 1);
+	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
+					       __ATOMIC_RELAXED) == 0)) {
+		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 		rte_mempool_put(buf->mp, buf);
 	}
 }
@@ -1709,7 +1710,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 
 		if (consumed_strd == strd_n) {
 			/* Replace WQE only if the buffer is still in use. */
-			if (rte_atomic16_read(&buf->refcnt) > 1) {
+			if (__atomic_load_n(&buf->refcnt,
+					    __ATOMIC_RELAXED) > 1) {
 				mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
 				/* Release the old buffer. */
 				mlx5_mprq_buf_free(buf);
@@ -1821,9 +1823,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			void *buf_addr;
 
 			/* Increment the refcnt of the whole chunk. */
-			rte_atomic16_add_return(&buf->refcnt, 1);
-			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
-				    strd_n + 1);
+			__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
+			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
+				    __ATOMIC_RELAXED) <= strd_n + 1);
 			buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM);
 			/*
 			 * MLX5 device doesn't use iova but it is necessary in a
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index c02a007..467f31d 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -68,7 +68,7 @@ struct rxq_zip {
 /* Multi-Packet RQ buffer header. */
 struct mlx5_mprq_buf {
 	struct rte_mempool *mp;
-	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
+	uint16_t refcnt; /* Atomically accessed refcnt. */
 	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
 	struct rte_mbuf_ext_shared_info shinfos[];
 	/*
-- 
2.7.4


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-09-02 21:52                     ` Alexander Kozyrev
@ 2020-09-03  2:55                       ` Phil Yang
  2020-09-09 13:29                         ` Alexander Kozyrev
  0 siblings, 1 reply; 70+ messages in thread
From: Phil Yang @ 2020-09-03  2:55 UTC (permalink / raw)
  To: Alexander Kozyrev, Honnappa Nagarahalli, Alexander Kozyrev,
	Matan Azrad, Shahaf Shuler, Slava Ovsiienko
  Cc: drc, nd, dev, nd, nd

<snip>
> >
> > > >
> > > > > > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void
> > *dpdk_rxq,
> > > > > > struct
> > > > > > > > > > rte_mbuf **pkts, uint16_t pkts_n)  void *buf_addr;
> > > > > > > > > >
> > > > > > > > > >  /* Increment the refcnt of the whole chunk. */
> > > > > > > > > > -rte_atomic16_add_return(&buf->refcnt, 1);
> > > > > > > rte_atomic16_add_return includes a full barrier along with
> > > > > > > atomic
> > > > > > operation.
> > > > > > > But is full barrier required here? For ex:
> > > > > > > __atomic_add_fetch(&buf->refcnt, 1,
> > > > > > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would
> > > > > > > that be enough?
> > > > > > >
> > > > > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > > > > > > > > >refcnt) <=
> > > > > > > > > > -    strd_n + 1);
> > > > > > > > > > +__atomic_add_fetch(&buf->refcnt, 1,
> > > > > > > > > > __ATOMIC_ACQUIRE);
> > > > > >
> > > > > > The atomic load in MLX5_ASSERT() accesses the same memory
> space
> > > > > > as the previous __atomic_add_fetch() does.
> > > > > > They will access this memory space in the program order when we
> > > > > > enabled MLX5_PMD_DEBUG. So the ACQUIRE barrier in
> > > > > > __atomic_add_fetch() becomes unnecessary.
> > > > > >
> > > > > > By changing it to RELAXED ordering, this patch got 7.6%
> > > > > > performance improvement on N1 (making it generate A72 alike
> > > instructions).
> > > > > >
> > > > > > Could you please also try it on your testbed, Alex?
> > > > >
> > > > > Situation got better with this modification, here are the results:
> > > > >  - no patch:             3.0 Mpps CPU cycles/packet=51.52
> > > > >  - original patch:    2.1 Mpps CPU cycles/packet=71.05
> > > > >  - modified patch: 2.9 Mpps CPU cycles/packet=52.79 Also, I found
> > > > > that the degradation is there only in case I enable bursts stats.
> > > >
> > > >
> > > > Great! So this patch will not hurt the normal datapath performance.
> > > >
> > > >
> > > > > Could you please turn on the following config options and see if
> > > > > you can reproduce this as well?
> > > > > CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y
> > > > > CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y
> > > >
> > > > Thanks, Alex. Some updates.
> > > >
> > > > Slightly (about 1%) throughput degradation was detected after we
> > > > enabled these two config options on N1 SoC.
> > > >
> > > > If we look insight the perf stats results, with this patch, both
> > > > mlx5_rx_burst and mlx5_tx_burst consume fewer CPU cycles than the
> > > original code.
> > > > However, __memcpy_generic takes more cycles. I think that might be
> > > > the reason for CPU cycles per packet increment after applying this
> patch.
> > > >
> > > > Original code:
> > > > 98.07%--pkt_burst_io_forward
> > > >         |
> > > >         |--44.53%--__memcpy_generic
> > > >         |
> > > >         |--35.85%--mlx5_rx_burst_mprq
> > > >         |
> > > >         |--15.94%--mlx5_tx_burst_none_empw
> > > >         |          |
> > > >         |          |--7.32%--mlx5_tx_handle_completion.isra.0
> > > >         |          |
> > > >         |           --0.50%--__memcpy_generic
> > > >         |
> > > >          --1.14%--memcpy@plt
> > > >
> > > > Use C11 with RELAXED ordering:
> > > > 99.36%--pkt_burst_io_forward
> > > >         |
> > > >         |--47.40%--__memcpy_generic
> > > >         |
> > > >         |--34.62%--mlx5_rx_burst_mprq
> > > >         |
> > > >         |--15.55%--mlx5_tx_burst_none_empw
> > > >         |          |
> > > >         |           --7.08%--mlx5_tx_handle_completion.isra.0
> > > >         |
> > > >          --1.17%--memcpy@plt
> > > >
> > > > BTW, all the atomic operations in this patch are not the hotspot.
> > >
> > > Phil, we are seeing much worse degradation on our ARM platform
> > > unfortunately.
> > > I don't think that discrepancy in memcpy can explain this behavior.
> > > Your patch is not touching this area of code. Let me collect some perf
> > > stat on our side.
> > Are you testing the patch as is or have you made the changes that were
> > discussed in the thread?
> >
> 
> Yes, I made the changes you suggested. It really gets better with them.
> Could you please respin the patch to make sure I got it right in my
> environment?

Thanks, Alex.
Please check the new version here.
http://patchwork.dpdk.org/patch/76335/


> 
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > Can you replace just the above line with the following lines and
> test
> > it?
> > > > > > >
> > > > > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > > > > __atomic_thread_fence(__ATOMIC_ACQ_REL);
> > > > > > >
> > > > > > > This should make the generated code same as before this patch.
> > > > > > > Let me know if you would prefer us to re-spin the patch
> > > > > > > instead (for
> > > > testing).
> > > > > > >
> > > > > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > > > > > > > +    __ATOMIC_RELAXED) <= strd_n + 1);
> > > > > > > > > >  buf_addr = RTE_PTR_SUB(addr,
> > RTE_PKTMBUF_HEADROOM);
> > > > > > > > > >  /*
> > > > > > > > > >   * MLX5 device doesn't use iova but it is necessary in
> > > > > > > > > > a
> > > > > > > > > diff
> > > > > > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3
> > > > > > > > > > 100644
> > > > > > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > > <snip>
> > > > > > >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-09-03  2:55                       ` Phil Yang
@ 2020-09-09 13:29                         ` Alexander Kozyrev
  2020-09-10  1:34                           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 70+ messages in thread
From: Alexander Kozyrev @ 2020-09-09 13:29 UTC (permalink / raw)
  To: Phil Yang, Honnappa Nagarahalli, Alexander Kozyrev, Matan Azrad,
	Shahaf Shuler, Slava Ovsiienko
  Cc: drc, nd, dev, nd, nd

> <snip>
> > >
> > > > >
> > > > > > > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void
> > > *dpdk_rxq,
> > > > > > > struct
> > > > > > > > > > > rte_mbuf **pkts, uint16_t pkts_n)  void *buf_addr;
> > > > > > > > > > >
> > > > > > > > > > >  /* Increment the refcnt of the whole chunk. */
> > > > > > > > > > > -rte_atomic16_add_return(&buf->refcnt, 1);
> > > > > > > > rte_atomic16_add_return includes a full barrier along with
> > > > > > > > atomic
> > > > > > > operation.
> > > > > > > > But is full barrier required here? For ex:
> > > > > > > > __atomic_add_fetch(&buf->refcnt, 1,
> > > > > > > > __ATOMIC_RELAXED) will offer atomicity, but no barrier.
> > > > > > > > Would that be enough?
> > > > > > > >
> > > > > > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > > > > > > > > > >refcnt) <=
> > > > > > > > > > > -    strd_n + 1);
> > > > > > > > > > > +__atomic_add_fetch(&buf->refcnt, 1,
> > > > > > > > > > > __ATOMIC_ACQUIRE);
> > > > > > >
> > > > > > > The atomic load in MLX5_ASSERT() accesses the same memory
> > space
> > > > > > > as the previous __atomic_add_fetch() does.
> > > > > > > They will access this memory space in the program order when
> > > > > > > we enabled MLX5_PMD_DEBUG. So the ACQUIRE barrier in
> > > > > > > __atomic_add_fetch() becomes unnecessary.
> > > > > > >
> > > > > > > By changing it to RELAXED ordering, this patch got 7.6%
> > > > > > > performance improvement on N1 (making it generate A72 alike
> > > > instructions).
> > > > > > >
> > > > > > > Could you please also try it on your testbed, Alex?
> > > > > >
> > > > > > Situation got better with this modification, here are the results:
> > > > > >  - no patch:             3.0 Mpps CPU cycles/packet=51.52
> > > > > >  - original patch:    2.1 Mpps CPU cycles/packet=71.05
> > > > > >  - modified patch: 2.9 Mpps CPU cycles/packet=52.79 Also, I
> > > > > > found that the degradation is there only in case I enable bursts stats.
> > > > >
> > > > >
> > > > > Great! So this patch will not hurt the normal datapath performance.
> > > > >
> > > > >
> > > > > > Could you please turn on the following config options and see
> > > > > > if you can reproduce this as well?
> > > > > > CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y
> > > > > > CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y
> > > > >
> > > > > Thanks, Alex. Some updates.
> > > > >
> > > > > Slightly (about 1%) throughput degradation was detected after we
> > > > > enabled these two config options on N1 SoC.
> > > > >
> > > > > If we look insight the perf stats results, with this patch, both
> > > > > mlx5_rx_burst and mlx5_tx_burst consume fewer CPU cycles than
> > > > > the
> > > > original code.
> > > > > However, __memcpy_generic takes more cycles. I think that might
> > > > > be the reason for CPU cycles per packet increment after applying
> > > > > this
> > patch.
> > > > >
> > > > > Original code:
> > > > > 98.07%--pkt_burst_io_forward
> > > > >         |
> > > > >         |--44.53%--__memcpy_generic
> > > > >         |
> > > > >         |--35.85%--mlx5_rx_burst_mprq
> > > > >         |
> > > > >         |--15.94%--mlx5_tx_burst_none_empw
> > > > >         |          |
> > > > >         |          |--7.32%--mlx5_tx_handle_completion.isra.0
> > > > >         |          |
> > > > >         |           --0.50%--__memcpy_generic
> > > > >         |
> > > > >          --1.14%--memcpy@plt
> > > > >
> > > > > Use C11 with RELAXED ordering:
> > > > > 99.36%--pkt_burst_io_forward
> > > > >         |
> > > > >         |--47.40%--__memcpy_generic
> > > > >         |
> > > > >         |--34.62%--mlx5_rx_burst_mprq
> > > > >         |
> > > > >         |--15.55%--mlx5_tx_burst_none_empw
> > > > >         |          |
> > > > >         |           --7.08%--mlx5_tx_handle_completion.isra.0
> > > > >         |
> > > > >          --1.17%--memcpy@plt
> > > > >
> > > > > BTW, all the atomic operations in this patch are not the hotspot.
> > > >
> > > > Phil, we are seeing much worse degradation on our ARM platform
> > > > unfortunately.
> > > > I don't think that discrepancy in memcpy can explain this behavior.
> > > > Your patch is not touching this area of code. Let me collect some
> > > > perf stat on our side.
> > > Are you testing the patch as is or have you made the changes that
> > > were discussed in the thread?
> > >
> >
> > Yes, I made the changes you suggested. It really gets better with them.
> > Could you please respin the patch to make sure I got it right in my
> > environment?
> 
> Thanks, Alex.
> Please check the new version here.
> http://patchwork.dpdk.org/patch/76335/

This patch is definitely better; I do not see a degradation anymore, thank you.

Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>

> 
> >
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > Can you replace just the above line with the following
> > > > > > > > lines and test it?
> > > > > > > >
> > > > > > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > > > > > __atomic_thread_fence(__ATOMIC_ACQ_REL);
> > > > > > > >
> > > > > > > > This should make the generated code the same as before this patch.
> > > > > > > > Let me know if you would prefer us to re-spin the patch
> > > > > > > > instead (for testing).
> > > > > > > >
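> > > > > > > > A standalone sketch of what this amounts to (hold_buf and
> > > > > > > > the bare refcnt are illustrative names, not the driver code):
> > > > > > > >
> > > > > > > > #include <stdint.h>
> > > > > > > >
> > > > > > > > static uint16_t refcnt;
> > > > > > > >
> > > > > > > > static void hold_buf(void)
> > > > > > > > {
> > > > > > > > 	/* Keep the RMW itself relaxed... */
> > > > > > > > 	__atomic_add_fetch(&refcnt, 1, __ATOMIC_RELAXED);
> > > > > > > > 	/* ...and supply the ordering with an explicit fence,
> > > > > > > > 	 * which should match the code generated before this
> > > > > > > > 	 * patch. */
> > > > > > > > 	__atomic_thread_fence(__ATOMIC_ACQ_REL);
> > > > > > > > }
> > > > > > > >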
> > > > > > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > > > > > > > > +    __ATOMIC_RELAXED) <= strd_n + 1);
> > > > > > > > > > >  buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM);
> > > > > > > > > > >  /*
> > > > > > > > > > >   * MLX5 device doesn't use iova but it is necessary in a
> > > > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > > > > index 26621ff..0fc15f3 100644
> > > > > > > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > > > <snip>
> > > > > > > >


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-09-03  2:53     ` [dpdk-dev] [PATCH v4] " Phil Yang
@ 2020-09-10  1:30       ` Honnappa Nagarahalli
  2020-09-10  1:36         ` Alexander Kozyrev
  2020-09-30 13:57       ` Raslan Darawsheh
  1 sibling, 1 reply; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-09-10  1:30 UTC (permalink / raw)
  To: Phil Yang, akozyrev, rasland, dev
  Cc: Phil Yang, matan, shahafs, viacheslavo, nd, Honnappa Nagarahalli, nd

<snip>

> 
> Use c11 atomics with RELAXED ordering instead of the rte_atomic ops which
> enforce unnecessary barriers on aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
Looks good.
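
For context, the relaxed pattern the patch uses can be summarized with a
small self-contained model (a sketch only: the mempool and stride
bookkeeping are elided, and buf_init/buf_attach/buf_release are made-up
names, not driver functions):

#include <stdint.h>
#include <stdbool.h>

static uint16_t refcnt;

static void buf_init(void)
{
	/* The buffer is still private at init time, so a relaxed store
	 * is enough (matches mlx5_mprq_buf_init in the patch). */
	__atomic_store_n(&refcnt, 1, __ATOMIC_RELAXED);
}

static void buf_attach(void)
{
	/* Taking another reference needs atomicity, not ordering. */
	__atomic_add_fetch(&refcnt, 1, __ATOMIC_RELAXED);
}

static bool buf_release(void)
{
	/* Mirrors mlx5_mprq_buf_free_cb: the last reference resets the
	 * count to 1 and returns the buffer to the pool. */
	if (__atomic_load_n(&refcnt, __ATOMIC_RELAXED) == 1)
		return true;
	if (__atomic_sub_fetch(&refcnt, 1, __ATOMIC_RELAXED) == 0) {
		__atomic_store_n(&refcnt, 1, __ATOMIC_RELAXED);
		return true;
	}
	return false;
}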

Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> ---
> v4:
> Remove the unnecessary ACQUIRE barrier in rx burst path. (Honnappa)
> 
> v3:
> Split from the patchset:
> http://patchwork.dpdk.org/cover/68159/
> 
>  drivers/net/mlx5/mlx5_rxq.c  |  2 +-
>  drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> drivers/net/mlx5/mlx5_rxtx.h |  2 +-
>  3 files changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> index 79eb8f8..40e0239 100644
> --- a/drivers/net/mlx5/mlx5_rxq.c
> +++ b/drivers/net/mlx5/mlx5_rxq.c
> @@ -2012,7 +2012,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,
> 
>  	memset(_m, 0, sizeof(*buf));
>  	buf->mp = mp;
> -	rte_atomic16_set(&buf->refcnt, 1);
> +	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
>  	for (j = 0; j != strd_n; ++j) {
>  		shinfo = &buf->shinfos[j];
>  		shinfo->free_cb = mlx5_mprq_buf_free_cb;
> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> index 1b71e94..549477b 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -1626,10 +1626,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
>  {
>  	struct mlx5_mprq_buf *buf = opaque;
> 
> -	if (rte_atomic16_read(&buf->refcnt) == 1) {
> +	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
>  		rte_mempool_put(buf->mp, buf);
> -	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> -		rte_atomic16_set(&buf->refcnt, 1);
> +	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> +					       __ATOMIC_RELAXED) == 0)) {
> +		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
>  		rte_mempool_put(buf->mp, buf);
>  	}
>  }
> @@ -1709,7 +1710,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
> 
>  		if (consumed_strd == strd_n) {
>  			/* Replace WQE only if the buffer is still in use. */
> -			if (rte_atomic16_read(&buf->refcnt) > 1) {
> +			if (__atomic_load_n(&buf->refcnt,
> +					    __ATOMIC_RELAXED) > 1) {
>  				mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
>  				/* Release the old buffer. */
>  				mlx5_mprq_buf_free(buf);
> @@ -1821,9 +1823,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
>  			void *buf_addr;
> 
>  			/* Increment the refcnt of the whole chunk. */
> -			rte_atomic16_add_return(&buf->refcnt, 1);
> -			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
> -				    strd_n + 1);
> +			__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> +			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> +				    __ATOMIC_RELAXED) <= strd_n + 1);
>  			buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM);
>  			/*
>  			 * MLX5 device doesn't use iova but it is necessary in a
> diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> index c02a007..467f31d 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.h
> +++ b/drivers/net/mlx5/mlx5_rxtx.h
> @@ -68,7 +68,7 @@ struct rxq_zip {
>  /* Multi-Packet RQ buffer header. */
>  struct mlx5_mprq_buf {
>  	struct rte_mempool *mp;
> -	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> +	uint16_t refcnt; /* Atomically accessed refcnt. */
>  	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
>  	struct rte_mbuf_ext_shared_info shinfos[];
>  	/*
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-09-09 13:29                         ` Alexander Kozyrev
@ 2020-09-10  1:34                           ` Honnappa Nagarahalli
  0 siblings, 0 replies; 70+ messages in thread
From: Honnappa Nagarahalli @ 2020-09-10  1:34 UTC (permalink / raw)
  To: Alexander Kozyrev, Phil Yang, Alexander Kozyrev, Matan Azrad,
	Shahaf Shuler, Slava Ovsiienko
  Cc: drc, nd, dev, Honnappa Nagarahalli, nd

<snip>

> > > > > Phil, we are seeing much worse degradation on our ARM platform
> > > > > unfortunately.
> > > > > I don't think that discrepancy in memcpy can explain this behavior.
> > > > > Your patch is not touching this area of code. Let me collect
> > > > > some perf stat on our side.
> > > > Are you testing the patch as is or have you made the changes that
> > > > were discussed in the thread?
> > > >
> > >
> > > Yes, I made the changes you suggested. It really gets better with them.
> > > Could you please respin the patch to make sure I got it right in my
> > > environment?
> >
> > Thanks, Alex.
> > Please check the new version here.
> >
> > http://patchwork.dpdk.org/patch/76335/
> 
> This patch is definitely better; I do not see a degradation anymore, thank you.
> 
> Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>

Can you please add your Acked-by tag in v4 (Phil provided the link above)?

<snip>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-09-10  1:30       ` Honnappa Nagarahalli
@ 2020-09-10  1:36         ` Alexander Kozyrev
  2020-09-29 15:22           ` Phil Yang
  0 siblings, 1 reply; 70+ messages in thread
From: Alexander Kozyrev @ 2020-09-10  1:36 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, akozyrev, rasland, dev
  Cc: Phil Yang, matan, Shahaf Shuler, viacheslavo, nd, nd

> <snip>
> 
> >
> > Use c11 atomics with RELAXED ordering instead of the rte_atomic ops
> > which enforce unnecessary barriers on aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> Looks good.
> 
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>

> 
> > <snip>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-09-10  1:36         ` Alexander Kozyrev
@ 2020-09-29 15:22           ` Phil Yang
  2020-09-30 12:44             ` Slava Ovsiienko
  0 siblings, 1 reply; 70+ messages in thread
From: Phil Yang @ 2020-09-29 15:22 UTC (permalink / raw)
  To: rasland, matan, Shahaf Shuler
  Cc: nd, Alexander Kozyrev, Honnappa Nagarahalli, dev, nd

Hi Raslan,

It seems that there are no more comments for this patch.
So shall we proceed further?

Thanks,
Phil Yang

> -----Original Message-----
> From: Alexander Kozyrev <akozyrev@nvidia.com>
> Sent: Thursday, September 10, 2020 9:37 AM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Phil Yang
> <Phil.Yang@arm.com>; akozyrev@mellanox.com; rasland@mellanox.com;
> dev@dpdk.org
> Cc: Phil Yang <Phil.Yang@arm.com>; matan@mellanox.com; Shahaf Shuler
> <shahafs@mellanox.com>; viacheslavo@mellanox.com; nd <nd@arm.com>;
> nd <nd@arm.com>
> Subject: RE: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ
> buffer refcnt
> 
> > <snip>
> >
> > >
> > > Use c11 atomics with RELAXED ordering instead of the rte_atomic ops
> > > which enforce unnecessary barriers on aarch64.
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Looks good.
> >
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>
> 
> >
> > > <snip>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-09-29 15:22           ` Phil Yang
@ 2020-09-30 12:44             ` Slava Ovsiienko
  2020-09-30 12:52               ` Raslan Darawsheh
  0 siblings, 1 reply; 70+ messages in thread
From: Slava Ovsiienko @ 2020-09-30 12:44 UTC (permalink / raw)
  To: Phil Yang, Raslan Darawsheh, Matan Azrad, Shahaf Shuler
  Cc: nd, Alexander Kozyrev, Honnappa Nagarahalli, dev, nd

Looks good to me, and we've checked that the patch has no performance impact.
Thank you.

Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> Sent: Tuesday, September 29, 2020 18:23
> To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> Cc: nd <nd@arm.com>; Alexander Kozyrev <akozyrev@nvidia.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-
> packet RQ buffer refcnt
> 
> Hi Raslan,
> 
> It seems that there are no more comments for this patch.
> So shall we proceed further?
> 
> Thanks,
> Phil Yang
> 
> > -----Original Message-----
> > From: Alexander Kozyrev <akozyrev@nvidia.com>
> > Sent: Thursday, September 10, 2020 9:37 AM
> > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Phil Yang
> > <Phil.Yang@arm.com>; akozyrev@mellanox.com; rasland@mellanox.com;
> > dev@dpdk.org
> > Cc: Phil Yang <Phil.Yang@arm.com>; matan@mellanox.com; Shahaf Shuler
> > <shahafs@mellanox.com>; viacheslavo@mellanox.com; nd
> <nd@arm.com>; nd
> > <nd@arm.com>
> > Subject: RE: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ
> > buffer refcnt
> >
> > > <snip>
> > >
> > > >
> > > > Use c11 atomics with RELAXED ordering instead of the rte_atomic
> > > > ops which enforce unnecessary barriers on aarch64.
> > > >
> > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > Looks good.
> > >
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> > Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>
> >
> > >
> > > > <snip>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-09-30 12:44             ` Slava Ovsiienko
@ 2020-09-30 12:52               ` Raslan Darawsheh
  0 siblings, 0 replies; 70+ messages in thread
From: Raslan Darawsheh @ 2020-09-30 12:52 UTC (permalink / raw)
  To: Slava Ovsiienko, Phil Yang, Matan Azrad, Shahaf Shuler
  Cc: nd, Alexander Kozyrev, Honnappa Nagarahalli, dev, nd

Sure, I'll take the patch. I was waiting for the performance impact testing,
and since Slava has just confirmed it, we can take it. Thank you for your contribution.

Kindest regards,
Raslan Darawsheh

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Wednesday, September 30, 2020 3:45 PM
> To: Phil Yang <Phil.Yang@arm.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: nd <nd@arm.com>; Alexander Kozyrev <akozyrev@nvidia.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> nd <nd@arm.com>
> Subject: RE: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ
> buffer refcnt
> 
> Looks good to me, and we've checked that the patch has no performance impact.
> Thank you.
> 
> Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> > Sent: Tuesday, September 29, 2020 18:23
> > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > Cc: nd <nd@arm.com>; Alexander Kozyrev <akozyrev@nvidia.com>;
> > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> > nd <nd@arm.com>
> > Subject: Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-
> > packet RQ buffer refcnt
> >
> > Hi Raslan,
> >
> > It seems that there are no more comments for this patch.
> > So shall we proceed further?
> >
> > Thanks,
> > Phil Yang
> >
> > > -----Original Message-----
> > > From: Alexander Kozyrev <akozyrev@nvidia.com>
> > > Sent: Thursday, September 10, 2020 9:37 AM
> > > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Phil Yang
> > > <Phil.Yang@arm.com>; akozyrev@mellanox.com;
> rasland@mellanox.com;
> > > dev@dpdk.org
> > > Cc: Phil Yang <Phil.Yang@arm.com>; matan@mellanox.com; Shahaf
> Shuler
> > > <shahafs@mellanox.com>; viacheslavo@mellanox.com; nd
> > <nd@arm.com>; nd
> > > <nd@arm.com>
> > > Subject: RE: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ
> > > buffer refcnt
> > >
> > > > <snip>
> > > >
> > > > >
> > > > > Use c11 atomics with RELAXED ordering instead of the rte_atomic
> > > > > ops which enforce unnecessary barriers on aarch64.
> > > > >
> > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > Looks good.
> > > >
> > > > Reviewed-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> > >
> > > Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>
> > >
> > > >
> > > > > <snip>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt
  2020-09-03  2:53     ` [dpdk-dev] [PATCH v4] " Phil Yang
  2020-09-10  1:30       ` Honnappa Nagarahalli
@ 2020-09-30 13:57       ` Raslan Darawsheh
  1 sibling, 0 replies; 70+ messages in thread
From: Raslan Darawsheh @ 2020-09-30 13:57 UTC (permalink / raw)
  To: Phil Yang, akozyrev, rasland, dev
  Cc: Honnappa.Nagarahalli, matan, Shahaf Shuler, viacheslavo, nd

Hi,

> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Thursday, September 3, 2020 5:53 AM
> To: akozyrev@mellanox.com; rasland@mellanox.com; dev@dpdk.org
> Cc: Honnappa.Nagarahalli@arm.com; Phil.Yang@arm.com;
> matan@mellanox.com; Shahaf Shuler <shahafs@mellanox.com>;
> viacheslavo@mellanox.com; nd@arm.com
> Subject: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ buffer
> refcnt
> 
> Use c11 atomics with RELAXED ordering instead of the rte_atomic ops
> which enforce unnecessary barriers on aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> ---
> v4:
> Remove the unnecessary ACQUIRE barrier in rx burst path. (Honnappa)
> 
> v3:
> Split from the patchset:
> http://patchwork.dpdk.org/cover/68159/
> 
>  drivers/net/mlx5/mlx5_rxq.c  |  2 +-
>  drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
>  drivers/net/mlx5/mlx5_rxtx.h |  2 +-
>  3 files changed, 11 insertions(+), 9 deletions(-)
> 

Patch applied to next-net-mlx,

Kindest regards,
Raslan Darawsheh

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2020-09-30 13:57 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-13 12:38 [dpdk-dev] [PATCH RFC v1 0/6] barrier fix and optimization for mlx5 on aarch64 Gavin Hu
2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 1/6] net/mlx5: relax the barrier for UAR write Gavin Hu
2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 2/6] net/mlx5: use cio barrier before the BF WQE Gavin Hu
2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 3/6] net/mlx5: add missing barrier Gavin Hu
2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 4/6] net/mlx5: add descriptive comment for a barrier Gavin Hu
2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 5/6] net/mlx5: non-cacheable mapping defaulted for aarch64 Gavin Hu
2020-02-13 12:38 ` [dpdk-dev] [PATCH RFC v1 6/6] net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt Gavin Hu
2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and use it for mlx5 PMD Gavin Hu
2020-04-10 17:20   ` Andrew Rybchenko
2020-04-11  3:46     ` Gavin Hu
2020-04-13  9:51       ` Andrew Rybchenko
2020-04-13 16:46         ` Gavin Hu
2020-05-11 18:06   ` [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a Honnappa Nagarahalli
2020-05-12  6:18     ` Ruifeng Wang
2020-05-12  6:42       ` Jerin Jacob
2020-05-12  8:02         ` Ruifeng Wang
2020-05-12  8:28           ` Jerin Jacob
2020-05-12 21:44           ` Honnappa Nagarahalli
2020-05-13 14:49             ` Jerin Jacob
2020-05-14  1:02               ` Honnappa Nagarahalli
2020-06-27 19:12   ` [dpdk-dev] [PATCH v2] " Honnappa Nagarahalli
2020-06-27 19:25     ` Honnappa Nagarahalli
2020-06-30  5:13       ` Jerin Jacob
2020-07-03 18:57   ` [dpdk-dev] [PATCH v3 1/3] " Honnappa Nagarahalli
2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 2/3] doc: update armv8-a IO barrier changes Honnappa Nagarahalli
2020-07-05  0:57       ` Jerin Jacob
2020-07-03 18:57     ` [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
2020-07-05  0:57       ` Jerin Jacob
2020-07-07 20:19       ` Ajit Khaparde
2020-07-08 11:05       ` Ananyev, Konstantin
2020-07-06 23:43   ` [dpdk-dev] [PATCH v4 1/3] eal: adjust barriers for IO on Armv8-a Honnappa Nagarahalli
2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 2/3] doc: update armv8-a IO barrier changes Honnappa Nagarahalli
2020-07-07  8:36       ` David Marchand
2020-07-07 18:37         ` Honnappa Nagarahalli
2020-07-06 23:43     ` [dpdk-dev] [PATCH v4 3/3] doc: update deprecation of CIO barrier APIs Honnappa Nagarahalli
2020-07-07  8:39       ` David Marchand
2020-07-07 20:14       ` David Christensen
2020-07-08 11:49       ` David Marchand
2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 1/7] eal: introduce new class of barriers for DMA use cases Gavin Hu
2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 2/7] net/mlx5: dmb for immediate doorbell ring on aarch64 Gavin Hu
2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 3/7] net/mlx5: relax barrier to order UAR writes " Gavin Hu
2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 4/7] net/mlx5: relax barrier for aarch64 Gavin Hu
2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 5/7] net/mlx5: add descriptive comment for a barrier Gavin Hu
2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 6/7] net/mlx5: relax ordering for multi-packet RQ buffer refcnt Gavin Hu
2020-06-23  8:26   ` [dpdk-dev] [PATCH v3] net/mlx5: relaxed " Phil Yang
2020-07-13  3:02     ` Phil Yang
2020-07-20 23:21       ` Alexander Kozyrev
2020-07-21  1:55         ` Phil Yang
2020-07-21  3:58           ` Alexander Kozyrev
2020-07-21  4:03             ` Honnappa Nagarahalli
2020-07-21  4:11               ` Alexander Kozyrev
2020-07-22 12:06                 ` Phil Yang
2020-07-23  4:47         ` Honnappa Nagarahalli
2020-07-23  6:11           ` Phil Yang
2020-07-23 16:53             ` Alexander Kozyrev
2020-07-27 14:52               ` Phil Yang
2020-08-06  2:43                 ` Alexander Kozyrev
2020-08-11  5:20                   ` Honnappa Nagarahalli
2020-09-02 21:52                     ` Alexander Kozyrev
2020-09-03  2:55                       ` Phil Yang
2020-09-09 13:29                         ` Alexander Kozyrev
2020-09-10  1:34                           ` Honnappa Nagarahalli
2020-09-03  2:53     ` [dpdk-dev] [PATCH v4] " Phil Yang
2020-09-10  1:30       ` Honnappa Nagarahalli
2020-09-10  1:36         ` Alexander Kozyrev
2020-09-29 15:22           ` Phil Yang
2020-09-30 12:44             ` Slava Ovsiienko
2020-09-30 12:52               ` Raslan Darawsheh
2020-09-30 13:57       ` Raslan Darawsheh
2020-04-10 16:41 ` [dpdk-dev] [PATCH RFC v2 7/7] doc: clarify one configuration in mlx5 guide Gavin Hu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).