Using just-sufficient barriers really matters to performance. Insufficient barriers cause correctness issues, while barriers stronger than required, especially in the fast path, are a performance killer. In joint preliminary testing between Arm and Ampere, an 8%~13% performance improvement was measured.

Gavin Hu (5):
  net/mlx5: relax the barrier for UAR write
  net/mlx5: use cio barrier before the BF WQE
  net/mlx5: add missing barrier
  net/mlx5: add descriptive comment for a barrier
  net/mlx5: non-cacheable mapping defaulted for aarch64

Phil Yang (1):
  net/mlx5: relaxed ordering for multi-packet RQ buffer refcnt

 drivers/net/mlx5/mlx5_rxq.c  |  5 +++--
 drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
 drivers/net/mlx5/mlx5_rxtx.h | 11 ++++++++---
 drivers/net/mlx5/mlx5_txq.c  |  4 ++++
 4 files changed, 24 insertions(+), 12 deletions(-)

-- 
2.17.1
The UAR is part of the PCI address space that is mapped for direct access to the HCA from the CPU. Read-write accesses to this space are strongly ordered, thus a compiler barrier is sufficient on all arches.

This patch set is based on the following aarch64 architectural facts:
1. The PCI BAR space is mapped as nGnRE device memory, neither cacheable nor write-combine.
2. I/O accesses to a single device are totally ordered.

Fixes: 6bf10ab69be0 ("net/mlx5: support 32-bit systems")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 939778aa5..50b3cc3c9 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -546,7 +546,7 @@ __mlx5_uar_write64_relaxed(uint64_t val, void *addr,
 static __rte_always_inline void
 __mlx5_uar_write64(uint64_t val, void *addr, rte_spinlock_t *lock)
 {
-	rte_io_wmb();
+	rte_compiler_barrier();
 	__mlx5_uar_write64_relaxed(val, addr, lock);
 }

-- 
2.17.1
To ensure the WQE and doorbell record, which reside in host memory, are visible to HW before the blue frame, a CIO barrier is sufficient; a rte_wmb is overkill.

Fixes: 6cb559d67b83 ("net/mlx5: add vectorized Rx/Tx burst for x86")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 50b3cc3c9..c672af4c4 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -658,7 +658,7 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	rte_cio_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
-	rte_wmb();
+	rte_cio_wmb();
 	mlx5_uar_write64_relaxed(*src, dst, txq->uar_lock);
 	if (cond)
 		rte_wmb();

-- 
2.17.1
To keep the order between the modification of the RX queue descriptor (rxq->cq_db) and the write to the CQ doorbell register, a rte_cio_wmb barrier is required. The situation was previously rescued by the stronger-than-required barrier inside mlx5_uar_write64; it becomes a must once that barrier is relaxed.

Fixes: 6bf10ab69be0 ("net/mlx5: support 32-bit systems")
Cc: stable@dpdk.org

Suggested-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index dc0fd8211..2d1b153a3 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -856,7 +856,8 @@ mlx5_arm_cq(struct mlx5_rxq_data *rxq, int sq_n_rxq)
 	doorbell = (uint64_t)doorbell_hi << 32;
 	doorbell |= rxq->cqn;
 	rxq->cq_db[MLX5_CQ_ARM_DB] = rte_cpu_to_be_32(doorbell_hi);
-	mlx5_uar_write64(rte_cpu_to_be_64(doorbell),
+	rte_cio_wmb();
+	mlx5_uar_write64_relaxed(rte_cpu_to_be_64(doorbell),
			 cq_db_reg, rxq->uar_lock_cq);
 }

-- 
2.17.1
The barrier would not be required, or could be moved down, if HW waited for the doorbell ring before executing the WQE. This is not the case: HW can start executing the WQE as soon as it gets the ownership (passed by SW writing the doorbell record). Add a descriptive comment for this HW-specific behavior.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index c672af4c4..d32c4e430 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -655,6 +655,11 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	uint64_t *dst = MLX5_TX_BFREG(txq);
 	volatile uint64_t *src = ((volatile uint64_t *)wqe);
 
+	/* The ownership of the WQE is passed to HW by updating the
+	 * doorbell record. Once WQE ownership has been passed to the
+	 * HCA, HW can execute the WQE. The barrier ensures the WQE is
+	 * visible to HW before HW executes it.
+	 */
 	rte_cio_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */

-- 
2.17.1
aarch64 maps PCI resources as neither 'write-combine' nor cacheable: in the Linux kernel, arch_can_pci_mmap_wc() evaluates to 0 on aarch64 [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pci-sysfs.c?h=v5.4#n1204

Fixes: f078ceb6ae93 ("net/mlx5: fix Tx doorbell write memory barrier")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_txq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index bc13abfe6..144bab4a6 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -319,7 +319,11 @@ txq_uar_ncattr_init(struct mlx5_txq_ctrl *txq_ctrl, size_t page_size)
 	off_t cmd;
 
 	txq_ctrl->txq.db_heu = priv->config.dbnc == MLX5_TXDB_HEURISTIC;
+#ifdef RTE_ARCH_ARM64
+	txq_ctrl->txq.db_nc = 1;
+#else
 	txq_ctrl->txq.db_nc = 0;
+#endif
 	/* Check the doorbell register mapping type. */
 	cmd = txq_ctrl->uar_mmap_offset / page_size;
 	cmd >>= MLX5_UAR_MMAP_CMD_SHIFT;

-- 
2.17.1
From: Phil Yang <phil.yang@arm.com>

The PMD RX queue descriptor contains two mlx5_mprq_buf fields, which are the multi-packet RQ buffer header pointers. It uses the common rte_atomic_XXX functions to make sure the refcnt access is atomic. The common rte_atomic_XXX functions are full barriers on aarch64. Optimize them with one-way barriers to improve performance.

Fixes: 7d6bf6b866b8 ("net/mlx5: add Multi-Packet Rx support")
Cc: stable@dpdk.org

Suggested-by: Gavin Hu <gavin.hu@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 drivers/net/mlx5/mlx5_rxq.c  |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
 drivers/net/mlx5/mlx5_rxtx.h |  2 +-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 2d1b153a3..765bb1af5 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1535,7 +1535,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,
 	memset(_m, 0, sizeof(*buf));
 	buf->mp = mp;
-	rte_atomic16_set(&buf->refcnt, 1);
+	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 	for (j = 0; j != strd_n; ++j) {
 		shinfo = &buf->shinfos[j];
 		shinfo->free_cb = mlx5_mprq_buf_free_cb;

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 5eea932d4..0e7519c56 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1592,10 +1592,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
 {
 	struct mlx5_mprq_buf *buf = opaque;
 
-	if (rte_atomic16_read(&buf->refcnt) == 1) {
+	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
 		rte_mempool_put(buf->mp, buf);
-	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
-		rte_atomic16_set(&buf->refcnt, 1);
+	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
+			    __ATOMIC_RELAXED) == 0)) {
+		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 		rte_mempool_put(buf->mp, buf);
 	}
 }
@@ -1676,7 +1677,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		if (consumed_strd == strd_n) {
 			/* Replace WQE only if the buffer is still in use. */
-			if (rte_atomic16_read(&buf->refcnt) > 1) {
+			if (__atomic_load_n(&buf->refcnt,
+					    __ATOMIC_RELAXED) > 1) {
 				mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
 				/* Release the old buffer. */
 				mlx5_mprq_buf_free(buf);
@@ -1766,9 +1768,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			void *buf_addr;
 
 			/* Increment the refcnt of the whole chunk. */
-			rte_atomic16_add_return(&buf->refcnt, 1);
-			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
-				    strd_n + 1);
+			__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
+			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
+				    __ATOMIC_RELAXED) <= strd_n + 1);
 			buf_addr = RTE_PTR_SUB(addr, headroom_sz);
 			/*
 			 * MLX5 device doesn't use iova but it is necessary in a

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index d32c4e430..1f453fe09 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -78,7 +78,7 @@ struct rxq_zip {
 /* Multi-Packet RQ buffer header. */
 struct mlx5_mprq_buf {
 	struct rte_mempool *mp;
-	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
+	uint16_t refcnt; /* Atomically accessed refcnt. */
 	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
 	struct rte_mbuf_ext_shared_info shinfos[];
 	/*

-- 
2.17.1
To order writes to various memory types, 'sfence' is required on x86 and 'dmb oshst' is required on aarch64, but within DPDK there is no abstracted barrier covering this combination: sfence (x86) / dmb (aarch64). So introduce a new barrier class, rte_dma_*mb, for this combination.

Doorbell rings are a typical use case of this new barrier class: they require data to be ready in memory before making HW aware of it.

As a note, rte_io_wmb and rte_cio_wmb are compiler barriers on x86, while rte_wmb is 'dsb' on aarch64.

In joint preliminary testing between Arm and Ampere, an 8%~13% performance boost was measured.

As there are no functionality changes, it will not impact x86.

Gavin Hu (6):
  eal: introduce new class of barriers for DMA use cases
  net/mlx5: dmb for immediate doorbell ring on aarch64
  net/mlx5: relax barrier to order UAR writes on aarch64
  net/mlx5: relax barrier for aarch64
  net/mlx5: add descriptive comment for a barrier
  doc: clarify one configuration in mlx5 guide

Phil Yang (1):
  net/mlx5: relax ordering for multi-packet RQ buffer refcnt

 doc/guides/nics/mlx5.rst                    |  6 ++--
 drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
 drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
 lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
 lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
 lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
 lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
 lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
 9 files changed, 78 insertions(+), 15 deletions(-)

-- 
2.17.1
In DPDK we use rte_*mb barriers to ensure that memory accesses to DMA regions are observed before MMIO accesses to hardware registers. On AArch64, the rte_*mb barriers are implemented by "DSB" (Data Synchronisation Barrier) style instructions, which are the strongest barriers possible.

Recently, however, it has been realised [1] that for devices whose MMIO regions are shared between all CPUs, it is possible to relax this memory barrier.

There are cases where we wish to retain the strength of the rte_*mb memory barriers; thus rather than relax rte_*mb we opt instead to introduce a new class of barrier, rte_dma_*mb. For AArch64, rte_dma_*mb will be implemented by a relaxed "DMB OSH" style of barrier. For other architectures, we implement rte_dma_*mb as rte_*mb, so this should not result in any functional changes.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
---
 lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
 lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
 lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
 lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
 lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
 5 files changed, 55 insertions(+)

diff --git a/lib/librte_eal/arm/include/rte_atomic_32.h b/lib/librte_eal/arm/include/rte_atomic_32.h
index 7dc0d06d1..80208467e 100644
--- a/lib/librte_eal/arm/include/rte_atomic_32.h
+++ b/lib/librte_eal/arm/include/rte_atomic_32.h
@@ -33,6 +33,12 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_mb() rte_mb()
+
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 #define rte_cio_wmb() rte_wmb()
 
 #define rte_cio_rmb() rte_rmb()

diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h
index 7b7099cdc..608726c29 100644
--- a/lib/librte_eal/arm/include/rte_atomic_64.h
+++ b/lib/librte_eal/arm/include/rte_atomic_64.h
@@ -37,6 +37,12 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_mb() asm volatile("dmb osh" : : : "memory")
+
+#define rte_dma_wmb() asm volatile("dmb oshst" : : : "memory")
+
+#define rte_dma_rmb() asm volatile("dmb oshld" : : : "memory")
+
 #define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
 
 #define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")

diff --git a/lib/librte_eal/include/generic/rte_atomic.h b/lib/librte_eal/include/generic/rte_atomic.h
index e6ab15a97..042264c7e 100644
--- a/lib/librte_eal/include/generic/rte_atomic.h
+++ b/lib/librte_eal/include/generic/rte_atomic.h
@@ -107,6 +107,37 @@ static inline void rte_io_wmb(void);
 static inline void rte_io_rmb(void);
 ///@}
 
+/** @name DMA Memory Barrier
+ */
+///@{
+/**
+ * memory barrier for DMA use cases
+ *
+ * Guarantees that the LOAD and STORE operations that precede the rte_dma_mb()
+ * call are visible to CPU and I/O device that is shared between all CPUs
+ * before the LOAD and STORE operations that follow it.
+ */
+static inline void rte_dma_mb(void);
+
+/**
+ * Write memory barrier for DMA use cases
+ *
+ * Guarantees that the STORE operations that precede the rte_dma_wmb() call are
+ * visible to CPU and I/O device that is shared between all CPUs before the
+ * STORE operations that follow it.
+ */
+static inline void rte_dma_wmb(void);
+
+/**
+ * Read memory barrier for DMA use cases
+ *
+ * Guarantees that the LOAD operations that precede the rte_dma_rmb() call are
+ * visible to CPU and IO device that is shared between all CPUs before the LOAD
+ * operations that follow it.
+ */
+static inline void rte_dma_rmb(void);
+///@}
+
 /** @name Coherent I/O Memory Barrier
  *
  * Coherent I/O memory barrier is a lightweight version of I/O memory

diff --git a/lib/librte_eal/ppc/include/rte_atomic.h b/lib/librte_eal/ppc/include/rte_atomic.h
index 7e3e13118..faa36bb76 100644
--- a/lib/librte_eal/ppc/include/rte_atomic.h
+++ b/lib/librte_eal/ppc/include/rte_atomic.h
@@ -36,6 +36,12 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_mb() rte_mb()
+
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 #define rte_cio_wmb() rte_wmb()
 
 #define rte_cio_rmb() rte_rmb()

diff --git a/lib/librte_eal/x86/include/rte_atomic.h b/lib/librte_eal/x86/include/rte_atomic.h
index 148398f50..0b1d452f3 100644
--- a/lib/librte_eal/x86/include/rte_atomic.h
+++ b/lib/librte_eal/x86/include/rte_atomic.h
@@ -79,6 +79,12 @@ rte_smp_mb(void)
 
 #define rte_io_rmb() rte_compiler_barrier()
 
+#define rte_dma_mb() rte_mb()
+
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 #define rte_cio_wmb() rte_compiler_barrier()
 
 #define rte_cio_rmb() rte_compiler_barrier()

-- 
2.17.1
A 'DMB' is enough to evict the merge buffer on aarch64 when the doorbell register is mapped as 'Normal-NC', the counterpart of WC on x86. Otherwise it is mapped as Device memory, and no barrier is required at all.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 939778aa5..e509f3b88 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -661,7 +661,7 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	rte_wmb();
 	mlx5_uar_write64_relaxed(*src, dst, txq->uar_lock);
 	if (cond)
-		rte_wmb();
+		rte_dma_wmb();
 }
 
 /**

-- 
2.17.1
To order writes to host memory and MMIO device memory, 'DMB' is sufficient on aarch64, an 'other-multi-copy' architecture; 'DSB' is overkill, especially in the fast path. Using rte_dma_wmb takes advantage of this on aarch64 while not impacting x86 and ppc.

Fixes: 6bf10ab69be0 ("net/mlx5: support 32-bit systems")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index e509f3b88..da5d81350 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -546,7 +546,7 @@ __mlx5_uar_write64_relaxed(uint64_t val, void *addr,
 static __rte_always_inline void
 __mlx5_uar_write64(uint64_t val, void *addr, rte_spinlock_t *lock)
 {
-	rte_io_wmb();
+	rte_dma_wmb();
 	__mlx5_uar_write64_relaxed(val, addr, lock);
 }

-- 
2.17.1
To ensure the WQE and doorbell record, which reside in host memory, are visible to HW before the blue frame, an ordered mlx5_uar_write call is sufficient; a rte_wmb is overkill on aarch64.

Fixes: 6cb559d67b83 ("net/mlx5: add vectorized Rx/Tx burst for x86")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index da5d81350..228e37de5 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -658,8 +658,7 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	rte_cio_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
-	rte_wmb();
-	mlx5_uar_write64_relaxed(*src, dst, txq->uar_lock);
+	mlx5_uar_write64(*src, dst, txq->uar_lock);
 	if (cond)
 		rte_dma_wmb();
 }

-- 
2.17.1
The barrier would not be required, or could be moved down, if HW waited for the doorbell ring before executing the WQE. This is not the case: HW can start executing the WQE as soon as it gets the ownership (passed by SW writing the doorbell record). Add a descriptive comment for this HW-specific behavior.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 228e37de5..737d5716d 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -655,6 +655,11 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	uint64_t *dst = MLX5_TX_BFREG(txq);
 	volatile uint64_t *src = ((volatile uint64_t *)wqe);
 
+	/* The ownership of the WQE is passed to HW by updating the
+	 * doorbell record. Once WQE ownership has been passed to the
+	 * HCA, HW can execute the WQE. The barrier ensures the WQE is
+	 * visible to HW before HW executes it.
+	 */
 	rte_cio_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */

-- 
2.17.1
From: Phil Yang <phil.yang@arm.com>

The PMD RX queue descriptor contains two mlx5_mprq_buf fields, which are the multi-packet RQ buffer header pointers. It uses the common rte_atomic_XXX functions to make sure the refcnt access is atomic. The common rte_atomic_XXX functions are full barriers on aarch64. Optimize them with one-way barriers to improve performance.

Fixes: 7d6bf6b866b8 ("net/mlx5: add Multi-Packet Rx support")
Cc: stable@dpdk.org

Suggested-by: Gavin Hu <gavin.hu@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 drivers/net/mlx5/mlx5_rxq.c  |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
 drivers/net/mlx5/mlx5_rxtx.h |  2 +-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 8a6b410ef..834057c3b 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -1539,7 +1539,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,
 	memset(_m, 0, sizeof(*buf));
 	buf->mp = mp;
-	rte_atomic16_set(&buf->refcnt, 1);
+	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 	for (j = 0; j != strd_n; ++j) {
 		shinfo = &buf->shinfos[j];
 		shinfo->free_cb = mlx5_mprq_buf_free_cb;

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index f3bf76376..039dd0a05 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1592,10 +1592,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
 {
 	struct mlx5_mprq_buf *buf = opaque;
 
-	if (rte_atomic16_read(&buf->refcnt) == 1) {
+	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
 		rte_mempool_put(buf->mp, buf);
-	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
-		rte_atomic16_set(&buf->refcnt, 1);
+	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
+			    __ATOMIC_RELAXED) == 0)) {
+		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 		rte_mempool_put(buf->mp, buf);
 	}
 }
@@ -1676,7 +1677,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		if (consumed_strd == strd_n) {
 			/* Replace WQE only if the buffer is still in use. */
-			if (rte_atomic16_read(&buf->refcnt) > 1) {
+			if (__atomic_load_n(&buf->refcnt,
+					    __ATOMIC_RELAXED) > 1) {
 				mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
 				/* Release the old buffer. */
 				mlx5_mprq_buf_free(buf);
@@ -1766,9 +1768,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			void *buf_addr;
 
 			/* Increment the refcnt of the whole chunk. */
-			rte_atomic16_add_return(&buf->refcnt, 1);
-			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
-				    strd_n + 1);
+			__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
+			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
+				    __ATOMIC_RELAXED) <= strd_n + 1);
 			buf_addr = RTE_PTR_SUB(addr, headroom_sz);
 			/*
 			 * MLX5 device doesn't use iova but it is necessary in a

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 737d5716d..d0a1bffa5 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -78,7 +78,7 @@ struct rxq_zip {
 /* Multi-Packet RQ buffer header. */
 struct mlx5_mprq_buf {
 	struct rte_mempool *mp;
-	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
+	uint16_t refcnt; /* Atomically accessed refcnt. */
 	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
 	struct rte_mbuf_ext_shared_info shinfos[];
 	/*

-- 
2.17.1
The 'tx_db_nc' name is used to differentiate two mapping types, WC and non-WC, but both are actually non-cacheable. Write-Combining on x86 is non-cacheable; Normal-NC, its counterpart on aarch64, is non-cacheable too: as its name suggests, the cache hierarchy is bypassed for accesses to both memory regions. Interpreting the flag as 'non-cacheable' therefore makes no distinction and is confusing; re-interpreting it as 'non-combining' makes the distinction clear. Also explain why the aarch64 default for this setting is different.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/nics/mlx5.rst | 6 ++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index afd11cd83..addec9f12 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -610,9 +610,9 @@ Run-time configuration
   The rdma core library can map doorbell register in two ways, depending on the
   environment variable "MLX5_SHUT_UP_BF":
 
-  - As regular cached memory (usually with write combining attribute), if the
+  - As regular memory (usually with write combining attribute), if the
     variable is either missing or set to zero.
-  - As non-cached memory, if the variable is present and set to not "0" value.
+  - As non-combining memory, if the variable is present and set to not "0" value.
 
   The type of mapping may slightly affect the Tx performance, the optimal choice
   is strongly relied on the host architecture and should be deduced practically.
@@ -638,6 +638,8 @@ Run-time configuration
   If ``tx_db_nc`` is omitted or set to zero, the preset (if any) environment
   variable "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF",
   the default ``tx_db_nc`` value is zero for ARM64 hosts and one for others.
+  ARM64 is different because it has a gathering buffer, the enforced barrier
+  can evict the doorbell ring immediately.
 
 - ``tx_vec_en`` parameter [int]

-- 
2.17.1
On 4/10/20 7:41 PM, Gavin Hu wrote: > To order writes to various memory types, 'sfence' is required for x86, > and 'dmb oshst' is required for aarch64. > > But within DPDK, there is no abstracted barriers covers this > combination: sfence(x86)/dmb(aarch64). > > So introduce a new barrier class - rte_dma_*mb for this combination, > > Doorbell rings are typical use cases of this new barrier class, which > requires something ready in the memory before letting HW aware. > > As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86, while > rte_wmb is 'dsb' for aarch64. As far as I can see rte_cio_wmb() is exactly definition of the barrier to be used for doorbells. Am I missing something? May be it is just a bug in rte_cio_wmb() on x86? > In the joint preliminary testing between Arm and Ampere, 8%~13% > performance boost was measured. > > As there is no functionality changes, it will not impact x86. > > Gavin Hu (6): > eal: introduce new class of barriers for DMA use cases > net/mlx5: dmb for immediate doorbell ring on aarch64 > net/mlx5: relax barrier to order UAR writes on aarch64 > net/mlx5: relax barrier for aarch64 > net/mlx5: add descriptive comment for a barrier > doc: clarify one configuration in mlx5 guide > > Phil Yang (1): > net/mlx5: relax ordering for multi-packet RQ buffer refcnt > > doc/guides/nics/mlx5.rst | 6 ++-- > drivers/net/mlx5/mlx5_rxq.c | 2 +- > drivers/net/mlx5/mlx5_rxtx.c | 16 ++++++----- > drivers/net/mlx5/mlx5_rxtx.h | 14 ++++++---- > lib/librte_eal/arm/include/rte_atomic_32.h | 6 ++++ > lib/librte_eal/arm/include/rte_atomic_64.h | 6 ++++ > lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++ > lib/librte_eal/ppc/include/rte_atomic.h | 6 ++++ > lib/librte_eal/x86/include/rte_atomic.h | 6 ++++ > 9 files changed, 78 insertions(+), 15 deletions(-) >
Hi Andrew, > -----Original Message----- > From: Andrew Rybchenko <arybchenko@solarflare.com> > Sent: Saturday, April 11, 2020 1:21 AM > To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org > Cc: nd <nd@arm.com>; david.marchand@redhat.com; > thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com; > bruce.richardson@intel.com; konstantin.ananyev@intel.com; > matan@mellanox.com; shahafs@mellanox.com; viacheslavo@mellanox.com; > jerinj@marvell.com; Honnappa Nagarahalli > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com> > Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and > use it for mlx5 PMD > > On 4/10/20 7:41 PM, Gavin Hu wrote: > > To order writes to various memory types, 'sfence' is required for x86, > > and 'dmb oshst' is required for aarch64. > > > > But within DPDK, there is no abstracted barriers covers this > > combination: sfence(x86)/dmb(aarch64). > > > > So introduce a new barrier class - rte_dma_*mb for this combination, > > > > Doorbell rings are typical use cases of this new barrier class, which > > requires something ready in the memory before letting HW aware. > > > > As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86, > while > > rte_wmb is 'dsb' for aarch64. > > As far as I can see rte_cio_wmb() is exactly definition of the barrier > to be used for doorbells. Am I missing something? I understand rte_cio_wmb is for DMA buffers, for examples, descriptors, work queues, located in the host memory, but shared between CPU and IO device. rte_io_wmb is for MMIO regions. We are missing the barriers for various memory types, eg. Doorbell cases. There is an implication in the definition of rte_cio_wmb, it can not be used for non-coherent MMIO region(WC?) 
http://code.dpdk.org/dpdk/v20.02/source/lib/librte_eal/common/include/generic/rte_atomic.h#L124 > May be it is just a bug in rte_cio_wmb() on x86? rte_cio_wmb is ok for doorbells on aarch64, but looking through the kernel code, 'sfence' is required for various/mixed memory types. DPDK mlx5 PMD uses rte_cio_wmb widely and wisely, it orders sequences of writes to host memory that shared by IO device. Strengthening rte_cio_wmb may hurt performance, so a new barrier class is introduced to optimize for aarch64, in the fast path only, while not impacting x86. http://code.dpdk.org/dpdk/v20.02/source/drivers/net/mlx5/mlx5_rxtx.c#L1087 /Gavin > > > In the joint preliminary testing between Arm and Ampere, 8%~13% > > performance boost was measured. > > > > As there is no functionality changes, it will not impact x86. > > > > Gavin Hu (6): > > eal: introduce new class of barriers for DMA use cases > > net/mlx5: dmb for immediate doorbell ring on aarch64 > > net/mlx5: relax barrier to order UAR writes on aarch64 > > net/mlx5: relax barrier for aarch64 > > net/mlx5: add descriptive comment for a barrier > > doc: clarify one configuration in mlx5 guide > > > > Phil Yang (1): > > net/mlx5: relax ordering for multi-packet RQ buffer refcnt > > > > doc/guides/nics/mlx5.rst | 6 ++-- > > drivers/net/mlx5/mlx5_rxq.c | 2 +- > > drivers/net/mlx5/mlx5_rxtx.c | 16 ++++++----- > > drivers/net/mlx5/mlx5_rxtx.h | 14 ++++++---- > > lib/librte_eal/arm/include/rte_atomic_32.h | 6 ++++ > > lib/librte_eal/arm/include/rte_atomic_64.h | 6 ++++ > > lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++ > > lib/librte_eal/ppc/include/rte_atomic.h | 6 ++++ > > lib/librte_eal/x86/include/rte_atomic.h | 6 ++++ > > 9 files changed, 78 insertions(+), 15 deletions(-) > >
On 4/11/20 6:46 AM, Gavin Hu wrote: > Hi Andrew, > >> -----Original Message----- >> From: Andrew Rybchenko <arybchenko@solarflare.com> >> Sent: Saturday, April 11, 2020 1:21 AM >> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org >> Cc: nd <nd@arm.com>; david.marchand@redhat.com; >> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com; >> bruce.richardson@intel.com; konstantin.ananyev@intel.com; >> matan@mellanox.com; shahafs@mellanox.com; viacheslavo@mellanox.com; >> jerinj@marvell.com; Honnappa Nagarahalli >> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang >> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong >> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com> >> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and >> use it for mlx5 PMD >> >> On 4/10/20 7:41 PM, Gavin Hu wrote: >>> To order writes to various memory types, 'sfence' is required for x86, >>> and 'dmb oshst' is required for aarch64. >>> >>> But within DPDK, there is no abstracted barriers covers this >>> combination: sfence(x86)/dmb(aarch64). >>> >>> So introduce a new barrier class - rte_dma_*mb for this combination, >>> >>> Doorbell rings are typical use cases of this new barrier class, which >>> requires something ready in the memory before letting HW aware. >>> >>> As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86, >> while >>> rte_wmb is 'dsb' for aarch64. >> >> As far as I can see rte_cio_wmb() is exactly definition of the barrier >> to be used for doorbells. Am I missing something? > > I understand rte_cio_wmb is for DMA buffers, for examples, descriptors, work queues, located in the host memory, but shared between CPU and IO device. > rte_io_wmb is for MMIO regions. > We are missing the barriers for various memory types, eg. Doorbell cases. When the patch series is applied, we'll have 5 types of memory barriers: regular, smp, cio, io, dma. Do we really need so many? 
Maybe we need a table in the description which could help to make the right choice, i.e. the type of access on both axes and the type of barrier to use at the intersection. > There is an implication in the definition of rte_cio_wmb, it can not be used for non-coherent MMIO region(WC?) > http://code.dpdk.org/dpdk/v20.02/source/lib/librte_eal/common/include/generic/rte_atomic.h#L124 >> May be it is just a bug in rte_cio_wmb() on x86? > rte_cio_wmb is ok for doorbells on aarch64, but looking through the kernel code, 'sfence' is required for various/mixed memory types. > DPDK mlx5 PMD uses rte_cio_wmb widely and wisely, it orders sequences of writes to host memory that shared by IO device. > Strengthening rte_cio_wmb may hurt performance, so a new barrier class is introduced to optimize for aarch64, in the fast path only, while not impacting x86. > http://code.dpdk.org/dpdk/v20.02/source/drivers/net/mlx5/mlx5_rxtx.c#L1087 Maybe it is my problem that I don't fully understand the real-life use cases where cio should be used in accordance with its current definition. Does it make sense without a doorbell? Does HW poll via DMA? Thanks for explanations, Andrew. >> >>> In the joint preliminary testing between Arm and Ampere, 8%~13% >>> performance boost was measured. >>> >>> As there is no functionality changes, it will not impact x86. 
>>> <snip> >
Hi Andrew, > -----Original Message----- > From: Andrew Rybchenko <arybchenko@solarflare.com> > Sent: Monday, April 13, 2020 5:52 PM > To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org > Cc: nd <nd@arm.com>; david.marchand@redhat.com; > thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com; > bruce.richardson@intel.com; konstantin.ananyev@intel.com; > matan@mellanox.com; shahafs@mellanox.com; viacheslavo@mellanox.com; > jerinj@marvell.com; Honnappa Nagarahalli > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang > <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong > <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com> > Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and > use it for mlx5 PMD > > On 4/11/20 6:46 AM, Gavin Hu wrote: > > Hi Andrew, > > > >> -----Original Message----- > >> From: Andrew Rybchenko <arybchenko@solarflare.com> > >> Sent: Saturday, April 11, 2020 1:21 AM > >> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org > >> Cc: nd <nd@arm.com>; david.marchand@redhat.com; > >> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com; > >> bruce.richardson@intel.com; konstantin.ananyev@intel.com; > >> matan@mellanox.com; shahafs@mellanox.com; > viacheslavo@mellanox.com; > >> jerinj@marvell.com; Honnappa Nagarahalli > >> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang > >> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong > >> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com> > >> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class > and > >> use it for mlx5 PMD > >> > >> On 4/10/20 7:41 PM, Gavin Hu wrote: > >>> To order writes to various memory types, 'sfence' is required for x86, > >>> and 'dmb oshst' is required for aarch64. > >>> > >>> But within DPDK, there is no abstracted barriers covers this > >>> combination: sfence(x86)/dmb(aarch64). 
> >>> <snip> > >> > >> As far as I can see rte_cio_wmb() is exactly definition of the barrier > >> to be used for doorbells. Am I missing something? > > > > I understand rte_cio_wmb is for DMA buffers, for examples, descriptors, work queues, located in the host memory, but shared between CPU and IO device. > > rte_io_wmb is for MMIO regions. > > We are missing the barriers for various memory types, eg. Doorbell cases. > > When the patch series is applied, we'll have 5 types of memory > barriers: regular, smp, cio, io, dma. Do we really need so > many? May be we need a table in description which could > help to make the right choice. I.e. type of access on both > axis and type of barrier to use on intersection. Yes, good suggestion! Actually Honnappa and I already made a table sheet for this. Will provide it in the next release! Thanks for your opinions! > <snip>
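The barrier class proposed in the cover letter above combines 'sfence' on x86 with 'dmb oshst' on aarch64. The actual patch body is not part of this thread, so the following is only a sketch of the idea; the names demo_dma_wmb() and demo_ring_doorbell() are illustrative, not DPDK API:

```c
#include <stdint.h>

/* Sketch of a doorbell/DMA write barrier: 'sfence' on x86,
 * 'dmb oshst' on aarch64, and a generic release fence elsewhere.
 * This is an illustration, not the real DPDK definition. */
#if defined(__x86_64__) || defined(__i386__)
#define demo_dma_wmb() __asm__ volatile("sfence" : : : "memory")
#elif defined(__aarch64__)
#define demo_dma_wmb() __asm__ volatile("dmb oshst" : : : "memory")
#else
#define demo_dma_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#endif

/* Typical doorbell use: publish a descriptor in host memory, then
 * ring the doorbell so the HW fetches it. */
static inline void demo_ring_doorbell(volatile uint64_t *desc, uint64_t value,
				      volatile uint32_t *doorbell, uint32_t tail)
{
	*desc = value;    /* descriptor the device will DMA-read      */
	demo_dma_wmb();   /* order the descriptor before the doorbell */
	*doorbell = tail; /* doorbell write (MMIO in a real device)   */
}
```

On x86 the 'sfence' branch matches what the thread argues is needed for mixed/WC memory types, while on aarch64 'dmb oshst' stays cheaper than a full 'dsb st'.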
Change the barrier APIs for IO to reflect that Armv8-a follows the other-multi-copy atomicity memory model.

The Armv8-a memory model has been strengthened to require other-multi-copy atomicity. This property requires memory accesses from an observer to become visible to all other observers simultaneously [3]. This means

a) A write arriving at an endpoint shared between multiple CPUs is
   visible to all CPUs
b) A write that is visible to all CPUs is also visible to all other
   observers in the shareability domain

This allows for using cheaper DMB instructions in place of DSB for devices that are visible to all CPUs (i.e. the devices that DPDK caters to).

Please refer to [1], [2] and [3] for more information.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
[2] https://www.youtube.com/watch?v=i6DayghhA8Q
[3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h
index 7b7099cdc..e406411bb 100644
--- a/lib/librte_eal/arm/include/rte_atomic_64.h
+++ b/lib/librte_eal/arm/include/rte_atomic_64.h
@@ -19,11 +19,11 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_debug.h>

-#define rte_mb() asm volatile("dsb sy" : : : "memory")
+#define rte_mb() asm volatile("dmb osh" : : : "memory")

-#define rte_wmb() asm volatile("dsb st" : : : "memory")
+#define rte_wmb() asm volatile("dmb oshst" : : : "memory")

-#define rte_rmb() asm volatile("dsb ld" : : : "memory")
+#define rte_rmb() asm volatile("dmb oshld" : : : "memory")

 #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")

@@ -37,9 +37,9 @@ extern "C" {

 #define rte_io_rmb() rte_rmb()

-#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
+#define rte_cio_wmb() rte_wmb()

-#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
+#define rte_cio_rmb() rte_rmb()

 /*------------------------ 128 bit atomic operations -------------------------*/
--
2.17.1
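As a concrete picture of where these barriers sit in a fast path, the mlx5 doorbell sequence discussed earlier in this thread looks roughly like the following. C11 release fences stand in for the DPDK barriers so the sketch compiles anywhere; with this patch, rte_cio_wmb() and rte_wmb() both become "dmb oshst" on aarch64. All names here are a simplified model, not the real driver code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simplified model of the mlx5 mlx5_tx_dbrec() path: the WQE and the
 * doorbell record live in host memory shared with the device, while
 * the BlueFlame (BF) register is MMIO in real hardware. */
struct demo_txq {
	volatile uint64_t wqe;     /* work queue entry (host memory)  */
	volatile uint32_t db_rec;  /* doorbell record (host memory)   */
	volatile uint64_t bf_reg;  /* BlueFlame register (MMIO in HW) */
	uint32_t wqe_ci;           /* WQE consumer index              */
};

static void demo_tx_dbrec(struct demo_txq *txq, uint64_t wqe)
{
	txq->wqe = wqe;
	txq->wqe_ci++;
	/* Make the WQE visible to HW before the doorbell record:
	 * rte_cio_wmb() in the real driver. */
	atomic_thread_fence(memory_order_release);
	txq->db_rec = txq->wqe_ci;
	/* Ensure ordering between the DB record and the BF copy. */
	atomic_thread_fence(memory_order_release);
	txq->bf_reg = wqe;   /* ring the doorbell via BlueFlame */
}
```

The patch above affects only how heavy the two fence points are on aarch64, not the structure of the sequence.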
> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Tuesday, May 12, 2020 2:07 AM
> To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit
> Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>;
> igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> arybchenko@solarflare.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: [RFC] eal: adjust barriers for IO on Armv8-a
>
> Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy
> atomicity memory model.
>
> Armv8-a memory model has been strengthened to require other-multi-copy
> atomicity. This property requires memory accesses from an observer to
> become visible to all other observers simultaneously [3]. This means
>
> a) A write arriving at an endpoint shared between multiple CPUs is
> visible to all CPUs
> b) A write that is visible to all CPUs is also visible to all other
> observers in the shareability domain
>
> This allows for using cheaper DMB instructions in the place of DSB for devices
> that are visible to all CPUs (i.e. devices that DPDK caters to).
>
> Please refer to [1], [2] and [3] for more information.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i
> d=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
> lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> b/lib/librte_eal/arm/include/rte_atomic_64.h
> index 7b7099cdc..e406411bb 100644
> --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> @@ -19,11 +19,11 @@ extern "C" {
> #include <rte_compat.h>
> #include <rte_debug.h>
>
> -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> +#define rte_mb() asm volatile("dmb osh" : : : "memory")
>
> -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
>
> -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
>
> #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
>
> @@ -37,9 +37,9 @@ extern "C" {
>
> #define rte_io_rmb() rte_rmb()
>
> -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> +#define rte_cio_wmb() rte_wmb()
>
> -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> +#define rte_cio_rmb() rte_rmb()
>
> /*------------------------ 128 bit atomic operations -------------------------*/
>
> --
> 2.17.1
This change showed about a 7% performance gain in a testpmd single-core NDR test.
Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <Ruifeng.Wang@arm.com> wrote: > > > > -----Original Message----- > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > Sent: Tuesday, May 12, 2020 2:07 AM > > To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit > > Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>; > > igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com; > > arybchenko@solarflare.com; Honnappa Nagarahalli > > <Honnappa.Nagarahalli@arm.com> > > Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a > > > > Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy > > atomicity memory model. > > > > Armv8-a memory model has been strengthened to require other-multi-copy > > atomicity. This property requires memory accesses from an observer to > > become visible to all other observers simultaneously [3]. This means > > > > a) A write arriving at an endpoint shared between multiple CPUs is > > visible to all CPUs > > b) A write that is visible to all CPUs is also visible to all other > > observers in the shareability domain > > > > This allows for using cheaper DMB instructions in the place of DSB for devices > > that are visible to all CPUs (i.e. devices that DPDK caters to). > > > > Please refer to [1], [2] and [3] for more information. 
> > > > [1] > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i > > d=22ec71615d824f4f11d38d0e55a88d8956b7e45f > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/ > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > --- > > lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++----- > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h > > b/lib/librte_eal/arm/include/rte_atomic_64.h > > index 7b7099cdc..e406411bb 100644 > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h > > @@ -19,11 +19,11 @@ extern "C" { > > #include <rte_compat.h> > > #include <rte_debug.h> > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory") > > +#define rte_mb() asm volatile("dmb osh" : : : "memory") > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory") > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory") > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory") > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory") > > > > #define rte_smp_mb() asm volatile("dmb ish" : : : "memory") > > > > @@ -37,9 +37,9 @@ extern "C" { > > > > #define rte_io_rmb() rte_rmb() > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory") > > +#define rte_cio_wmb() rte_wmb() > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory") > > +#define rte_cio_rmb() rte_rmb() > > > > /*------------------------ 128 bit atomic operations -------------------------*/ > > > > -- > > 2.17.1 > > This change showed about 7% performance gain in testpmd single core NDR test. I am trying to understand this patch wrt DPDK current usage model? 1) Is performance improvement due to the fact that the PMD that you are using it for testing suppose to use existing rte_cio_* but it was using rte_[rw]mb? 
2) In my understanding:
a) CPU to CPU barrier requirements are addressed by rte_smp_*
b) CPU to DMA/Device barrier requirements are addressed by rte_cio_*
c) CPU to ANY (CPU or Device) requirements are addressed by rte_[rw]mb

If (c) is true, then we are violating the DPDK spec with this change. Right?

This change will not be required if the fastpath (CPU to Device) is using rte_cio_*. Right?

> Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
>
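The three cases in this taxonomy can be annotated in a toy enqueue path. C11 fences are portable stand-ins for the DPDK APIs named in the comments; this only illustrates which barrier class applies where, it is not real driver code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy context mixing all three sharing domains from the taxonomy. */
struct demo_ctx {
	uint64_t ring[4];           /* SW ring shared between cores      */
	atomic_uint ring_tail;      /* ring producer index               */
	volatile uint64_t dma_desc; /* descriptor DMA-read by the device */
	volatile uint32_t doorbell; /* device doorbell register          */
};

static void demo_enqueue(struct demo_ctx *c, uint64_t pkt)
{
	unsigned int t = atomic_load_explicit(&c->ring_tail,
					      memory_order_relaxed);

	/* (a) CPU to CPU: publish to the SW ring for another core. */
	c->ring[t & 3] = pkt;
	atomic_thread_fence(memory_order_release);      /* rte_smp_wmb() */
	atomic_store_explicit(&c->ring_tail, t + 1, memory_order_relaxed);

	/* (b) CPU to DMA/Device via coherent host memory: order the
	 * descriptor write before the doorbell.           rte_cio_wmb() */
	c->dma_desc = pkt;
	atomic_thread_fence(memory_order_release);

	/* (c) CPU to ANY: rte_wmb() is the conservative choice before
	 * the MMIO doorbell write; the debate in this thread is how
	 * heavy it has to be on aarch64. */
	c->doorbell = (uint32_t)(t + 1);
}
```

The question raised above is whether (c) may be weakened from DSB to DMB on Armv8-a without breaking the contract, given other-multi-copy atomicity.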
> -----Original Message----- > From: Jerin Jacob <jerinjacobk@gmail.com> > Sent: Tuesday, May 12, 2020 2:42 PM > To: Ruifeng Wang <Ruifeng.Wang@arm.com> > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit > Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>; > igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com; > arybchenko@solarflare.com; nd <nd@arm.com> > Subject: Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a > > On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <Ruifeng.Wang@arm.com> > wrote: > > > > > > > -----Original Message----- > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > > Sent: Tuesday, May 12, 2020 2:07 AM > > > To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit > > > Khaparde (ajit.khaparde@broadcom.com) > <ajit.khaparde@broadcom.com>; > > > igorch@amazon.com; thomas@monjalon.net; > viacheslavo@mellanox.com; > > > arybchenko@solarflare.com; Honnappa Nagarahalli > > > <Honnappa.Nagarahalli@arm.com> > > > Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a > > > > > > Change the barrier APIs for IO to reflect that Armv8-a is > > > other-multi-copy atomicity memory model. > > > > > > Armv8-a memory model has been strengthened to require > > > other-multi-copy atomicity. This property requires memory accesses > > > from an observer to become visible to all other observers > > > simultaneously [3]. This means > > > > > > a) A write arriving at an endpoint shared between multiple CPUs is > > > visible to all CPUs > > > b) A write that is visible to all CPUs is also visible to all other > > > observers in the shareability domain > > > > > > This allows for using cheaper DMB instructions in the place of DSB > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to). 
> > > > > > Please refer to [1], [2] and [3] for more information. > > > > > > [1] > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/c > > > ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/ > > > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > > --- > > > lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++----- > > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h > > > b/lib/librte_eal/arm/include/rte_atomic_64.h > > > index 7b7099cdc..e406411bb 100644 > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h > > > @@ -19,11 +19,11 @@ extern "C" { > > > #include <rte_compat.h> > > > #include <rte_debug.h> > > > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory") > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory") > > > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory") > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory") > > > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory") > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory") > > > > > > #define rte_smp_mb() asm volatile("dmb ish" : : : "memory") > > > > > > @@ -37,9 +37,9 @@ extern "C" { > > > > > > #define rte_io_rmb() rte_rmb() > > > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory") > > > +#define rte_cio_wmb() rte_wmb() > > > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory") > > > +#define rte_cio_rmb() rte_rmb() > > > > > > /*------------------------ 128 bit atomic operations > > > -------------------------*/ > > > > > > -- > > > 2.17.1 > > > > This change showed about 7% performance gain in testpmd single core > NDR test. > > I am trying to understand this patch wrt DPDK current usage model? 
> > 1) Is performance improvement due to the fact that the PMD that you are > using it for testing suppose to use existing rte_cio_* but it was using > rte_[rw]mb? This is part of the reason. There are also cases where rte_io_* was used and can be relaxed. Such as: http://patches.dpdk.org/patch/68162/ > 2) In my understanding : > a) CPU to CPU barrier requirements are addressed by rte_smp_* > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_* > c) CPU to ANY(CPU or Device) are addressed by rte_[rw]mb > > If (c) is true then we are violating the DPDK spec with change. Right? Developers are still required to use the correct barrier APIs for the different use cases. I think this change mitigates the performance penalty when a non-optimal barrier is used. > This change will not be required if fastpath (CPU to Device) is using rte_cio_*. > Right? See 1). Correct usage of rte_cio_* is not the whole story. For some other use cases, such as a barrier between accesses of different memory types, we can also use the lighter 'dmb' barrier. > > > > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com> > >
On Tue, May 12, 2020 at 1:32 PM Ruifeng Wang <Ruifeng.Wang@arm.com> wrote: > > > > -----Original Message----- > > From: Jerin Jacob <jerinjacobk@gmail.com> > > Sent: Tuesday, May 12, 2020 2:42 PM > > To: Ruifeng Wang <Ruifeng.Wang@arm.com> > > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > > dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit > > Khaparde (ajit.khaparde@broadcom.com) <ajit.khaparde@broadcom.com>; > > igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com; > > arybchenko@solarflare.com; nd <nd@arm.com> > > Subject: Re: [dpdk-dev] [RFC] eal: adjust barriers for IO on Armv8-a > > > > On Tue, May 12, 2020 at 11:48 AM Ruifeng Wang <Ruifeng.Wang@arm.com> > > wrote: > > > > > > > > > > -----Original Message----- > > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > > > Sent: Tuesday, May 12, 2020 2:07 AM > > > > To: dev@dpdk.org; jerinj@marvell.com; hemant.agrawal@nxp.com; Ajit > > > > Khaparde (ajit.khaparde@broadcom.com) > > <ajit.khaparde@broadcom.com>; > > > > igorch@amazon.com; thomas@monjalon.net; > > viacheslavo@mellanox.com; > > > > arybchenko@solarflare.com; Honnappa Nagarahalli > > > > <Honnappa.Nagarahalli@arm.com> > > > > Cc: Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a > > > > > > > > Change the barrier APIs for IO to reflect that Armv8-a is > > > > other-multi-copy atomicity memory model. > > > > > > > > Armv8-a memory model has been strengthened to require > > > > other-multi-copy atomicity. This property requires memory accesses > > > > from an observer to become visible to all other observers > > > > simultaneously [3]. 
This means > > > > > > > > a) A write arriving at an endpoint shared between multiple CPUs is > > > > visible to all CPUs > > > > b) A write that is visible to all CPUs is also visible to all other > > > > observers in the shareability domain > > > > > > > > This allows for using cheaper DMB instructions in the place of DSB > > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to). > > > > > > > > Please refer to [1], [2] and [3] for more information. > > > > > > > > [1] > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/c > > > > ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f > > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q > > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/ > > > > > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > > > --- > > > > lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++----- > > > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h > > > > b/lib/librte_eal/arm/include/rte_atomic_64.h > > > > index 7b7099cdc..e406411bb 100644 > > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h > > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h > > > > @@ -19,11 +19,11 @@ extern "C" { > > > > #include <rte_compat.h> > > > > #include <rte_debug.h> > > > > > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory") > > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory") > > > > > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory") > > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory") > > > > > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory") > > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory") > > > > > > > > #define rte_smp_mb() asm volatile("dmb ish" : : : "memory") > > > > > > > > @@ -37,9 +37,9 @@ extern "C" { > > > > > > > > #define rte_io_rmb() rte_rmb() > > > > > > > > -#define rte_cio_wmb() asm volatile("dmb 
oshst" : : : "memory") > > > > +#define rte_cio_wmb() rte_wmb() > > > > > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory") > > > > +#define rte_cio_rmb() rte_rmb() > > > > > > > > /*------------------------ 128 bit atomic operations > > > > -------------------------*/ > > > > > > > > -- > > > > 2.17.1 > > > > > > This change showed about 7% performance gain in testpmd single core > > NDR test. > > > > I am trying to understand this patch wrt DPDK current usage model? > > > > 1) Is performance improvement due to the fact that the PMD that you are > > using it for testing suppose to use existing rte_cio_* but it was using > > rte_[rw]mb? > > This is part of the reason. There are also cases where rte_io_* was used and can be relaxed. > Such as: http://patches.dpdk.org/patch/68162/ > > > 2) In my understanding : > > a) CPU to CPU barrier requirements are addressed by rte_smp_* > > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_* > > c) CPU to ANY(CPU or Device) are addressed by rte_[rw]mb > > > > If (c) is true then we are violating the DPDK spec with change. Right? > > Developers are still required to use correct barrier APIs for different use cases. > I think this change mitigates performance penalty when non optimal barrier is used. But does it violate the contract? We are using rte_[rw]mb as the heavyweight barrier that works in all cases; I think that is the contract to DPDK consumers. For a different requirement, we have a specific API. IMO, it makes sense to change the fastpath code to use finer-grained barriers based on need, rather than making the generic one lightweight. I.e. rte_[rw]mb is the superset that works in all cases; use a customized one for the specific use case. > > > This change will not be required if fastpath (CPU to Device) is using rte_cio_*. > > Right? > > See 1). Correct usage of rte_cio_* is not the whole. 
> For some other use cases, such as barrier between accesses of different memory types, we can also use lighter barrier 'dmb'. > > > > > > > > > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > >
<snip> > > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a > > > > > > > > Change the barrier APIs for IO to reflect that Armv8-a is > > > > other-multi-copy atomicity memory model. > > > > > > > > Armv8-a memory model has been strengthened to require > > > > other-multi-copy atomicity. This property requires memory accesses > > > > from an observer to become visible to all other observers > > > > simultaneously [3]. This means > > > > > > > > a) A write arriving at an endpoint shared between multiple CPUs is > > > > visible to all CPUs > > > > b) A write that is visible to all CPUs is also visible to all other > > > > observers in the shareability domain > > > > > > > > This allows for using cheaper DMB instructions in the place of DSB > > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to). > > > > > > > > Please refer to [1], [2] and [3] for more information. > > > > > > > > [1] > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > > > /c ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f > > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q > > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/ > > > > > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > > > --- > > > > lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++----- > > > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h > > > > b/lib/librte_eal/arm/include/rte_atomic_64.h > > > > index 7b7099cdc..e406411bb 100644 > > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h > > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h > > > > @@ -19,11 +19,11 @@ extern "C" { > > > > #include <rte_compat.h> > > > > #include <rte_debug.h> > > > > > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory") > > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory") > > > > > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory") > > 
> > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory") > > > > > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory") > > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory") > > > > > > > > #define rte_smp_mb() asm volatile("dmb ish" : : : "memory") > > > > > > > > @@ -37,9 +37,9 @@ extern "C" { > > > > > > > > #define rte_io_rmb() rte_rmb() > > > > > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory") > > > > +#define rte_cio_wmb() rte_wmb() > > > > > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory") > > > > +#define rte_cio_rmb() rte_rmb() > > > > > > > > /*------------------------ 128 bit atomic operations > > > > -------------------------*/ > > > > > > > > -- > > > > 2.17.1 > > > > > > This change showed about 7% performance gain in testpmd single core > > NDR test. > > > > I am trying to understand this patch wrt DPDK current usage model? > > > > 1) Is performance improvement due to the fact that the PMD that you > > are using it for testing suppose to use existing rte_cio_* but it was > > using rte_[rw]mb? No, it is supposed to use rte_[rw]mb for x86. > > This is part of the reason. There are also cases where rte_io_* was used and > can be relaxed. > Such as: http://patches.dpdk.org/patch/68162/ > > > 2) In my understanding : > > a) CPU to CPU barrier requirements are addressed by rte_smp_* > > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_* > > c) CPU to ANY(CPU or Device) are addressed by rte_[rw]mb > > > > If (c) is true then we are violating the DPDK spec with change. Right? No, we are not. Essentially, due to the other-multi-copy atomicity behavior of the architecture, we are saying 'DMB OSH*' is enough to achieve (c). > > Developers are still required to use correct barrier APIs for different use cases. > I think this change mitigates performance penalty when non optimal barrier is > used. 
> > > This change will not be required if fastpath (CPU to Device) is using > rte_cio_*. > > Right? Yes. It is required when the fastpath uses rte_[rw]mb. > > See 1). Correct usage of rte_cio_* is not the whole. > For some other use cases, such as barrier between accesses of different > memory types, we can also use lighter barrier 'dmb'. > > > > > > > > > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > >
On Wed, May 13, 2020 at 3:14 AM Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote: > > <snip> > > > > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a > > > > > > > > > > Change the barrier APIs for IO to reflect that Armv8-a is > > > > > other-multi-copy atomicity memory model. > > > > > > > > > > Armv8-a memory model has been strengthened to require > > > > > other-multi-copy atomicity. This property requires memory accesses > > > > > from an observer to become visible to all other observers > > > > > simultaneously [3]. This means > > > > > > > > > > a) A write arriving at an endpoint shared between multiple CPUs is > > > > > visible to all CPUs > > > > > b) A write that is visible to all CPUs is also visible to all other > > > > > observers in the shareability domain > > > > > > > > > > This allows for using cheaper DMB instructions in the place of DSB > > > > > for devices that are visible to all CPUs (i.e. devices that DPDK caters to). > > > > > > > > > > Please refer to [1], [2] and [3] for more information. 
> > > > > > > > > > [1] > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > > > > /c ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f > > > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q > > > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/ > > > > > > > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > > > > --- > > > > > lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++----- > > > > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > > > > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h > > > > > b/lib/librte_eal/arm/include/rte_atomic_64.h > > > > > index 7b7099cdc..e406411bb 100644 > > > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h > > > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h > > > > > @@ -19,11 +19,11 @@ extern "C" { > > > > > #include <rte_compat.h> > > > > > #include <rte_debug.h> > > > > > > > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory") > > > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory") > > > > > > > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory") > > > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory") > > > > > > > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory") > > > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory") > > > > > > > > > > #define rte_smp_mb() asm volatile("dmb ish" : : : "memory") > > > > > > > > > > @@ -37,9 +37,9 @@ extern "C" { > > > > > > > > > > #define rte_io_rmb() rte_rmb() > > > > > > > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory") > > > > > +#define rte_cio_wmb() rte_wmb() > > > > > > > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory") > > > > > +#define rte_cio_rmb() rte_rmb() > > > > > > > > > > /*------------------------ 128 bit atomic operations > > > > > -------------------------*/ > > > > > > > > > > -- > > > > > 2.17.1 > > > > > > > > This change showed about 7% performance gain in 
testpmd single core > > > NDR test. > > > > > > I am trying to understand this patch wrt the current DPDK usage model. > > > > > > 1) Is the performance improvement due to the fact that the PMD that you > > > are using for testing is supposed to use the existing rte_cio_* but it was > > > using rte_[rw]mb? > No, it is supposed to use rte_[rw]mb for x86. Why are drivers using rte_[rw]mb in the fastpath? IMO, rte_io_[rw]mb and rte_cio_[rw]mb were created for this purpose. But I understand that on x86 it is mapped to rte_compiler_barrier(). Is that correct from the x86 PoV? @Ananyev, Konstantin @Richardson, Bruce ? For x86: #define rte_io_wmb() rte_compiler_barrier() #define rte_io_rmb() rte_compiler_barrier() #define rte_cio_wmb() rte_compiler_barrier() #define rte_cio_rmb() rte_compiler_barrier() > > > > > This is part of the reason. There are also cases where rte_io_* was used and > > can be relaxed. > > Such as: http://patches.dpdk.org/patch/68162/ > > > > > 2) In my understanding: > > > a) CPU to CPU barrier requirements are addressed by rte_smp_* > > > b) CPU to DMA/Device barrier requirements are addressed by rte_cio_* > > > c) CPU to ANY (CPU or Device) are addressed by rte_[rw]mb > > > > > > If (c) is true then we are violating the DPDK spec with this change. Right? > No, we are not. Essentially, due to the other-multi-copy atomicity behavior of the architecture, we are saying 'DMB OSH*' is enough to achieve (c). Yeah. Probably, from a userspace POV, it should be OK to use "DMB OSH*" to have the barrier between these 4? 1) Global memory (BSS and Data sections), not mapped as a hugepage. 2) Hugepage memory 3) IOVA memory 4) PCI register read/write Do we need to worry about anything else that is specific to DSB? For example, TLB-related flushes etc. If we are taking this path then rte_cio_[rw]mb() has no meaning in DPDK as an abstraction, as it was created for arm64 for this specific purpose. If we can meet all DPDK use cases with DMB OSH then probably we can deprecate rte_cio_wmb to avoid confusion.
> > > > > Developers are still required to use correct barrier APIs for different use cases. > > I think this change mitigates performance penalty when non optimal barrier is > > used. > > > > > This change will not be required if fastpath (CPU to Device) is using > > rte_cio_*. > > > Right? > Yes. It is required when the fastpath uses rte_[rw]mb. > > > > > See 1). Correct usage of rte_cio_* is not the whole. > > For some other use cases, such as barrier between accesses of different > > memory types, we can also use lighter barrier 'dmb'. > > > > > > > > > > > > > > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > >
<snip> > > > > > > > > Subject: [RFC] eal: adjust barriers for IO on Armv8-a > > > > > > > > > > > > Change the barrier APIs for IO to reflect that Armv8-a is > > > > > > other-multi-copy atomicity memory model. > > > > > > > > > > > > Armv8-a memory model has been strengthened to require > > > > > > other-multi-copy atomicity. This property requires memory > > > > > > accesses from an observer to become visible to all other > > > > > > observers simultaneously [3]. This means > > > > > > > > > > > > a) A write arriving at an endpoint shared between multiple CPUs is > > > > > > visible to all CPUs > > > > > > b) A write that is visible to all CPUs is also visible to all other > > > > > > observers in the shareability domain > > > > > > > > > > > > This allows for using cheaper DMB instructions in the place of > > > > > > DSB for devices that are visible to all CPUs (i.e. devices that DPDK > caters to). > > > > > > > > > > > > Please refer to [1], [2] and [3] for more information. > > > > > > > > > > > > [1] > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux > > > > > > .git /c ommit/?i d=22ec71615d824f4f11d38d0e55a88d8956b7e45f > > > > > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q > > > > > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/ > > > > > > > > > > > > Signed-off-by: Honnappa Nagarahalli > > > > > > <honnappa.nagarahalli@arm.com> > > > > > > --- > > > > > > lib/librte_eal/arm/include/rte_atomic_64.h | 10 +++++----- > > > > > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > > > > > > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h > > > > > > b/lib/librte_eal/arm/include/rte_atomic_64.h > > > > > > index 7b7099cdc..e406411bb 100644 > > > > > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h > > > > > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h > > > > > > @@ -19,11 +19,11 @@ extern "C" { #include <rte_compat.h> > > > > > > #include <rte_debug.h> > > > > > > > > > > > > -#define rte_mb() asm 
volatile("dsb sy" : : : "memory") > > > > > > +#define rte_mb() asm volatile("dmb osh" : : : "memory") > > > > > > > > > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory") > > > > > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory") > > > > > > > > > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory") > > > > > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory") > > > > > > > > > > > > #define rte_smp_mb() asm volatile("dmb ish" : : : "memory") > > > > > > > > > > > > @@ -37,9 +37,9 @@ extern "C" { > > > > > > > > > > > > #define rte_io_rmb() rte_rmb() > > > > > > > > > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : > > > > > > "memory") > > > > > > +#define rte_cio_wmb() rte_wmb() > > > > > > > > > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : > > > > > > "memory") > > > > > > +#define rte_cio_rmb() rte_rmb() > > > > > > > > > > > > /*------------------------ 128 bit atomic operations > > > > > > -------------------------*/ > > > > > > > > > > > > -- > > > > > > 2.17.1 > > > > > > > > > > This change showed about 7% performance gain in testpmd single > > > > > core > > > > NDR test. > > > > > > > > I am trying to understand this patch wrt DPDK current usage model? > > > > > > > > 1) Is performance improvement due to the fact that the PMD that > > > > you are using it for testing suppose to use existing rte_cio_* but > > > > it was using rte_[rw]mb? > > No, it is supposed to use rte_[rw]mb for x86. > > Why drivers using rte_[rw]in fastpath, IMO, rte_io_[rw]b and rte_cio_[rw]b > created for this pupose. > > But I understand, in x86 it is mapped to rte_compiler_barrier(). Is it correct > from x86 PoV? > @Ananyev, Konstantin @Richardson, Bruce ? 
> > For x86: > #define rte_io_wmb() rte_compiler_barrier() #define rte_io_rmb() > rte_compiler_barrier() #define rte_cio_wmb() rte_compiler_barrier() #define > rte_cio_rmb() rte_compiler_barrier() We need a barrier API with 'DMB OSH*' for Arm and '*fence' for x86. My understanding is that '*fence' is required when WC memory is used on x86. Also, from the Arm architecture perspective, effectively we are saying that 'DSB' is not required for portable drivers. > > > > > > > > > > This is part of the reason. There are also cases where rte_io_* was > > > used and can be relaxed. > > > Such as: http://patches.dpdk.org/patch/68162/ > > > > > > > 2) In my understanding : > > > > a) CPU to CPU barrier requirements are addressed by rte_smp_* > > > > b) CPU to DMA/Device barrier requirements are addressed by > > > > rte_cio_* > > > > c) CPU to ANY(CPU or Device) are addressed by rte_[rw]mb > > > > > > > > If (c) is true then we are violating the DPDK spec with change. Right? > > No, we are not. Essentially, due to the other-multi-copy atomicity behavior > of the architecture, we are saying 'DMB OSH*' is enough to achieve (c). > > Yeah. Probably from userspace POV it should be OK to use "DMB OSH*" to > have the barrier between 4 of them? > > 1) Global memory (BSS and Data sections), Not mapped as a hugepage. > 2) Hugepage memory > 3) IOVA memory > 4) PCI register read/write > > Do we need to worry about anything else which is specific to DSB? > example, TLB related flush etc. Yes, things like TLB flushes or self-modifying code still need DSB. But my understanding is that we do not have such code in DPDK, and such code would be platform-specific. > > If we are talking this path then rte_cio_[rw]mb() has no meaning in DPDK as > an abstraction as it was created for arm64 for this specific purpose. > If we can meet all DPDK usecse with DMB OSH then probably we can > deprecate rte_cio_wmb to avoid confusion. Agree, rte_cio_*mb is confusing to me. We could deprecate those.
I see Octeon TX/TX2 drivers using rte_*mb. Do you see any issues with this change in those drivers? This is a very fundamental change; we need more feedback from others working with Arm platforms. > > > > > > > > > Developers are still required to use correct barrier APIs for different use > cases. > > > I think this change mitigates performance penalty when non optimal > > > barrier is used. > > > > > > > This change will not be required if fastpath (CPU to Device) is > > > > using > > > rte_cio_*. > > > > Right? > > Yes. It is required when the fastpath uses rte_[rw]mb. > > > > > > > > See 1). Correct usage of rte_cio_* is not the whole. > > > For some other use cases, such as barrier between accesses of > > > different memory types, we can also use lighter barrier 'dmb'. > > > > > > > > > > > > > > > > > > > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > >
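The '*fence'-for-WC-memory point raised above can be sketched in plain C. The helper name and the fallback fence below are illustrative stand-ins, not a DPDK API:

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <xmmintrin.h> /* _mm_sfence() */
#endif

/* Sketch: on x86, a compiler barrier suffices for UC/WB-mapped BARs,
 * but a write-combining (WC) mapping needs a real SFENCE so buffered
 * stores drain before the doorbell write. On other architectures a
 * C11 release fence is used here as a portable stand-in (on Armv8-a
 * it would correspond to "dmb oshst"). */
static inline void io_wmb_wc_safe(void)
{
#if defined(__x86_64__) || defined(__i386__)
	_mm_sfence();
#else
	__atomic_thread_fence(__ATOMIC_RELEASE);
#endif
}
```

This is the distinction the thread circles around: rte_io_wmb() as a compiler barrier is only correct on x86 as long as the BAR is not mapped WC.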
Use c11 atomics with explicit ordering instead of the rte_atomic ops which enforce unnecessary barriers on aarch64. Signed-off-by: Phil Yang <phil.yang@arm.com> --- v3: Split from the patchset: http://patchwork.dpdk.org/cover/68159/ drivers/net/mlx5/mlx5_rxq.c | 2 +- drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++------- drivers/net/mlx5/mlx5_rxtx.h | 2 +- 3 files changed, 11 insertions(+), 9 deletions(-) diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c index dda0073..7f487f1 100644 --- a/drivers/net/mlx5/mlx5_rxq.c +++ b/drivers/net/mlx5/mlx5_rxq.c @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg, memset(_m, 0, sizeof(*buf)); buf->mp = mp; - rte_atomic16_set(&buf->refcnt, 1); + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED); for (j = 0; j != strd_n; ++j) { shinfo = &buf->shinfos[j]; shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index e4106bf..f0eda88 100644 --- a/drivers/net/mlx5/mlx5_rxtx.c +++ b/drivers/net/mlx5/mlx5_rxtx.c @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque) { struct mlx5_mprq_buf *buf = opaque; - if (rte_atomic16_read(&buf->refcnt) == 1) { + if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) { rte_mempool_put(buf->mp, buf); - } else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) { - rte_atomic16_set(&buf->refcnt, 1); + } else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1, + __ATOMIC_RELAXED) == 0)) { + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED); rte_mempool_put(buf->mp, buf); } } @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n) if (consumed_strd == strd_n) { /* Replace WQE only if the buffer is still in use. */ - if (rte_atomic16_read(&buf->refcnt) > 1) { + if (__atomic_load_n(&buf->refcnt, + __ATOMIC_RELAXED) > 1) { mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n); /* Release the old buffer. 
*/ mlx5_mprq_buf_free(buf); @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n) void *buf_addr; /* Increment the refcnt of the whole chunk. */ - rte_atomic16_add_return(&buf->refcnt, 1); - MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <= - strd_n + 1); + __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE); + MLX5_ASSERT(__atomic_load_n(&buf->refcnt, + __ATOMIC_RELAXED) <= strd_n + 1); buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM); /* * MLX5 device doesn't use iova but it is necessary in a diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3 100644 --- a/drivers/net/mlx5/mlx5_rxtx.h +++ b/drivers/net/mlx5/mlx5_rxtx.h @@ -78,7 +78,7 @@ struct rxq_zip { /* Multi-Packet RQ buffer header. */ struct mlx5_mprq_buf { struct rte_mempool *mp; - rte_atomic16_t refcnt; /* Atomically accessed refcnt. */ + uint16_t refcnt; /* Atomically accessed refcnt. */ uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */ struct rte_mbuf_ext_shared_info shinfos[]; /* -- 2.7.4
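The relaxed refcnt scheme in the patch above can be modelled single-threaded with a few lines of C11 atomics. The struct and the in_pool flag are illustrative stand-ins for mlx5_mprq_buf and rte_mempool_put(), not the driver code:

```c
#include <stdint.h>

/* Toy model of the mprq buffer refcnt handling. Relaxed ordering is
 * enough for the counter itself: it only tracks how many segments
 * still reference the chunk. */
struct toy_mprq_buf {
	uint16_t refcnt;
	int in_pool; /* stand-in for rte_mempool_put(buf->mp, buf) */
};

static void toy_buf_attach(struct toy_mprq_buf *buf)
{
	/* Mirrors the patch's add_fetch on attach. */
	__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
}

static void toy_buf_free(struct toy_mprq_buf *buf)
{
	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
		/* Sole owner: no RMW needed, recycle directly. */
		buf->in_pool = 1;
	} else if (__atomic_sub_fetch(&buf->refcnt, 1,
				      __ATOMIC_RELAXED) == 0) {
		/* Last reference dropped: reset for reuse and recycle. */
		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
		buf->in_pool = 1;
	}
}
```

The point of the patch is visible here: none of these operations needs the full barriers that the legacy rte_atomic16_* ops imply on aarch64.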
Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy atomicity memory model. Armv8-a memory model has been strengthened to require other-multi-copy atomicity. This property requires memory accesses from an observer to become visible to all other observers simultaneously [3]. This means a) A write arriving at an endpoint shared between multiple CPUs is visible to all CPUs b) A write that is visible to all CPUs is also visible to all other observers in the shareability domain This allows for using cheaper DMB instructions in the place of DSB for devices that are visible to all CPUs (i.e. devices that DPDK caters to). Please refer to [1], [2] and [3] for more information. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f [2] https://www.youtube.com/watch?v=i6DayghhA8Q [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/ Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Tested-by: Ruifeng Wang <ruifeng.wang@arm.com> --- lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h index 7b7099cdc..e42f69edc 100644 --- a/lib/librte_eal/arm/include/rte_atomic_64.h +++ b/lib/librte_eal/arm/include/rte_atomic_64.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: BSD-3-Clause * Copyright(c) 2015 Cavium, Inc - * Copyright(c) 2019 Arm Limited + * Copyright(c) 2020 Arm Limited */ #ifndef _RTE_ATOMIC_ARM64_H_ @@ -19,11 +19,11 @@ extern "C" { #include <rte_compat.h> #include <rte_debug.h> -#define rte_mb() asm volatile("dsb sy" : : : "memory") +#define rte_mb() asm volatile("dmb osh" : : : "memory") -#define rte_wmb() asm volatile("dsb st" : : : "memory") +#define rte_wmb() asm volatile("dmb oshst" : : : "memory") -#define rte_rmb() asm volatile("dsb ld" : : : "memory") +#define rte_rmb() asm volatile("dmb oshld" : : : "memory") 
#define rte_smp_mb() asm volatile("dmb ish" : : : "memory") @@ -37,9 +37,9 @@ extern "C" { #define rte_io_rmb() rte_rmb() -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory") +#define rte_cio_wmb() rte_wmb() -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory") +#define rte_cio_rmb() rte_rmb() /*------------------------ 128 bit atomic operations -------------------------*/ -- 2.17.1
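The fast-path sequence these barriers protect can be sketched without any DPDK dependency. The device struct below is a hypothetical stand-in, and the C11 release fence plays the role that "dmb oshst" plays after this patch:

```c
#include <stdint.h>

/* Stand-in for a NIC: a descriptor ring in host memory plus a doorbell
 * "register" that would normally live in a PCI BAR. Illustrative only. */
struct toy_nic {
	uint32_t ring[8];
	uint32_t doorbell;
};

static void toy_post(struct toy_nic *nic, uint32_t idx, uint32_t desc)
{
	nic->ring[idx & 7] = desc;            /* 1. write the descriptor */
	/* 2. rte_wmb()/rte_io_wmb() slot: with this patch, "dmb oshst"
	 * on Armv8-a instead of "dsb st"; modelled as a release fence. */
	__atomic_thread_fence(__ATOMIC_RELEASE);
	nic->doorbell = idx;                  /* 3. ring the doorbell    */
}
```

The fence sits on every posted packet, which is why relaxing DSB to DMB here shows up directly in the NDR numbers quoted in the thread.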
Hi Jerin,
You had a comment earlier about deprecating rte_cio_[rw]mb. Let me know if you are ok with this patch and I can add those changes (replace references to rte_cio_[rw]mb with rte_io_[rw]mb and a deprecation notice).
Thanks,
Honnappa
> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Saturday, June 27, 2020 2:12 PM
> To: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Ajit Khaparde (ajit.khaparde@broadcom.com)
> <ajit.khaparde@broadcom.com>; igorch@amazon.com;
> thomas@monjalon.net; viacheslavo@mellanox.com;
> arybchenko@solarflare.com; bruce.richardson@intel.com
> Cc: nd <nd@arm.com>
> Subject: [PATCH v2] eal: adjust barriers for IO on Armv8-a
>
> Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy
> atomicity memory model.
>
> Armv8-a memory model has been strengthened to require other-multi-copy
> atomicity. This property requires memory accesses from an observer to
> become visible to all other observers simultaneously [3]. This means
>
> a) A write arriving at an endpoint shared between multiple CPUs is
> visible to all CPUs
> b) A write that is visible to all CPUs is also visible to all other
> observers in the shareability domain
>
> This allows for using cheaper DMB instructions in the place of DSB for devices
> that are visible to all CPUs (i.e. devices that DPDK caters to).
>
> Please refer to [1], [2] and [3] for more information.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id
> =22ec71615d824f4f11d38d0e55a88d8956b7e45f
> [2] https://www.youtube.com/watch?v=i6DayghhA8Q
> [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
> lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h
> b/lib/librte_eal/arm/include/rte_atomic_64.h
> index 7b7099cdc..e42f69edc 100644
> --- a/lib/librte_eal/arm/include/rte_atomic_64.h
> +++ b/lib/librte_eal/arm/include/rte_atomic_64.h
> @@ -1,6 +1,6 @@
> /* SPDX-License-Identifier: BSD-3-Clause
> * Copyright(c) 2015 Cavium, Inc
> - * Copyright(c) 2019 Arm Limited
> + * Copyright(c) 2020 Arm Limited
> */
>
> #ifndef _RTE_ATOMIC_ARM64_H_
> @@ -19,11 +19,11 @@ extern "C" {
> #include <rte_compat.h>
> #include <rte_debug.h>
>
> -#define rte_mb() asm volatile("dsb sy" : : : "memory")
> +#define rte_mb() asm volatile("dmb osh" : : : "memory")
>
> -#define rte_wmb() asm volatile("dsb st" : : : "memory")
> +#define rte_wmb() asm volatile("dmb oshst" : : : "memory")
>
> -#define rte_rmb() asm volatile("dsb ld" : : : "memory")
> +#define rte_rmb() asm volatile("dmb oshld" : : : "memory")
>
> #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")
>
> @@ -37,9 +37,9 @@ extern "C" {
>
> #define rte_io_rmb() rte_rmb()
>
> -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
> +#define rte_cio_wmb() rte_wmb()
>
> -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
> +#define rte_cio_rmb() rte_rmb()
>
> /*------------------------ 128 bit atomic operations -------------------------*/
>
> --
> 2.17.1
On Sun, Jun 28, 2020 at 12:55 AM Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote: > > Hi Jerin, > You had a comment earlier about deprecating rte_cio_[rw]mb. Let me know if you are ok with this patch and I can add those changes (replace references to rte_cio_[rw]mb with rte_io_[rw]mb and a deprecation notice). Acked-by: Jerin Jacob <jerinj@marvell.com> for this patch Please send the deprecation notice for overlapping rte_cio_* for 20.11 > > Thanks, > Honnappa > > > -----Original Message----- > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > Sent: Saturday, June 27, 2020 2:12 PM > > To: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > > Ruifeng Wang <Ruifeng.Wang@arm.com>; jerinj@marvell.com; > > hemant.agrawal@nxp.com; Ajit Khaparde (ajit.khaparde@broadcom.com) > > <ajit.khaparde@broadcom.com>; igorch@amazon.com; > > thomas@monjalon.net; viacheslavo@mellanox.com; > > arybchenko@solarflare.com; bruce.richardson@intel.com > > Cc: nd <nd@arm.com> > > Subject: [PATCH v2] eal: adjust barriers for IO on Armv8-a > > > > Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy > > atomicity memory model. > > > > Armv8-a memory model has been strengthened to require other-multi-copy > > atomicity. This property requires memory accesses from an observer to > > become visible to all other observers simultaneously [3]. This means > > > > a) A write arriving at an endpoint shared between multiple CPUs is > > visible to all CPUs > > b) A write that is visible to all CPUs is also visible to all other > > observers in the shareability domain > > > > This allows for using cheaper DMB instructions in the place of DSB for devices > > that are visible to all CPUs (i.e. devices that DPDK caters to). > > > > Please refer to [1], [2] and [3] for more information. 
> > > > [1] > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id > > =22ec71615d824f4f11d38d0e55a88d8956b7e45f > > [2] https://www.youtube.com/watch?v=i6DayghhA8Q > > [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/ > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > > Tested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > --- > > lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------ > > 1 file changed, 6 insertions(+), 6 deletions(-) > > > > diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h > > b/lib/librte_eal/arm/include/rte_atomic_64.h > > index 7b7099cdc..e42f69edc 100644 > > --- a/lib/librte_eal/arm/include/rte_atomic_64.h > > +++ b/lib/librte_eal/arm/include/rte_atomic_64.h > > @@ -1,6 +1,6 @@ > > /* SPDX-License-Identifier: BSD-3-Clause > > * Copyright(c) 2015 Cavium, Inc > > - * Copyright(c) 2019 Arm Limited > > + * Copyright(c) 2020 Arm Limited > > */ > > > > #ifndef _RTE_ATOMIC_ARM64_H_ > > @@ -19,11 +19,11 @@ extern "C" { > > #include <rte_compat.h> > > #include <rte_debug.h> > > > > -#define rte_mb() asm volatile("dsb sy" : : : "memory") > > +#define rte_mb() asm volatile("dmb osh" : : : "memory") > > > > -#define rte_wmb() asm volatile("dsb st" : : : "memory") > > +#define rte_wmb() asm volatile("dmb oshst" : : : "memory") > > > > -#define rte_rmb() asm volatile("dsb ld" : : : "memory") > > +#define rte_rmb() asm volatile("dmb oshld" : : : "memory") > > > > #define rte_smp_mb() asm volatile("dmb ish" : : : "memory") > > > > @@ -37,9 +37,9 @@ extern "C" { > > > > #define rte_io_rmb() rte_rmb() > > > > -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory") > > +#define rte_cio_wmb() rte_wmb() > > > > -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory") > > +#define rte_cio_rmb() rte_rmb() > > > > /*------------------------ 128 bit atomic operations -------------------------*/ > > > > -- > > 2.17.1 >
Change the barrier APIs for IO to reflect that Armv8-a is other-multi-copy atomicity memory model. Armv8-a memory model has been strengthened to require other-multi-copy atomicity. This property requires memory accesses from an observer to become visible to all other observers simultaneously [3]. This means a) A write arriving at an endpoint shared between multiple CPUs is visible to all CPUs b) A write that is visible to all CPUs is also visible to all other observers in the shareability domain This allows for using cheaper DMB instructions in the place of DSB for devices that are visible to all CPUs (i.e. devices that DPDK caters to). Please refer to [1], [2] and [3] for more information. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f [2] https://www.youtube.com/watch?v=i6DayghhA8Q [3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/ Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Acked-by: Jerin Jacob <jerinj@marvell.com> Tested-by: Ruifeng Wang <ruifeng.wang@arm.com> --- lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h index 7b7099cdc..e42f69edc 100644 --- a/lib/librte_eal/arm/include/rte_atomic_64.h +++ b/lib/librte_eal/arm/include/rte_atomic_64.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: BSD-3-Clause * Copyright(c) 2015 Cavium, Inc - * Copyright(c) 2019 Arm Limited + * Copyright(c) 2020 Arm Limited */ #ifndef _RTE_ATOMIC_ARM64_H_ @@ -19,11 +19,11 @@ extern "C" { #include <rte_compat.h> #include <rte_debug.h> -#define rte_mb() asm volatile("dsb sy" : : : "memory") +#define rte_mb() asm volatile("dmb osh" : : : "memory") -#define rte_wmb() asm volatile("dsb st" : : : "memory") +#define rte_wmb() asm volatile("dmb oshst" : : : "memory") -#define rte_rmb() asm volatile("dsb ld" : : : "memory") +#define rte_rmb() asm 
volatile("dmb oshld" : : : "memory") #define rte_smp_mb() asm volatile("dmb ish" : : : "memory") @@ -37,9 +37,9 @@ extern "C" { #define rte_io_rmb() rte_rmb() -#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory") +#define rte_cio_wmb() rte_wmb() -#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory") +#define rte_cio_rmb() rte_rmb() /*------------------------ 128 bit atomic operations -------------------------*/ -- 2.17.1
Updated the use of DMB instruction in rte_*mb APIs for Armv8-a. Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> --- doc/guides/rel_notes/release_20_08.rst | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst index 5cbc4ce14..15c21996d 100644 --- a/doc/guides/rel_notes/release_20_08.rst +++ b/doc/guides/rel_notes/release_20_08.rst @@ -56,6 +56,13 @@ New Features Also, make sure to start the actual text at the margin. ========================================================= +* **rte_*mb APIs are updated to use DMB instruction.** + + Armv8-a memory model has been strengthened to require other-multi-copy + atomicity. This allows for using DMB instruction instead of DSB for IO + barriers. rte_*mb APIs, for Armv8-a platforms, are changed to use DMB + instruction to reflect this. + * **Updated PCAP driver.** Updated PCAP driver with new features and improvements, including: -- 2.17.1
rte_cio_*mb APIs will be deprecated in the 20.11 release. Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> --- doc/guides/rel_notes/deprecation.rst | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst index d1034f60f..59656da3d 100644 --- a/doc/guides/rel_notes/deprecation.rst +++ b/doc/guides/rel_notes/deprecation.rst @@ -40,6 +40,12 @@ Deprecation Notices These wrappers must be used for patches that need to be merged in 20.08 onwards. This change will not introduce any performance degradation. +* rte_cio_*mb: Since the IO barriers for Armv8-a platforms are relaxed from DSB + to DMB, the rte_cio_*mb APIs provide the same functionality as the rte_io_*mb + APIs (taking all platforms into consideration). The rte_io_*mb APIs should be + used in place of the rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated + in the 20.11 release. + * igb_uio: In the view of reducing the kernel dependency from the main tree, as a first step, the Technical Board decided to move ``igb_uio`` kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory -- 2.17.1
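Once this notice lands, the mechanical part of the driver conversion could be sketched as below. The helper is hypothetical, and the rename is only a drop-in because, per this series, rte_cio_*mb and rte_io_*mb now provide the same functionality:

```shell
# Hypothetical migration helper: rewrite rte_cio_[rw]mb calls to
# rte_io_[rw]mb. Reads file arguments (or stdin) and prints the
# rewritten text; review the result per file before committing.
migrate_cio() {
    sed 's/rte_cio_\([rw]\)mb/rte_io_\1mb/g' "$@"
}
```

In-place conversion would pass `-i` to sed; the non-destructive form above is safer for review.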
On Sat, Jul 4, 2020 at 12:28 AM Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote: > > Updated the use of DMB instruction in rte_*mb APIs for Armv8-a. > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> > --- > doc/guides/rel_notes/release_20_08.rst | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst > index 5cbc4ce14..15c21996d 100644 > --- a/doc/guides/rel_notes/release_20_08.rst > +++ b/doc/guides/rel_notes/release_20_08.rst > @@ -56,6 +56,13 @@ New Features > Also, make sure to start the actual text at the margin. > ========================================================= > > +* **rte_*mb APIs are updated to use DMB instruction.** IMO, It is better to change to following as the end user can ignore parsing the description if not interested in arm64. rte_*mb APIs are updated to use DMB instruction for Armv8-a With above change: Acked-by: Jerin Jacob <jerinj@marvell.com> > + > + Armv8-a memory model has been strengthened to require other-multi-copy > + atomicity. This allows for using DMB instruction instead of DSB for IO > + barriers. rte_*mb APIs, for Armv8-a platforms, are changed to use DMB > + instruction to reflect this. > + > * **Updated PCAP driver.** > > Updated PCAP driver with new features and improvements, including: > -- > 2.17.1 >
On Sat, Jul 4, 2020 at 12:28 AM Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote: > > rte_cio_*mb APIs will be deprecated in 20.11 release. > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Acked-by: Jerin Jacob <jerinj@marvell.com> > --- > doc/guides/rel_notes/deprecation.rst | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst > index d1034f60f..59656da3d 100644 > --- a/doc/guides/rel_notes/deprecation.rst > +++ b/doc/guides/rel_notes/deprecation.rst > @@ -40,6 +40,12 @@ Deprecation Notices > These wrappers must be used for patches that need to be merged in 20.08 > onwards. This change will not introduce any performance degradation. > > +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB > + to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb > + APIs(taking all platforms into consideration). rte_io_*mb APIs should be used > + in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in > + 20.11 release. > + > * igb_uio: In the view of reducing the kernel dependency from the main tree, > as a first step, the Technical Board decided to move ``igb_uio`` > kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory > -- > 2.17.1 >
Change the barrier APIs for IO to reflect that the Armv8-a memory model is
other-multi-copy atomic. The Armv8-a memory model has been strengthened to
require other-multi-copy atomicity. This property requires memory accesses
from an observer to become visible to all other observers simultaneously [3].
This means

a) A write arriving at an endpoint shared between multiple CPUs is visible
   to all CPUs
b) A write that is visible to all CPUs is also visible to all other
   observers in the shareability domain

This allows the cheaper DMB instruction to be used in place of DSB for
devices that are visible to all CPUs (i.e. the devices that DPDK caters to).

Please refer to [1], [2] and [3] for more information.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
[2] https://www.youtube.com/watch?v=i6DayghhA8Q
[3] https://www.cl.cam.ac.uk/~pes20/armv8-mca/

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
Tested-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_eal/arm/include/rte_atomic_64.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/arm/include/rte_atomic_64.h b/lib/librte_eal/arm/include/rte_atomic_64.h
index 7b7099cdc..e42f69edc 100644
--- a/lib/librte_eal/arm/include/rte_atomic_64.h
+++ b/lib/librte_eal/arm/include/rte_atomic_64.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2015 Cavium, Inc
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2020 Arm Limited
  */

 #ifndef _RTE_ATOMIC_ARM64_H_
@@ -19,11 +19,11 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_debug.h>

-#define rte_mb() asm volatile("dsb sy" : : : "memory")
+#define rte_mb() asm volatile("dmb osh" : : : "memory")

-#define rte_wmb() asm volatile("dsb st" : : : "memory")
+#define rte_wmb() asm volatile("dmb oshst" : : : "memory")

-#define rte_rmb() asm volatile("dsb ld" : : : "memory")
+#define rte_rmb() asm volatile("dmb oshld" : : : "memory")

 #define rte_smp_mb() asm volatile("dmb ish" : : : "memory")

@@ -37,9 +37,9 @@ extern "C" {

 #define rte_io_rmb() rte_rmb()

-#define rte_cio_wmb() asm volatile("dmb oshst" : : : "memory")
+#define rte_cio_wmb() rte_wmb()

-#define rte_cio_rmb() asm volatile("dmb oshld" : : : "memory")
+#define rte_cio_rmb() rte_rmb()

 /*------------------------ 128 bit atomic operations -------------------------*/
--
2.17.1
Updated the use of DMB instruction in rte_*mb APIs for Armv8-a.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 doc/guides/rel_notes/release_20_08.rst | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
index 5cbc4ce14..567ae6b2a 100644
--- a/doc/guides/rel_notes/release_20_08.rst
+++ b/doc/guides/rel_notes/release_20_08.rst
@@ -56,6 +56,13 @@ New Features
   Also, make sure to start the actual text at the margin.
   =========================================================

+* **rte_*mb APIs are updated to use DMB instruction for Armv8-a.**
+
+  Armv8-a memory model has been strengthened to require other-multi-copy
+  atomicity. This allows for using DMB instruction instead of DSB for IO
+  barriers. rte_*mb APIs, for Armv8-a platforms, are changed to use DMB
+  instruction to reflect this.
+
 * **Updated PCAP driver.**

   Updated PCAP driver with new features and improvements, including:
--
2.17.1
rte_cio_*mb APIs will be deprecated in 20.11 release.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
 doc/guides/rel_notes/deprecation.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index d1034f60f..59656da3d 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -40,6 +40,12 @@ Deprecation Notices
   These wrappers must be used for patches that need to be merged in 20.08
   onwards. This change will not introduce any performance degradation.

+* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
+  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
+  APIs(taking all platforms into consideration). rte_io_*mb APIs should be used
+  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
+  20.11 release.
+
 * igb_uio: In the view of reducing the kernel dependency from the main tree,
   as a first step, the Technical Board decided to move ``igb_uio``
   kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory
--
2.17.1
On Tue, Jul 7, 2020 at 1:44 AM Honnappa Nagarahalli
<honnappa.nagarahalli@arm.com> wrote:
>
> Updated the use of DMB instruction in rte_*mb APIs for Armv8-a.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
>  doc/guides/rel_notes/release_20_08.rst | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/doc/guides/rel_notes/release_20_08.rst b/doc/guides/rel_notes/release_20_08.rst
> index 5cbc4ce14..567ae6b2a 100644
> --- a/doc/guides/rel_notes/release_20_08.rst
> +++ b/doc/guides/rel_notes/release_20_08.rst
> @@ -56,6 +56,13 @@ New Features
>    Also, make sure to start the actual text at the margin.
>    =========================================================
>

This release note update will be squashed with the change itself in patch 1.

> +* **rte_*mb APIs are updated to use DMB instruction for Armv8-a.**

We use "ARMv8" in the release notes, any objection if I update this
when applying?

> +
> +  Armv8-a memory model has been strengthened to require other-multi-copy
> +  atomicity. This allows for using DMB instruction instead of DSB for IO
> +  barriers. rte_*mb APIs, for Armv8-a platforms, are changed to use DMB
> +  instruction to reflect this.
> +
> * **Updated PCAP driver.**
>
>   Updated PCAP driver with new features and improvements, including:
> --
> 2.17.1
>


--
David Marchand
On Tue, Jul 7, 2020 at 1:44 AM Honnappa Nagarahalli
<honnappa.nagarahalli@arm.com> wrote:
>
> rte_cio_*mb APIs will be deprecated in 20.11 release.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
>  doc/guides/rel_notes/deprecation.rst | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index d1034f60f..59656da3d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -40,6 +40,12 @@ Deprecation Notices
>    These wrappers must be used for patches that need to be merged in 20.08
>    onwards. This change will not introduce any performance degradation.
>
> +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
> +  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
> +  APIs(taking all platforms into consideration). rte_io_*mb APIs should be used

Nit: missing space.

> +  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
> +  20.11 release.
> +
> * igb_uio: In the view of reducing the kernel dependency from the main tree,
>   as a first step, the Technical Board decided to move ``igb_uio``
>   kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory
> --
> 2.17.1
>

LGTM.
We need 3 acks (ideally from different vendors/companies) for a
deprecation notice.
Please maintainers?
Thanks.


--
David Marchand
<snip>
> Subject: Re: [dpdk-dev] [PATCH v4 2/3] doc: update armv8-a IO barrier
> changes
>
> On Tue, Jul 7, 2020 at 1:44 AM Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com> wrote:
> >
> > Updated the use of DMB instruction in rte_*mb APIs for Armv8-a.
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Acked-by: Jerin Jacob <jerinj@marvell.com>
> > ---
> >  doc/guides/rel_notes/release_20_08.rst | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/doc/guides/rel_notes/release_20_08.rst
> > b/doc/guides/rel_notes/release_20_08.rst
> > index 5cbc4ce14..567ae6b2a 100644
> > --- a/doc/guides/rel_notes/release_20_08.rst
> > +++ b/doc/guides/rel_notes/release_20_08.rst
> > @@ -56,6 +56,13 @@ New Features
> >    Also, make sure to start the actual text at the margin.
> >    =========================================================
> >
>
> This release note update will be squashed with the change itself in patch 1.
>
> > +* **rte_*mb APIs are updated to use DMB instruction for Armv8-a.**
>
> We use "ARMv8" in the release notes, any objection if I update this when
> applying?
No objections.

>
> > +
> > +  Armv8-a memory model has been strengthened to require
> > + other-multi-copy atomicity. This allows for using DMB instruction
> > + instead of DSB for IO barriers. rte_*mb APIs, for Armv8-a
> > + platforms, are changed to use DMB instruction to reflect this.
> > +
> > * **Updated PCAP driver.**
> >
> >   Updated PCAP driver with new features and improvements, including:
> > --
> > 2.17.1
> >
>
>
> --
> David Marchand
On 7/6/20 4:43 PM, Honnappa Nagarahalli wrote:
> rte_cio_*mb APIs will be deprecated in 20.11 release.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
> ---
> doc/guides/rel_notes/deprecation.rst | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index d1034f60f..59656da3d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -40,6 +40,12 @@ Deprecation Notices
> These wrappers must be used for patches that need to be merged in 20.08
> onwards. This change will not introduce any performance degradation.
>
> +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
> + to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
> + APIs(taking all platforms into consideration). rte_io_*mb APIs should be used
> + in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
> + 20.11 release.
> +
No difference between rte_cio_* and rte_io_* macros on PPC.
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>
On Fri, Jul 3, 2020 at 11:58 AM Honnappa Nagarahalli <
honnappa.nagarahalli@arm.com> wrote:

> rte_cio_*mb APIs will be deprecated in 20.11 release.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>

> ---
>  doc/guides/rel_notes/deprecation.rst | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index d1034f60f..59656da3d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -40,6 +40,12 @@ Deprecation Notices
>    These wrappers must be used for patches that need to be merged in 20.08
>    onwards. This change will not introduce any performance degradation.
>
> +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed
> from DSB
> +  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
> +  APIs(taking all platforms into consideration). rte_io_*mb APIs should
> be used
> +  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be
> deprecated in
> +  20.11 release.
> +
> * igb_uio: In the view of reducing the kernel dependency from the main
> tree,
>   as a first step, the Technical Board decided to move ``igb_uio``
>   kernel module to the dpdk-kmods repository in the /linux/igb_uio/
> directory
> --
> 2.17.1
>
>
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Honnappa Nagarahalli
> Sent: Friday, July 3, 2020 7:58 PM
> To: dev@dpdk.org; honnappa.nagarahalli@arm.com; ruifeng.wang@arm.com;
> jerinj@marvell.com; hemant.agrawal@nxp.com; ajit.khaparde@broadcom.com;
> igorch@amazon.com; thomas@monjalon.net; viacheslavo@mellanox.com;
> arybchenko@solarflare.com; Richardson, Bruce <bruce.richardson@intel.com>
> Cc: nd@arm.com
> Subject: [dpdk-dev] [PATCH v3 3/3] doc: update deprecation of CIO barrier APIs
>
> rte_cio_*mb APIs will be deprecated in 20.11 release.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  doc/guides/rel_notes/deprecation.rst | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index d1034f60f..59656da3d 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -40,6 +40,12 @@ Deprecation Notices
>    These wrappers must be used for patches that need to be merged in 20.08
>    onwards. This change will not introduce any performance degradation.
>
> +* rte_cio_*mb: Since the IO barriers for ArmV8-a platforms are relaxed from DSB
> +  to DMB, rte_cio_*mb APIs provide the same functionality as rte_io_*mb
> +  APIs(taking all platforms into consideration). rte_io_*mb APIs should be used
> +  in the place of rte_cio_*mb APIs. The rte_cio_*mb APIs will be deprecated in
> +  20.11 release.
> +
> * igb_uio: In the view of reducing the kernel dependency from the main tree,
>   as a first step, the Technical Board decided to move ``igb_uio``
>   kernel module to the dpdk-kmods repository in the /linux/igb_uio/ directory
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.17.1
On Tue, Jul 7, 2020 at 1:44 AM Honnappa Nagarahalli
<honnappa.nagarahalli@arm.com> wrote:
>
> rte_cio_*mb APIs will be deprecated in 20.11 release.
>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Jerin Jacob <jerinj@marvell.com>
Reviewed-by: David Christensen <drc@linux.vnet.ibm.com>
Acked-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Series applied, thanks.
--
David Marchand
Hi,
We are also doing C11 atomics converting for other components.
Your insight would be much appreciated.
Thanks,
Phil Yang
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> Sent: Tuesday, June 23, 2020 4:27 PM
> To: dev@dpdk.org
> Cc: matan@mellanox.com; shahafs@mellanox.com;
> viacheslavo@mellanox.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd
> <nd@arm.com>
> Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet
> RQ buffer refcnt
>
> Use c11 atomics with explicit ordering instead of the rte_atomic ops
> which enforce unnecessary barriers on aarch64.
>
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> ---
> v3:
> Split from the patchset:
> http://patchwork.dpdk.org/cover/68159/
>
> drivers/net/mlx5/mlx5_rxq.c | 2 +-
> drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> drivers/net/mlx5/mlx5_rxtx.h | 2 +-
> 3 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> index dda0073..7f487f1 100644
> --- a/drivers/net/mlx5/mlx5_rxq.c
> +++ b/drivers/net/mlx5/mlx5_rxq.c
> @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp,
> void *opaque_arg,
>
> memset(_m, 0, sizeof(*buf));
> buf->mp = mp;
> - rte_atomic16_set(&buf->refcnt, 1);
> + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> for (j = 0; j != strd_n; ++j) {
> shinfo = &buf->shinfos[j];
> shinfo->free_cb = mlx5_mprq_buf_free_cb;
> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> index e4106bf..f0eda88 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr
> __rte_unused, void *opaque)
> {
> struct mlx5_mprq_buf *buf = opaque;
>
> - if (rte_atomic16_read(&buf->refcnt) == 1) {
> + if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> rte_mempool_put(buf->mp, buf);
> - } else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> - rte_atomic16_set(&buf->refcnt, 1);
> + } else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> + __ATOMIC_RELAXED) == 0)) {
> + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> rte_mempool_put(buf->mp, buf);
> }
> }
> @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> rte_mbuf **pkts, uint16_t pkts_n)
>
> if (consumed_strd == strd_n) {
> /* Replace WQE only if the buffer is still in use. */
> - if (rte_atomic16_read(&buf->refcnt) > 1) {
> + if (__atomic_load_n(&buf->refcnt,
> + __ATOMIC_RELAXED) > 1) {
> mprq_buf_replace(rxq, rq_ci & wq_mask,
> strd_n);
> /* Release the old buffer. */
> mlx5_mprq_buf_free(buf);
> @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> rte_mbuf **pkts, uint16_t pkts_n)
> void *buf_addr;
>
> /* Increment the refcnt of the whole chunk. */
> - rte_atomic16_add_return(&buf->refcnt, 1);
> - MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> >refcnt) <=
> - strd_n + 1);
> + __atomic_add_fetch(&buf->refcnt, 1,
> __ATOMIC_ACQUIRE);
> + MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> + __ATOMIC_RELAXED) <= strd_n + 1);
> buf_addr = RTE_PTR_SUB(addr,
> RTE_PKTMBUF_HEADROOM);
> /*
> * MLX5 device doesn't use iova but it is necessary in
> a
> diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> index 26621ff..0fc15f3 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.h
> +++ b/drivers/net/mlx5/mlx5_rxtx.h
> @@ -78,7 +78,7 @@ struct rxq_zip {
> /* Multi-Packet RQ buffer header. */
> struct mlx5_mprq_buf {
> struct rte_mempool *mp;
> - rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> + uint16_t refcnt; /* Atomically accessed refcnt. */
> uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first
> packet. */
> struct rte_mbuf_ext_shared_info shinfos[];
> /*
> --
> 2.7.4
Hi Phil Yang, we noticed that this patch gives us 10% of performance degradation on ARM.
x86 seems to be unaffected though. Do you know what may be the reason of this behavior?
Regards,
Alex
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> Sent: Sunday, July 12, 2020 23:02
> To: Matan Azrad <matan@mellanox.com>; Shahaf Shuler
> <shahafs@mellanox.com>; Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> drc@linux.vnet.ibm.com; nd <nd@arm.com>; Phil Yang <Phil.Yang@arm.com>;
> dev@dpdk.org; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-packet
> RQ buffer refcnt
>
> Hi,
>
> We are also doing C11 atomics converting for other components.
> Your insight would be much appreciated.
>
> Thanks,
> Phil Yang
>
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> > Sent: Tuesday, June 23, 2020 4:27 PM
> > To: dev@dpdk.org
> > Cc: matan@mellanox.com; shahafs@mellanox.com;
> > viacheslavo@mellanox.com; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd
> > <nd@arm.com>
> > Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > multi-packet RQ buffer refcnt
> >
> > Use c11 atomics with explicit ordering instead of the rte_atomic ops
> > which enforce unnecessary barriers on aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > ---
> > v3:
> > Split from the patchset:
> > https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpatch
> >
> work.dpdk.org%2Fcover%2F68159%2F&data=02%7C01%7Cakozyrev%40m
> ellano
> >
> x.com%7C1e3dc839a3604924fdf208d826d934ad%7Ca652971c7d2e4d9ba6a4d1
> 49256
> >
> f461b%7C0%7C0%7C637302061620808255&sdata=mRXbgPi6HyrVtP04Vl7
> Bx8lD0
> > trVP7noQlpOD7gBoTQ%3D&reserved=0
> >
> > drivers/net/mlx5/mlx5_rxq.c | 2 +-
> > drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> > drivers/net/mlx5/mlx5_rxtx.h | 2 +-
> > 3 files changed, 11 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> > index dda0073..7f487f1 100644
> > --- a/drivers/net/mlx5/mlx5_rxq.c
> > +++ b/drivers/net/mlx5/mlx5_rxq.c
> > @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void
> > *opaque_arg,
> >
> > memset(_m, 0, sizeof(*buf));
> > buf->mp = mp;
> > - rte_atomic16_set(&buf->refcnt, 1);
> > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > for (j = 0; j != strd_n; ++j) {
> > shinfo = &buf->shinfos[j];
> > shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git
> > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index
> > e4106bf..f0eda88 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx.c
> > +++ b/drivers/net/mlx5/mlx5_rxtx.c
> > @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr
> __rte_unused,
> > void *opaque) {
> > struct mlx5_mprq_buf *buf = opaque;
> >
> > - if (rte_atomic16_read(&buf->refcnt) == 1) {
> > + if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> > rte_mempool_put(buf->mp, buf);
> > - } else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> > - rte_atomic16_set(&buf->refcnt, 1);
> > + } else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> > + __ATOMIC_RELAXED) == 0)) {
> > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > rte_mempool_put(buf->mp, buf);
> > }
> > }
> > @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > rte_mbuf **pkts, uint16_t pkts_n)
> >
> > if (consumed_strd == strd_n) {
> > /* Replace WQE only if the buffer is still in use. */
> > - if (rte_atomic16_read(&buf->refcnt) > 1) {
> > + if (__atomic_load_n(&buf->refcnt,
> > + __ATOMIC_RELAXED) > 1) {
> > mprq_buf_replace(rxq, rq_ci & wq_mask,
> strd_n);
> > /* Release the old buffer. */
> > mlx5_mprq_buf_free(buf);
> > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > rte_mbuf **pkts, uint16_t pkts_n)
> > void *buf_addr;
> >
> > /* Increment the refcnt of the whole chunk. */
> > - rte_atomic16_add_return(&buf->refcnt, 1);
> > - MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > >refcnt) <=
> > - strd_n + 1);
> > + __atomic_add_fetch(&buf->refcnt, 1,
> > __ATOMIC_ACQUIRE);
> > + MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > + __ATOMIC_RELAXED) <= strd_n + 1);
> > buf_addr = RTE_PTR_SUB(addr,
> > RTE_PKTMBUF_HEADROOM);
> > /*
> > * MLX5 device doesn't use iova but it is necessary in a
> diff
> > --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > index 26621ff..0fc15f3 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > @@ -78,7 +78,7 @@ struct rxq_zip {
> > /* Multi-Packet RQ buffer header. */
> > struct mlx5_mprq_buf {
> > struct rte_mempool *mp;
> > - rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> > + uint16_t refcnt; /* Atomically accessed refcnt. */
> > uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first
> packet.
> > */
> > struct rte_mbuf_ext_shared_info shinfos[];
> > /*
> > --
> > 2.7.4
Alexander Kozyrev <akozyrev@mellanox.com> writes:

<...>
> Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for multi-
> packet RQ buffer refcnt
>
> Hi Phil Yang, we noticed that this patch gives us 10% of performance
> degradation on ARM.
> x86 seems to be unaffected though. Do you know what may be the reason
> of this behavior?

Hi Alexander,

Thanks for your feedback.
This patch removed some expensive memory barriers on aarch64, it should get
better performance.
I am not sure the root cause of this degradation now, I will start the
investigation. We can profiling this issue together.
Could you share your test case(including your testbed configuration) with us?

Thanks,
Phil

>
> Regards,
> Alex
>
<snip>
> Phil Yang <Phil.Yang@arm.com> writes:
>
> <...>
> > Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > multi- packet RQ buffer refcnt
> >
> > Hi Phil Yang, we noticed that this patch gives us 10% of performance
> > degradation on ARM.
> > x86 seems to be unaffected though. Do you know what may be the reason
> > of this behavior?
>
> Hi Alexander,
>
> Thanks for your feedback.
> This patch removed some expensive memory barriers on aarch64, it should get
> better performance.
> I am not sure the root cause of this degradation now, I will start the
> investigation. We can profiling this issue together.
> Could you share your test case(including your testbed configuration) with us?
>
> Thanks,
> Phil
I'm surprised too, Phil, but looks like it is actually making things worse. I used Connect-X 6DX on aarch64:
Linux dragon71-bf 5.4.31-mlnx.15.ge938819 #1 SMP PREEMPT Thu Jul 2 17:01:15 IDT 2020 aarch64 aarch64 aarch64 GNU/Linux
Traffic generator sends 60 bytes packets and DUT executes the following command:
arm64-bluefield-linuxapp-gcc/build/app/test-pmd/testpmd -n 4 -w 0000:03:00.1,mprq_en=1,rxqs_min_mprq=1 -w 0000:03:00.0,mprq_en=1,rxqs_min_mprq=1 -c 0xe -- --burst=64 --mbcache=512 -i --nb-cores=1 --rxq=1 --txq=1 --txd=256 --rxd=256 --auto-start --rss-udp
Without a patch I'm getting 3.2mpps, and only 2.9mpps when the patch is applied.
Regards,
Alex
<snip>
> > > Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > multi- packet RQ buffer refcnt
> > >
> > > Hi Phil Yang, we noticed that this patch gives us 10% of performance
> > > degradation on ARM.
> > > x86 seems to be unaffected though. Do you know what may be the
> > > reason of this behavior?
> >
> > Hi Alexander,
> >
> > Thanks for your feedback.
> > This patch removed some expensive memory barriers on aarch64, it
> > should get better performance.
> > I am not sure the root cause of this degradation now, I will start the
> > investigation. We can profiling this issue together.
> > Could you share your test case(including your testbed configuration) with
> us?
> >
> > Thanks,
> > Phil
>
> I'm surprised too, Phil, but looks like it is actually making things worse. I used
> Connect-X 6DX on aarch64:
> Linux dragon71-bf 5.4.31-mlnx.15.ge938819 #1 SMP PREEMPT Thu Jul 2
> 17:01:15 IDT 2020 aarch64 aarch64 aarch64 GNU/Linux Traffic generator
> sends 60 bytes packets and DUT executes the following command:
> arm64-bluefield-linuxapp-gcc/build/app/test-pmd/testpmd -n 4 -w
> 0000:03:00.1,mprq_en=1,rxqs_min_mprq=1 -w
> 0000:03:00.0,mprq_en=1,rxqs_min_mprq=1 -c 0xe -- --burst=64 --
> mbcache=512 -i --nb-cores=1 --rxq=1 --txq=1 --txd=256 --rxd=256 --auto-
> start --rss-udp Without a patch I'm getting 3.2mpps, and only 2.9mpps when
> the patch is applied.
You are running on A72 cores, is that correct?

>
> Regards,
> Alex
<snip>
>
> > > > Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > > multi- packet RQ buffer refcnt
> > > >
> > > > Hi Phil Yang, we noticed that this patch gives us 10% of
> > > > performance degradation on ARM.
> > > > x86 seems to be unaffected though. Do you know what may be the
> > > > reason of this behavior?
> > >
> > > Hi Alexander,
> > >
> > > Thanks for your feedback.
> > > This patch removed some expensive memory barriers on aarch64, it
> > > should get better performance.
> > > I am not sure the root cause of this degradation now, I will start
> > > the investigation. We can profiling this issue together.
> > > Could you share your test case(including your testbed configuration)
> > > with
> > us?
> > >
> > > Thanks,
> > > Phil
> >
> > I'm surprised too, Phil, but looks like it is actually making things
> > worse. I used Connect-X 6DX on aarch64:
> > Linux dragon71-bf 5.4.31-mlnx.15.ge938819 #1 SMP PREEMPT Thu Jul 2
> > 17:01:15 IDT 2020 aarch64 aarch64 aarch64 GNU/Linux Traffic generator
> > sends 60 bytes packets and DUT executes the following command:
> > arm64-bluefield-linuxapp-gcc/build/app/test-pmd/testpmd -n 4 -w
> > 0000:03:00.1,mprq_en=1,rxqs_min_mprq=1 -w
> > 0000:03:00.0,mprq_en=1,rxqs_min_mprq=1 -c 0xe -- --burst=64 --
> > mbcache=512 -i --nb-cores=1 --rxq=1 --txq=1 --txd=256 --rxd=256
> > --auto- start --rss-udp Without a patch I'm getting 3.2mpps, and only
> > 2.9mpps when the patch is applied.
> You are running on A72 cores, is that correct?
Correct, cat /proc/cpuinfo
processor : 0
BogoMIPS : 312.50
Features : fp asimd evtstrm crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd08
CPU revision : 3
Alexander Kozyrev <akozyrev@mellanox.com> writes: <snip> > > > > > > > Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for > > > > > multi- packet RQ buffer refcnt > > > > > > > > > > Hi Phil Yang, we noticed that this patch gives us 10% of > > > > > performance degradation on ARM. > > > > > x86 seems to be unaffected though. Do you know what may be the > > > > > reason of this behavior? > > > > > > > > Hi Alexander, > > > > > > > > Thanks for your feedback. > > > > This patch removed some expensive memory barriers on aarch64, it > > > > should get better performance. > > > > I am not sure the root cause of this degradation now, I will start > > > > the investigation. We can profiling this issue together. > > > > Could you share your test case(including your testbed configuration) > > > > with > > > us? <...> > > > > > > I'm surprised too, Phil, but looks like it is actually making things > > > worse. I used Connect-X 6DX on aarch64: > > > Linux dragon71-bf 5.4.31-mlnx.15.ge938819 #1 SMP PREEMPT Thu Jul 2 > > > 17:01:15 IDT 2020 aarch64 aarch64 aarch64 GNU/Linux Traffic generator > > > sends 60 bytes packets and DUT executes the following command: > > > arm64-bluefield-linuxapp-gcc/build/app/test-pmd/testpmd -n 4 -w > > > 0000:03:00.1,mprq_en=1,rxqs_min_mprq=1 -w > > > 0000:03:00.0,mprq_en=1,rxqs_min_mprq=1 -c 0xe -- --burst=64 -- > > > mbcache=512 -i --nb-cores=1 --rxq=1 --txq=1 --txd=256 --rxd=256 > > > --auto- start --rss-udp Without a patch I'm getting 3.2mpps, and only > > > 2.9mpps when the patch is applied. > > You are running on A72 cores, is that correct? > > Correct, cat /proc/cpuinfo > processor : 0 > BogoMIPS : 312.50 > Features : fp asimd evtstrm crc32 cpuid > CPU implementer : 0x41 > CPU architecture: 8 > CPU variant : 0x0 > CPU part : 0xd08 > CPU revision : 3 Thanks a lot for your input, Alex. With your test command line, I remeasured this patch on two different aarch64 machines and both got some performance improvement. SOC#1. 
On Thunderx2 (with LSE support), I see 7.6% performance improvement on throughput.
NIC: ConnectX-6 / driver: mlx5_core / version: 5.0-1.0.0.0 / firmware-version: 20.27.1016 (MT_0000000224)

SOC#2.
On N1SDP (I disabled LSE to generate A72-alike instructions), I also see a slight (about 1%~2%) performance improvement on throughput.
NIC: ConnectX-5 / driver: mlx5_core / version: 5.0-2.1.8 / firmware-version: 16.27.2008 (MT_0000000090)

Without LSE (i.e., the A72 and SOC#2 cases), it uses the 'Exclusive' mechanism to achieve atomicity. For example, it generates the instructions below for __atomic_add_fetch.

__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
   70118:  f94037e3  ldr     x3, [sp, #104]
   7011c:  91002060  add     x0, x3, #0x8
   70120:  485ffc02  ldaxrh  w2, [x0]
   70124:  11000442  add     w2, w2, #0x1
   70128:  48057c02  stxrh   w5, w2, [x0]
   7012c:  35ffffa5  cbnz    w5, 70120 <mlx5_rx_burst_mprq+0xb48>

In general, I do not think this patch will lead to a sharp decline in performance. Maybe you can try other testbeds?

Thanks,
Phil
Hi Alexander, Thank you for testing this patch. Few comments below. <snip> > > Subject: Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for > > multi-packet RQ buffer refcnt > > > > Hi, > > > > We are also doing C11 atomics converting for other components. > > Your insight would be much appreciated. > > > > Thanks, > > Phil Yang > > > > > -----Original Message----- > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang > > > Sent: Tuesday, June 23, 2020 4:27 PM > > > To: dev@dpdk.org > > > Cc: matan@mellanox.com; shahafs@mellanox.com; > > > viacheslavo@mellanox.com; Honnappa Nagarahalli > > > <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd > > > <nd@arm.com> > > > Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for > > > multi-packet RQ buffer refcnt > > > > > > Use c11 atomics with explicit ordering instead of the rte_atomic ops > > > which enforce unnecessary barriers on aarch64. > > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com> > > > --- > > > v3: > > > Split from the patchset: > > > http://patchwork.dpdk.org/cover/68159/ > > > > > > drivers/net/mlx5/mlx5_rxq.c | 2 +- drivers/net/mlx5/mlx5_rxtx.c > > > | 16 +++++++++------- drivers/net/mlx5/mlx5_rxtx.h | 2 +- > > > 3 files changed, 11 insertions(+), 9 deletions(-) > > > > > > diff --git a/drivers/net/mlx5/mlx5_rxq.c > > > b/drivers/net/mlx5/mlx5_rxq.c index dda0073..7f487f1 100644 > > > --- a/drivers/net/mlx5/mlx5_rxq.c > > > +++ b/drivers/net/mlx5/mlx5_rxq.c > > > @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, > > > void *opaque_arg, > > > > > > memset(_m, 0, sizeof(*buf)); > > > buf->mp = mp; > > > -
rte_atomic16_set(&buf->refcnt, 1); > > > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED); > > > for (j = 0; j != strd_n; ++j) { > > > shinfo = &buf->shinfos[j]; > > > shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git > > > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index > > > e4106bf..f0eda88 100644 > > > --- a/drivers/net/mlx5/mlx5_rxtx.c > > > +++ b/drivers/net/mlx5/mlx5_rxtx.c > > > @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr > > __rte_unused, > > > void *opaque) { > > > struct mlx5_mprq_buf *buf = opaque; > > > > > > - if (rte_atomic16_read(&buf->refcnt) == 1) { > > > + if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) { > > > rte_mempool_put(buf->mp, buf); > > > - } else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) { > > > - rte_atomic16_set(&buf->refcnt, 1); > > > + } else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1, > > > + __ATOMIC_RELAXED) == 0)) { > > > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED); > > > rte_mempool_put(buf->mp, buf); > > > } > > > } > > > @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct > > > rte_mbuf **pkts, uint16_t pkts_n) > > > > > > if (consumed_strd == strd_n) { > > > /* Replace WQE only if the buffer is still in use. */ > > > - if (rte_atomic16_read(&buf->refcnt) > 1) { > > > + if (__atomic_load_n(&buf->refcnt, > > > + __ATOMIC_RELAXED) > 1) { > > > mprq_buf_replace(rxq, rq_ci & wq_mask, > > strd_n); > > > /* Release the old buffer. */ > > > mlx5_mprq_buf_free(buf); > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct > > > rte_mbuf **pkts, uint16_t pkts_n) > > > void *buf_addr; > > > > > > /* Increment the refcnt of the whole chunk. */ > > > - rte_atomic16_add_return(&buf->refcnt, 1); rte_atomic16_add_return includes a full barrier along with atomic operation. But is full barrier required here? For ex: __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would that be enough? 
> > > - MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf- > > > >refcnt) <= > > > - strd_n + 1); > > > + __atomic_add_fetch(&buf->refcnt, 1, > > > __ATOMIC_ACQUIRE); Can you replace just the above line with the following lines and test it? __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED); __atomic_thread_fence(__ATOMIC_ACQ_REL); This should make the generated code same as before this patch. Let me know if you would prefer us to re-spin the patch instead (for testing). > > > + MLX5_ASSERT(__atomic_load_n(&buf->refcnt, > > > + __ATOMIC_RELAXED) <= strd_n + 1); > > > buf_addr = RTE_PTR_SUB(addr, > > > RTE_PKTMBUF_HEADROOM); > > > /* > > > * MLX5 device doesn't use iova but it is necessary in a > > diff > > > --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h > > > index 26621ff..0fc15f3 100644 > > > --- a/drivers/net/mlx5/mlx5_rxtx.h > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h > > > @@ -78,7 +78,7 @@ struct rxq_zip { > > > /* Multi-Packet RQ buffer header. */ struct mlx5_mprq_buf { > > > struct rte_mempool *mp; > > > - rte_atomic16_t refcnt; /* Atomically accessed refcnt. */ > > > + uint16_t refcnt; /* Atomically accessed refcnt. */ > > > uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first > > packet. > > > */ > > > struct rte_mbuf_ext_shared_info shinfos[]; > > > /* > > > -- > > > 2.7.4
<snip> > > > > Subject: Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for > > > multi-packet RQ buffer refcnt > > > > > > Hi, > > > > > > We are also doing C11 atomics converting for other components. > > > Your insight would be much appreciated. > > > > > > Thanks, > > > Phil Yang > > > > > > > -----Original Message----- > > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang > > > > Sent: Tuesday, June 23, 2020 4:27 PM > > > > To: dev@dpdk.org > > > > Cc: matan@mellanox.com; shahafs@mellanox.com; > > > > viacheslavo@mellanox.com; Honnappa Nagarahalli > > > > <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd > > > > <nd@arm.com> > > > > Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for > > > > multi-packet RQ buffer refcnt > > > > > > > > Use c11 atomics with explicit ordering instead of the rte_atomic ops > > > > which enforce unnecessary barriers on aarch64. > > > > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com> > > > > --- <...> > > > > > > > > drivers/net/mlx5/mlx5_rxq.c | 2 +- drivers/net/mlx5/mlx5_rxtx.c > > > > | 16 +++++++++------- drivers/net/mlx5/mlx5_rxtx.h | 2 +- > > > > 3 files changed, 11 insertions(+), 9 deletions(-) > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_rxq.c > > > > b/drivers/net/mlx5/mlx5_rxq.c index dda0073..7f487f1 100644 > > > > --- a/drivers/net/mlx5/mlx5_rxq.c > > > > +++ b/drivers/net/mlx5/mlx5_rxq.c > > > > @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool > *mp, > > > > void *opaque_arg, > > > > > > > > memset(_m, 0, sizeof(*buf)); > > > > buf->mp = mp; > > > > -rte_atomic16_set(&buf->refcnt, 1); > > > > +__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED); > > > > for (j = 0; j != strd_n; ++j) { > > > > shinfo = &buf->shinfos[j]; > > > > shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git > > > > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index > > > > e4106bf..f0eda88 100644 > > > > --- a/drivers/net/mlx5/mlx5_rxtx.c > > > > +++ 
b/drivers/net/mlx5/mlx5_rxtx.c > > > > @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr > > > __rte_unused, > > > > void *opaque) { > > > > struct mlx5_mprq_buf *buf = opaque; > > > > > > > > -if (rte_atomic16_read(&buf->refcnt) == 1) { > > > > +if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) { > > > > rte_mempool_put(buf->mp, buf); > > > > -} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) { > > > > -rte_atomic16_set(&buf->refcnt, 1); > > > > +} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1, > > > > + __ATOMIC_RELAXED) == 0)) { > > > > +__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED); > > > > rte_mempool_put(buf->mp, buf); > > > > } > > > > } > > > > @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct > > > > rte_mbuf **pkts, uint16_t pkts_n) > > > > > > > > if (consumed_strd == strd_n) { > > > > /* Replace WQE only if the buffer is still in use. */ > > > > -if (rte_atomic16_read(&buf->refcnt) > 1) { > > > > +if (__atomic_load_n(&buf->refcnt, > > > > + __ATOMIC_RELAXED) > 1) { > > > > mprq_buf_replace(rxq, rq_ci & wq_mask, > > > strd_n); > > > > /* Release the old buffer. */ > > > > mlx5_mprq_buf_free(buf); > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct > > > > rte_mbuf **pkts, uint16_t pkts_n) > > > > void *buf_addr; > > > > > > > > /* Increment the refcnt of the whole chunk. */ > > > > -rte_atomic16_add_return(&buf->refcnt, 1); > rte_atomic16_add_return includes a full barrier along with atomic operation. > But is full barrier required here? For ex: __atomic_add_fetch(&buf->refcnt, 1, > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would that be > enough? > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf- > > > > >refcnt) <= > > > > - strd_n + 1); > > > > +__atomic_add_fetch(&buf->refcnt, 1, > > > > __ATOMIC_ACQUIRE); The atomic load in MLX5_ASSERT() accesses the same memory space as the previous __atomic_add_fetch() does. 
They will access this memory space in the program order when we enabled MLX5_PMD_DEBUG. So the ACQUIRE barrier in __atomic_add_fetch() becomes unnecessary. By changing it to RELAXED ordering, this patch got 7.6% performance improvement on N1 (making it generate A72 alike instructions). Could you please also try it on your testbed, Alex? > > Can you replace just the above line with the following lines and test it? > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED); > __atomic_thread_fence(__ATOMIC_ACQ_REL); > > This should make the generated code same as before this patch. Let me > know if you would prefer us to re-spin the patch instead (for testing). > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt, > > > > + __ATOMIC_RELAXED) <= strd_n + 1); > > > > buf_addr = RTE_PTR_SUB(addr, > > > > RTE_PKTMBUF_HEADROOM); > > > > /* > > > > * MLX5 device doesn't use iova but it is necessary in a > > > diff > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h > > > > index 26621ff..0fc15f3 100644 > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h > > > > @@ -78,7 +78,7 @@ struct rxq_zip { > > > > /* Multi-Packet RQ buffer header. */ struct mlx5_mprq_buf { > > > > struct rte_mempool *mp; > > > > -rte_atomic16_t refcnt; /* Atomically accessed refcnt. */ > > > > +uint16_t refcnt; /* Atomically accessed refcnt. */ > > > > uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first > > > packet. > > > > */ > > > > struct rte_mbuf_ext_shared_info shinfos[]; > > > > /* > > > > -- > > > > 2.7.4 >
> <snip> > > > > > > Subject: Re: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for > > > > multi-packet RQ buffer refcnt > > > > > > > > Hi, > > > > > > > > We are also doing C11 atomics converting for other components. > > > > Your insight would be much appreciated. > > > > > > > > Thanks, > > > > Phil Yang > > > > > > > > > -----Original Message----- > > > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang > > > > > Sent: Tuesday, June 23, 2020 4:27 PM > > > > > To: dev@dpdk.org > > > > > Cc: matan@mellanox.com; shahafs@mellanox.com; > > > > > viacheslavo@mellanox.com; Honnappa Nagarahalli > > > > > <Honnappa.Nagarahalli@arm.com>; drc@linux.vnet.ibm.com; nd > > > > > <nd@arm.com> > > > > > Subject: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for > > > > > multi-packet RQ buffer refcnt > > > > > > > > > > Use c11 atomics with explicit ordering instead of the rte_atomic > > > > > ops which enforce unnecessary barriers on aarch64. > > > > > > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com> > > > > > --- > <...> > > > > > > > > > > drivers/net/mlx5/mlx5_rxq.c | 2 +- > > > > > drivers/net/mlx5/mlx5_rxtx.c > > > > > | 16 +++++++++------- drivers/net/mlx5/mlx5_rxtx.h | 2 +- > > > > > 3 files changed, 11 insertions(+), 9 deletions(-) > > > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_rxq.c > > > > > b/drivers/net/mlx5/mlx5_rxq.c index dda0073..7f487f1 100644 > > > > > --- a/drivers/net/mlx5/mlx5_rxq.c > > > > > +++ b/drivers/net/mlx5/mlx5_rxq.c > > > > > @@ -1545,7 +1545,7 @@ mlx5_mprq_buf_init(struct rte_mempool > > *mp, > > > > > void *opaque_arg, > > > > > > > > > > memset(_m, 0, sizeof(*buf)); > > > > > buf->mp = mp; > > > > > -rte_atomic16_set(&buf->refcnt, 1); > > > > > +__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED); > > > > > for (j = 0; j != strd_n; ++j) { shinfo = &buf->shinfos[j]; > > > > > shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git > > > > > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c > > > > > index 
> > > > > e4106bf..f0eda88 100644 > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.c > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.c > > > > > @@ -1595,10 +1595,11 @@ mlx5_mprq_buf_free_cb(void *addr > > > > __rte_unused, > > > > > void *opaque) { > > > > > struct mlx5_mprq_buf *buf = opaque; > > > > > > > > > > -if (rte_atomic16_read(&buf->refcnt) == 1) { > > > > > +if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) { > > > > > rte_mempool_put(buf->mp, buf); > > > > > -} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) { > > > > > -rte_atomic16_set(&buf->refcnt, 1); > > > > > +} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1, > > > > > + __ATOMIC_RELAXED) == 0)) { > > > > > +__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED); > > > > > rte_mempool_put(buf->mp, buf); > > > > > } > > > > > } > > > > > @@ -1678,7 +1679,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, > struct > > > > > rte_mbuf **pkts, uint16_t pkts_n) > > > > > > > > > > if (consumed_strd == strd_n) { > > > > > /* Replace WQE only if the buffer is still in use. */ -if > > > > > (rte_atomic16_read(&buf->refcnt) > 1) { > > > > > +if (__atomic_load_n(&buf->refcnt, > > > > > + __ATOMIC_RELAXED) > 1) { > > > > > mprq_buf_replace(rxq, rq_ci & wq_mask, > > > > strd_n); > > > > > /* Release the old buffer. */ > > > > > mlx5_mprq_buf_free(buf); > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, > struct > > > > > rte_mbuf **pkts, uint16_t pkts_n) void *buf_addr; > > > > > > > > > > /* Increment the refcnt of the whole chunk. */ > > > > > -rte_atomic16_add_return(&buf->refcnt, 1); > > rte_atomic16_add_return includes a full barrier along with atomic > operation. > > But is full barrier required here? For ex: > > __atomic_add_fetch(&buf->refcnt, 1, > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would that be > > enough? 
> > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf- > > > > > >refcnt) <= > > > > > - strd_n + 1); > > > > > +__atomic_add_fetch(&buf->refcnt, 1, > > > > > __ATOMIC_ACQUIRE); > > The atomic load in MLX5_ASSERT() accesses the same memory space as the > previous __atomic_add_fetch() does. > They will access this memory space in the program order when we enabled > MLX5_PMD_DEBUG. So the ACQUIRE barrier in __atomic_add_fetch() > becomes unnecessary. > > By changing it to RELAXED ordering, this patch got 7.6% performance > improvement on N1 (making it generate A72 alike instructions). > > Could you please also try it on your testbed, Alex? Situation got better with this modification, here are the results: - no patch: 3.0 Mpps CPU cycles/packet=51.52 - original patch: 2.1 Mpps CPU cycles/packet=71.05 - modified patch: 2.9 Mpps CPU cycles/packet=52.79 Also, I found that the degradation is there only in case I enable bursts stats. Could you please turn on the following config options and see if you can reproduce this as well? CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y > > > > Can you replace just the above line with the following lines and test it? > > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED); > > __atomic_thread_fence(__ATOMIC_ACQ_REL); > > > > This should make the generated code same as before this patch. Let me > > know if you would prefer us to re-spin the patch instead (for testing). 
> > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt, > > > > > + __ATOMIC_RELAXED) <= strd_n + 1); > > > > > buf_addr = RTE_PTR_SUB(addr, > > > > > RTE_PKTMBUF_HEADROOM); > > > > > /* > > > > > * MLX5 device doesn't use iova but it is necessary in a > > > > diff > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3 100644 > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h > > > > > @@ -78,7 +78,7 @@ struct rxq_zip { > > > > > /* Multi-Packet RQ buffer header. */ struct mlx5_mprq_buf { > > > > > struct rte_mempool *mp; -rte_atomic16_t refcnt; /* Atomically > > > > > accessed refcnt. */ > > > > > +uint16_t refcnt; /* Atomically accessed refcnt. */ > > > > > uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first > > > > packet. > > > > > */ > > > > > struct rte_mbuf_ext_shared_info shinfos[]; > > > > > /* > > > > > -- > > > > > 2.7.4 > >
Alexander Kozyrev <akozyrev@mellanox.com> writes: <snip> > > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, > > struct > > > > > > rte_mbuf **pkts, uint16_t pkts_n) void *buf_addr; > > > > > > > > > > > > /* Increment the refcnt of the whole chunk. */ > > > > > > -rte_atomic16_add_return(&buf->refcnt, 1); > > > rte_atomic16_add_return includes a full barrier along with atomic > > operation. > > > But is full barrier required here? For ex: > > > __atomic_add_fetch(&buf->refcnt, 1, > > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would that be > > > enough? > > > > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf- > > > > > > >refcnt) <= > > > > > > - strd_n + 1); > > > > > > +__atomic_add_fetch(&buf->refcnt, 1, > > > > > > __ATOMIC_ACQUIRE); > > > > The atomic load in MLX5_ASSERT() accesses the same memory space as the > > previous __atomic_add_fetch() does. > > They will access this memory space in the program order when we enabled > > MLX5_PMD_DEBUG. So the ACQUIRE barrier in __atomic_add_fetch() > > becomes unnecessary. > > > > By changing it to RELAXED ordering, this patch got 7.6% performance > > improvement on N1 (making it generate A72 alike instructions). > > > > Could you please also try it on your testbed, Alex? > > Situation got better with this modification, here are the results: > - no patch: 3.0 Mpps CPU cycles/packet=51.52 > - original patch: 2.1 Mpps CPU cycles/packet=71.05 > - modified patch: 2.9 Mpps CPU cycles/packet=52.79 > Also, I found that the degradation is there only in case I enable bursts stats. Great! So this patch will not hurt the normal datapath performance. > Could you please turn on the following config options and see if you can > reproduce this as well? > CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y > CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y Thanks, Alex. Some updates. Slightly (about 1%) throughput degradation was detected after we enabled these two config options on N1 SoC. 
If we look inside the perf stats results: with this patch, both mlx5_rx_burst and mlx5_tx_burst consume fewer CPU cycles than the original code. However, __memcpy_generic takes more cycles. I think that might be the reason for the CPU-cycles-per-packet increment after applying this patch.

Original code:
98.07%--pkt_burst_io_forward
        |
        |--44.53%--__memcpy_generic
        |
        |--35.85%--mlx5_rx_burst_mprq
        |
        |--15.94%--mlx5_tx_burst_none_empw
        |          |
        |          |--7.32%--mlx5_tx_handle_completion.isra.0
        |          |
        |          --0.50%--__memcpy_generic
        |
        --1.14%--memcpy@plt

Use C11 with RELAXED ordering:
99.36%--pkt_burst_io_forward
        |
        |--47.40%--__memcpy_generic
        |
        |--34.62%--mlx5_rx_burst_mprq
        |
        |--15.55%--mlx5_tx_burst_none_empw
        |          |
        |          --7.08%--mlx5_tx_handle_completion.isra.0
        |
        --1.17%--memcpy@plt

BTW, all the atomic operations in this patch are not the hotspot.

> > > > > > > Can you replace just the above line with the following lines and test it? > > > > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED); > > > __atomic_thread_fence(__ATOMIC_ACQ_REL); > > > > > > This should make the generated code same as before this patch. Let me > > > know if you would prefer us to re-spin the patch instead (for testing). > > > > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt, > > > > > > + __ATOMIC_RELAXED) <= strd_n + 1); > > > > > > buf_addr = RTE_PTR_SUB(addr, > > > > > > RTE_PKTMBUF_HEADROOM); > > > > > > /* > > > > > > * MLX5 device doesn't use iova but it is necessary in a > > > > > diff > > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h > > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3 100644 > > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h > > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h <snip> > > >
> Phil Yang <Phil.Yang@arm.com> writes: > > <snip> > > > > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, > > > struct > > > > > > > rte_mbuf **pkts, uint16_t pkts_n) void *buf_addr; > > > > > > > > > > > > > > /* Increment the refcnt of the whole chunk. */ > > > > > > > -rte_atomic16_add_return(&buf->refcnt, 1); > > > > rte_atomic16_add_return includes a full barrier along with atomic > > > operation. > > > > But is full barrier required here? For ex: > > > > __atomic_add_fetch(&buf->refcnt, 1, > > > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would that > > > > be enough? > > > > > > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf- > > > > > > > >refcnt) <= > > > > > > > - strd_n + 1); > > > > > > > +__atomic_add_fetch(&buf->refcnt, 1, > > > > > > > __ATOMIC_ACQUIRE); > > > > > > The atomic load in MLX5_ASSERT() accesses the same memory space as > > > the previous __atomic_add_fetch() does. > > > They will access this memory space in the program order when we > > > enabled MLX5_PMD_DEBUG. So the ACQUIRE barrier in > > > __atomic_add_fetch() becomes unnecessary. > > > > > > By changing it to RELAXED ordering, this patch got 7.6% performance > > > improvement on N1 (making it generate A72 alike instructions). > > > > > > Could you please also try it on your testbed, Alex? > > > > Situation got better with this modification, here are the results: > > - no patch: 3.0 Mpps CPU cycles/packet=51.52 > > - original patch: 2.1 Mpps CPU cycles/packet=71.05 > > - modified patch: 2.9 Mpps CPU cycles/packet=52.79 Also, I found that > > the degradation is there only in case I enable bursts stats. > > > Great! So this patch will not hurt the normal datapath performance. > > > > Could you please turn on the following config options and see if you > > can reproduce this as well? > > CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y > > CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y > > Thanks, Alex. Some updates. 
> > Slightly (about 1%) throughput degradation was detected after we enabled > these two config options on N1 SoC. > > If we look insight the perf stats results, with this patch, both mlx5_rx_burst > and mlx5_tx_burst consume fewer CPU cycles than the original code. > However, __memcpy_generic takes more cycles. I think that might be the > reason for CPU cycles per packet increment after applying this patch. > > Original code: > 98.07%--pkt_burst_io_forward > | > |--44.53%--__memcpy_generic > | > |--35.85%--mlx5_rx_burst_mprq > | > |--15.94%--mlx5_tx_burst_none_empw > | | > | |--7.32%--mlx5_tx_handle_completion.isra.0 > | | > | --0.50%--__memcpy_generic > | > --1.14%--memcpy@plt > > Use C11 with RELAXED ordering: > 99.36%--pkt_burst_io_forward > | > |--47.40%--__memcpy_generic > | > |--34.62%--mlx5_rx_burst_mprq > | > |--15.55%--mlx5_tx_burst_none_empw > | | > | --7.08%--mlx5_tx_handle_completion.isra.0 > | > --1.17%--memcpy@plt > > BTW, all the atomic operations in this patch are not the hotspot. Phil, we are seeing much worse degradation on our ARM platform unfortunately. I don't think that discrepancy in memcpy can explain this behavior. Your patch is not touching this area of code. Let me collect some perf stat on our side. > > > > > > > > > > > Can you replace just the above line with the following lines and test it? > > > > > > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED); > > > > __atomic_thread_fence(__ATOMIC_ACQ_REL); > > > > > > > > This should make the generated code same as before this patch. Let > > > > me know if you would prefer us to re-spin the patch instead (for > testing). 
> > > > > > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt, > > > > > > > + __ATOMIC_RELAXED) <= strd_n + 1); > > > > > > > buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM); > > > > > > > /* > > > > > > > * MLX5 device doesn't use iova but it is necessary in a > > > > > > diff > > > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h > > > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3 100644 > > > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h > > > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h > <snip> > > > >
<snip> > > > > > > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, > > > > struct > > > > > > > > rte_mbuf **pkts, uint16_t pkts_n) void *buf_addr; > > > > > > > > > > > > > > > > /* Increment the refcnt of the whole chunk. */ > > > > > > > > -rte_atomic16_add_return(&buf->refcnt, 1); > > > > > rte_atomic16_add_return includes a full barrier along with > > > > > atomic > > > > operation. > > > > > But is full barrier required here? For ex: > > > > > __atomic_add_fetch(&buf->refcnt, 1, > > > > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would > > > > > that be enough? > > > > > > > > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf- > > > > > > > > >refcnt) <= > > > > > > > > - strd_n + 1); > > > > > > > > +__atomic_add_fetch(&buf->refcnt, 1, > > > > > > > > __ATOMIC_ACQUIRE); > > > > > > > > The atomic load in MLX5_ASSERT() accesses the same memory space as > > > > the previous __atomic_add_fetch() does. > > > > They will access this memory space in the program order when we > > > > enabled MLX5_PMD_DEBUG. So the ACQUIRE barrier in > > > > __atomic_add_fetch() becomes unnecessary. > > > > > > > > By changing it to RELAXED ordering, this patch got 7.6% > > > > performance improvement on N1 (making it generate A72 alike > instructions). > > > > > > > > Could you please also try it on your testbed, Alex? > > > > > > Situation got better with this modification, here are the results: > > > - no patch: 3.0 Mpps CPU cycles/packet=51.52 > > > - original patch: 2.1 Mpps CPU cycles/packet=71.05 > > > - modified patch: 2.9 Mpps CPU cycles/packet=52.79 Also, I found > > > that the degradation is there only in case I enable bursts stats. > > > > > > Great! So this patch will not hurt the normal datapath performance. > > > > > > > Could you please turn on the following config options and see if you > > > can reproduce this as well? 
> > > CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y > > > CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y > > > > Thanks, Alex. Some updates. > > > > Slightly (about 1%) throughput degradation was detected after we > > enabled these two config options on N1 SoC. > > > > If we look insight the perf stats results, with this patch, both > > mlx5_rx_burst and mlx5_tx_burst consume fewer CPU cycles than the > original code. > > However, __memcpy_generic takes more cycles. I think that might be the > > reason for CPU cycles per packet increment after applying this patch. > > > > Original code: > > 98.07%--pkt_burst_io_forward > > | > > |--44.53%--__memcpy_generic > > | > > |--35.85%--mlx5_rx_burst_mprq > > | > > |--15.94%--mlx5_tx_burst_none_empw > > | | > > | |--7.32%--mlx5_tx_handle_completion.isra.0 > > | | > > | --0.50%--__memcpy_generic > > | > > --1.14%--memcpy@plt > > > > Use C11 with RELAXED ordering: > > 99.36%--pkt_burst_io_forward > > | > > |--47.40%--__memcpy_generic > > | > > |--34.62%--mlx5_rx_burst_mprq > > | > > |--15.55%--mlx5_tx_burst_none_empw > > | | > > | --7.08%--mlx5_tx_handle_completion.isra.0 > > | > > --1.17%--memcpy@plt > > > > BTW, all the atomic operations in this patch are not the hotspot. > > Phil, we are seeing much worse degradation on our ARM platform > unfortunately. > I don't think that discrepancy in memcpy can explain this behavior. > Your patch is not touching this area of code. Let me collect some perf stat on > our side. Are you testing the patch as is or have you made the changes that were discussed in the thread? > > > > > > > > > > > > > > > > Can you replace just the above line with the following lines and test it? > > > > > > > > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED); > > > > > __atomic_thread_fence(__ATOMIC_ACQ_REL); > > > > > > > > > > This should make the generated code same as before this patch. > > > > > Let me know if you would prefer us to re-spin the patch instead > > > > > (for > > testing). 
> > > > > > > > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt, > > > > > > > > + __ATOMIC_RELAXED) <= strd_n + 1); > > > > > > > > buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM); > > > > > > > > /* > > > > > > > > * MLX5 device doesn't use iova but it is necessary in a > > > > > > > diff > > > > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h > > > > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3 > > > > > > > > 100644 > > > > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h > > > > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h > > <snip> > > > > >
> <snip> > > > > > > > > > > > > > @@ -1790,9 +1792,9 @@ mlx5_rx_burst_mprq(void > *dpdk_rxq, > > > > > struct > > > > > > > > > rte_mbuf **pkts, uint16_t pkts_n) void *buf_addr; > > > > > > > > > > > > > > > > > > /* Increment the refcnt of the whole chunk. */ > > > > > > > > > -rte_atomic16_add_return(&buf->refcnt, 1); > > > > > > rte_atomic16_add_return includes a full barrier along with > > > > > > atomic > > > > > operation. > > > > > > But is full barrier required here? For ex: > > > > > > __atomic_add_fetch(&buf->refcnt, 1, > > > > > > __ATOMIC_RELAXED) will offer atomicity, but no barrier. Would > > > > > > that be enough? > > > > > > > > > > > > > > > -MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf- > > > > > > > > > >refcnt) <= > > > > > > > > > - strd_n + 1); > > > > > > > > > +__atomic_add_fetch(&buf->refcnt, 1, > > > > > > > > > __ATOMIC_ACQUIRE); > > > > > > > > > > The atomic load in MLX5_ASSERT() accesses the same memory space > > > > > as the previous __atomic_add_fetch() does. > > > > > They will access this memory space in the program order when we > > > > > enabled MLX5_PMD_DEBUG. So the ACQUIRE barrier in > > > > > __atomic_add_fetch() becomes unnecessary. > > > > > > > > > > By changing it to RELAXED ordering, this patch got 7.6% > > > > > performance improvement on N1 (making it generate A72 alike > > instructions). > > > > > > > > > > Could you please also try it on your testbed, Alex? > > > > > > > > Situation got better with this modification, here are the results: > > > > - no patch: 3.0 Mpps CPU cycles/packet=51.52 > > > > - original patch: 2.1 Mpps CPU cycles/packet=71.05 > > > > - modified patch: 2.9 Mpps CPU cycles/packet=52.79 Also, I found > > > > that the degradation is there only in case I enable bursts stats. > > > > > > > > > Great! So this patch will not hurt the normal datapath performance. > > > > > > > > > > Could you please turn on the following config options and see if > > > > you can reproduce this as well? 
> > > > CONFIG_RTE_TEST_PMD_RECORD_CORE_CYCLES=y
> > > > CONFIG_RTE_TEST_PMD_RECORD_BURST_STATS=y
> > >
> > > Thanks, Alex. Some updates.
> > >
> > > Slightly (about 1%) throughput degradation was detected after we
> > > enabled these two config options on N1 SoC.
> > >
> > > If we look inside the perf stats results, with this patch, both
> > > mlx5_rx_burst and mlx5_tx_burst consume fewer CPU cycles than the
> > original code.
> > > However, __memcpy_generic takes more cycles. I think that might be
> > > the reason for CPU cycles per packet increment after applying this patch.
> > >
> > > Original code:
> > > 98.07%--pkt_burst_io_forward
> > >    |
> > >    |--44.53%--__memcpy_generic
> > >    |
> > >    |--35.85%--mlx5_rx_burst_mprq
> > >    |
> > >    |--15.94%--mlx5_tx_burst_none_empw
> > >    |    |
> > >    |    |--7.32%--mlx5_tx_handle_completion.isra.0
> > >    |    |
> > >    |     --0.50%--__memcpy_generic
> > >    |
> > >     --1.14%--memcpy@plt
> > >
> > > Use C11 with RELAXED ordering:
> > > 99.36%--pkt_burst_io_forward
> > >    |
> > >    |--47.40%--__memcpy_generic
> > >    |
> > >    |--34.62%--mlx5_rx_burst_mprq
> > >    |
> > >    |--15.55%--mlx5_tx_burst_none_empw
> > >    |    |
> > >    |     --7.08%--mlx5_tx_handle_completion.isra.0
> > >    |
> > >     --1.17%--memcpy@plt
> > >
> > > BTW, all the atomic operations in this patch are not the hotspot.
> >
> > Phil, we are seeing much worse degradation on our ARM platform
> > unfortunately.
> > I don't think that discrepancy in memcpy can explain this behavior.
> > Your patch is not touching this area of code. Let me collect some perf
> > stat on our side.
> Are you testing the patch as is or have you made the changes that were
> discussed in the thread?
>
Yes, I made the changes you suggested. It really gets better with them.
Could you please respin the patch to make sure I got it right in my environment?
> > > > > >
> > > > > > Can you replace just the above line with the following lines and test
> it?
> > > > > >
> > > > > > __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > > > __atomic_thread_fence(__ATOMIC_ACQ_REL);
> > > > > >
> > > > > > This should make the generated code same as before this patch.
> > > > > > Let me know if you would prefer us to re-spin the patch
> > > > > > instead (for
> > > testing).
> > > > > >
> > > > > > > > > +MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > > > > > > + __ATOMIC_RELAXED) <= strd_n + 1);
> > > > > > > > > buf_addr = RTE_PTR_SUB(addr,
> RTE_PKTMBUF_HEADROOM);
> > > > > > > > > /*
> > > > > > > > > * MLX5 device doesn't use iova but it is necessary in
> > > > > > > > > a
> > > > > > > > diff
> > > > > > > > > --git a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > > b/drivers/net/mlx5/mlx5_rxtx.h index 26621ff..0fc15f3
> > > > > > > > > 100644
> > > > > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > <snip>
> > > > > >
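The two variants compared in the exchange above — a fully relaxed increment (the v4 behavior), and a relaxed RMW followed by a standalone acquire-release fence, which should generate the same code as the pre-patch full-barrier `rte_atomic16_add_return()` — can be sketched with plain GCC/Clang `__atomic` builtins. This is a hypothetical standalone snippet for illustration, not the mlx5 sources; the helper names are invented here:

```c
#include <stdint.h>

/* Hypothetical sketch of the refcnt increments under discussion. */
static uint16_t refcnt = 1;

/* Relaxed increment: atomicity only, no ordering (v4 patch behavior). */
static inline uint16_t refcnt_inc_relaxed(void)
{
	return __atomic_add_fetch(&refcnt, 1, __ATOMIC_RELAXED);
}

/* Relaxed RMW plus a standalone acquire-release fence: intended to match
 * the pre-patch full-barrier code generation, useful for A/B testing. */
static inline uint16_t refcnt_inc_fenced(void)
{
	uint16_t v = __atomic_add_fetch(&refcnt, 1, __ATOMIC_RELAXED);
	__atomic_thread_fence(__ATOMIC_ACQ_REL);
	return v;
}
```

Both helpers return the post-increment value, so an assertion such as the `MLX5_ASSERT()` in the thread can check the counter with a separate relaxed load.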
Use c11 atomics with RELAXED ordering instead of the rte_atomic ops
which enforce unnecessary barriers on aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
---
v4:
Remove the unnecessary ACQUIRE barrier in rx burst path. (Honnappa)

v3:
Split from the patchset:
http://patchwork.dpdk.org/cover/68159/

 drivers/net/mlx5/mlx5_rxq.c  |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
 drivers/net/mlx5/mlx5_rxtx.h |  2 +-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 79eb8f8..40e0239 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -2012,7 +2012,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,

 	memset(_m, 0, sizeof(*buf));
 	buf->mp = mp;
-	rte_atomic16_set(&buf->refcnt, 1);
+	__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 	for (j = 0; j != strd_n; ++j) {
 		shinfo = &buf->shinfos[j];
 		shinfo->free_cb = mlx5_mprq_buf_free_cb;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 1b71e94..549477b 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1626,10 +1626,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused, void *opaque)
 {
 	struct mlx5_mprq_buf *buf = opaque;

-	if (rte_atomic16_read(&buf->refcnt) == 1) {
+	if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
 		rte_mempool_put(buf->mp, buf);
-	} else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
-		rte_atomic16_set(&buf->refcnt, 1);
+	} else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
+			    __ATOMIC_RELAXED) == 0)) {
+		__atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
 		rte_mempool_put(buf->mp, buf);
 	}
 }
@@ -1709,7 +1710,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)

 		if (consumed_strd == strd_n) {
 			/* Replace WQE only if the buffer is still in use. */
-			if (rte_atomic16_read(&buf->refcnt) > 1) {
+			if (__atomic_load_n(&buf->refcnt,
+			    __ATOMIC_RELAXED) > 1) {
 				mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
 				/* Release the old buffer. */
 				mlx5_mprq_buf_free(buf);
@@ -1821,9 +1823,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			void *buf_addr;

 			/* Increment the refcnt of the whole chunk. */
-			rte_atomic16_add_return(&buf->refcnt, 1);
-			MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
-				    strd_n + 1);
+			__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
+			MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
+				    __ATOMIC_RELAXED) <= strd_n + 1);
 			buf_addr = RTE_PTR_SUB(addr, RTE_PKTMBUF_HEADROOM);
 			/*
 			 * MLX5 device doesn't use iova but it is necessary in a
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index c02a007..467f31d 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -68,7 +68,7 @@ struct rxq_zip {
 /* Multi-Packet RQ buffer header. */
 struct mlx5_mprq_buf {
 	struct rte_mempool *mp;
-	rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
+	uint16_t refcnt; /* Atomically accessed refcnt. */
 	uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
 	struct rte_mbuf_ext_shared_info shinfos[];
 	/*
--
2.7.4
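A point worth making explicit about the patch above: switching `refcnt` from `rte_atomic16_t` to a plain `uint16_t` accessed through `__atomic` builtins with `__ATOMIC_RELAXED` drops the ordering guarantees, not the atomicity — concurrent increments still cannot be lost. The following hypothetical sketch (not mlx5 code; the names are invented here) demonstrates this with a contended counter:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: relaxed __atomic ops on a plain uint16_t, as the patch uses
 * for mlx5_mprq_buf.refcnt. RELAXED drops ordering, not atomicity, so
 * no increment is lost even under contention. */
#define NTHREADS 4
#define NITERS 1000

static uint16_t refcnt;

static void *bump(void *arg)
{
	(void)arg;
	for (int i = 0; i < NITERS; i++)
		__atomic_add_fetch(&refcnt, 1, __ATOMIC_RELAXED);
	return NULL;
}

/* Start at 1 (as mlx5_mprq_buf_init() does) and hammer the counter
 * from several threads; return the final value. */
static uint16_t run_bump_threads(void)
{
	pthread_t t[NTHREADS];

	__atomic_store_n(&refcnt, 1, __ATOMIC_RELAXED);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, bump, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return __atomic_load_n(&refcnt, __ATOMIC_RELAXED);
}
```

On aarch64 the relaxed RMW avoids the `dmb` full barrier that the old `rte_atomic16_add_return()` implied, which is where the performance gain discussed in this thread comes from.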
<snip>
> > > Phil, we are seeing much worse degradation on our ARM platform
> > > unfortunately.
> > > I don't think that discrepancy in memcpy can explain this behavior.
> > > Your patch is not touching this area of code. Let me collect some perf
> > > stat on our side.
> > Are you testing the patch as is or have you made the changes that were
> > discussed in the thread?
> >
> Yes, I made the changes you suggested. It really gets better with them.
> Could you please respin the patch to make sure I got it right in my
> environment?

Thanks, Alex.
Please check the new version here.
http://patchwork.dpdk.org/patch/76335/
<snip>
> <snip>
>
> Thanks, Alex.
> Please check the new version here.
> http://patchwork.dpdk.org/patch/76335/

This patch is definitely better, do not see a degradation anymore, thank you.

Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>
<snip>
<snip>
>
> Use c11 atomics with RELAXED ordering instead of the rte_atomic ops which
> enforce unnecessary barriers on aarch64.
>
> Signed-off-by: Phil Yang <phil.yang@arm.com>
Looks good.

Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
<snip>
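The reviewed `mlx5_mprq_buf_free_cb()` logic can be exercised in isolation. This hypothetical sketch (not the mlx5 sources; `pool_puts` is an invented stand-in for `rte_mempool_put()`) shows the two release paths: the common case where `refcnt == 1` skips the atomic RMW entirely, and the shared case where a relaxed decrement detects the last reference and re-arms the counter to 1 before recycling:

```c
#include <stdint.h>

/* Hypothetical sketch of the dual-path release logic under review. */
static uint16_t refcnt;
static int pool_puts;	/* stand-in counter for rte_mempool_put() */

static void buf_free(void)
{
	if (__atomic_load_n(&refcnt, __ATOMIC_RELAXED) == 1) {
		pool_puts++;	/* sole owner: recycle directly, no RMW */
	} else if (__atomic_sub_fetch(&refcnt, 1, __ATOMIC_RELAXED) == 0) {
		__atomic_store_n(&refcnt, 1, __ATOMIC_RELAXED);
		pool_puts++;	/* last reference dropped: re-arm and recycle */
	}
}
```

With `refcnt` at 2 (one extra holder), the first `buf_free()` only decrements; the second sees the sole-owner fast path and returns the buffer.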
<snip>
> > > > > Phil, we are seeing much worse degradation on our ARM platform
> > > > > unfortunately.
> > > > > I don't think that discrepancy in memcpy can explain this behavior.
> > > > > Your patch is not touching this area of code. Let me collect
> > > > > some perf stat on our side.
> > > > Are you testing the patch as is or have you made the changes that
> > > > were discussed in the thread?
> > > >
> > >
> > > Yes, I made the changes you suggested. It really gets better with them.
> > > Could you please respin the patch to make sure I got it right in my
> > > environment?
> >
> > Thanks, Alex.
> > Please check the new version here.
> > http://patchwork.dpdk.org/patch/76335/
>
> This patch is definitely better, do not see a degradation anymore, thank you.
>
> Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>
Can you please add your Acked-by tag in v4 (Phil provided the link above)?
<snip>
> <snip>
> >
> > Use c11 atomics with RELAXED ordering instead of the rte_atomic ops
> > which enforce unnecessary barriers on aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> Looks good.
>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>
<snip>
Hi Raslan,
It seems that there are no more comments for this patch.
So shall we proceed further?
Thanks,
Phil Yang
> -----Original Message-----
> From: Alexander Kozyrev <akozyrev@nvidia.com>
> Sent: Thursday, September 10, 2020 9:37 AM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Phil Yang
> <Phil.Yang@arm.com>; akozyrev@mellanox.com; rasland@mellanox.com;
> dev@dpdk.org
> Cc: Phil Yang <Phil.Yang@arm.com>; matan@mellanox.com; Shahaf Shuler
> <shahafs@mellanox.com>; viacheslavo@mellanox.com; nd <nd@arm.com>;
> nd <nd@arm.com>
> Subject: RE: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ
> buffer refcnt
>
> > <snip>
> >
> > >
> > > Use c11 atomics with RELAXED ordering instead of the rte_atomic ops
> > > which enforce unnecessary barriers on aarch64.
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Looks good.
> >
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>
> Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>
>
> >
> > > ---
> > > v4:
> > > Remove the unnecessary ACQUIRE barrier in rx burst path. (Honnappa)
> > >
> > > v3:
> > > Split from the patchset:
> > > http://patchwork.dpdk.org/cover/68159/
> > >
> > > drivers/net/mlx5/mlx5_rxq.c | 2 +-
> > > drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> > > drivers/net/mlx5/mlx5_rxtx.h | 2 +-
> > > 3 files changed, 11 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
> > > index
> > > 79eb8f8..40e0239 100644
> > > --- a/drivers/net/mlx5/mlx5_rxq.c
> > > +++ b/drivers/net/mlx5/mlx5_rxq.c
> > > @@ -2012,7 +2012,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp,
> void
> > > *opaque_arg,
> > >
> > > memset(_m, 0, sizeof(*buf));
> > > buf->mp = mp;
> > > - rte_atomic16_set(&buf->refcnt, 1);
> > > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > for (j = 0; j != strd_n; ++j) {
> > > shinfo = &buf->shinfos[j];
> > > shinfo->free_cb = mlx5_mprq_buf_free_cb; diff --git
> > > a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index
> > > 1b71e94..549477b 100644
> > > --- a/drivers/net/mlx5/mlx5_rxtx.c
> > > +++ b/drivers/net/mlx5/mlx5_rxtx.c
> > > @@ -1626,10 +1626,11 @@ mlx5_mprq_buf_free_cb(void *addr
> > __rte_unused,
> > > void *opaque) {
> > > struct mlx5_mprq_buf *buf = opaque;
> > >
> > > - if (rte_atomic16_read(&buf->refcnt) == 1) {
> > > + if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> > > rte_mempool_put(buf->mp, buf);
> > > - } else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> > > - rte_atomic16_set(&buf->refcnt, 1);
> > > + } else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> > > + __ATOMIC_RELAXED) == 0)) {
> > > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > rte_mempool_put(buf->mp, buf);
> > > }
> > > }
> > > @@ -1709,7 +1710,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > > rte_mbuf **pkts, uint16_t pkts_n)
> > >
> > > if (consumed_strd == strd_n) {
> > > /* Replace WQE only if the buffer is still in use. */
> > > - if (rte_atomic16_read(&buf->refcnt) > 1) {
> > > + if (__atomic_load_n(&buf->refcnt,
> > > + __ATOMIC_RELAXED) > 1) {
> > > mprq_buf_replace(rxq, rq_ci & wq_mask,
> > strd_n);
> > > /* Release the old buffer. */
> > > mlx5_mprq_buf_free(buf);
> > > @@ -1821,9 +1823,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct
> > > rte_mbuf **pkts, uint16_t pkts_n)
> > > void *buf_addr;
> > >
> > > /* Increment the refcnt of the whole chunk. */
> > > - rte_atomic16_add_return(&buf->refcnt, 1);
> > > - MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf-
> > > >refcnt) <=
> > > - strd_n + 1);
> > > + __atomic_add_fetch(&buf->refcnt, 1,
> > > __ATOMIC_RELAXED);
> > > + MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > + __ATOMIC_RELAXED) <= strd_n + 1);
> > > buf_addr = RTE_PTR_SUB(addr,
> > > RTE_PKTMBUF_HEADROOM);
> > > /*
> > > * MLX5 device doesn't use iova but it is necessary in
> a
> > diff
> > > --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > > index c02a007..467f31d 100644
> > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > @@ -68,7 +68,7 @@ struct rxq_zip {
> > > /* Multi-Packet RQ buffer header. */
> > > struct mlx5_mprq_buf {
> > > struct rte_mempool *mp;
> > > - rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> > > + uint16_t refcnt; /* Atomically accessed refcnt. */
> > > uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first
> > packet.
> > > */
> > > struct rte_mbuf_ext_shared_info shinfos[];
> > > /*
> > > --
> > > 2.7.4
Looks good to me, and we've checked that there is no performance impact.
Thank you.
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> Sent: Tuesday, September 29, 2020 18:23
> To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> Cc: nd <nd@arm.com>; Alexander Kozyrev <akozyrev@nvidia.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-
> packet RQ buffer refcnt
>
> Hi Raslan,
>
> It seems that there are no more comments for this patch.
> So shall we proceed further?
>
> Thanks,
> Phil Yang
>
> > -----Original Message-----
> > From: Alexander Kozyrev <akozyrev@nvidia.com>
> > Sent: Thursday, September 10, 2020 9:37 AM
> > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Phil Yang
> > <Phil.Yang@arm.com>; akozyrev@mellanox.com; rasland@mellanox.com;
> > dev@dpdk.org
> > Cc: Phil Yang <Phil.Yang@arm.com>; matan@mellanox.com; Shahaf Shuler
> > <shahafs@mellanox.com>; viacheslavo@mellanox.com; nd
> <nd@arm.com>; nd
> > <nd@arm.com>
> > Subject: RE: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ
> > buffer refcnt
> >
> > > <snip>
> > >
> > > >
> > > > Use c11 atomics with RELAXED ordering instead of the rte_atomic
> > > > ops which enforce unnecessary barriers on aarch64.
> > > >
> > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > Looks good.
> > >
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> > Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>
> >
> > >
> > > > ---
> > > > v4:
> > > > Remove the unnecessary ACQUIRE barrier in rx burst path.
> > > > (Honnappa)
> > > >
> > > > v3:
> > > > Split from the patchset:
> > > >
> > > > http://patchwork.dpdk.org/cover/68159/
> > > >
> > > > drivers/net/mlx5/mlx5_rxq.c | 2 +-
> > > > drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> > > > drivers/net/mlx5/mlx5_rxtx.h | 2 +-
> > > > 3 files changed, 11 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/drivers/net/mlx5/mlx5_rxq.c
> > > > b/drivers/net/mlx5/mlx5_rxq.c index
> > > > 79eb8f8..40e0239 100644
> > > > --- a/drivers/net/mlx5/mlx5_rxq.c
> > > > +++ b/drivers/net/mlx5/mlx5_rxq.c
> > > > @@ -2012,7 +2012,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,
> > > >
> > > > memset(_m, 0, sizeof(*buf));
> > > > buf->mp = mp;
> > > > - rte_atomic16_set(&buf->refcnt, 1);
> > > > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > for (j = 0; j != strd_n; ++j) {
> > > > shinfo = &buf->shinfos[j];
> > > > shinfo->free_cb = mlx5_mprq_buf_free_cb;
> > > >
> > > > diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> > > > index 1b71e94..549477b 100644
> > > > --- a/drivers/net/mlx5/mlx5_rxtx.c
> > > > +++ b/drivers/net/mlx5/mlx5_rxtx.c
> > > > @@ -1626,10 +1626,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused,
> > > > void *opaque)
> > > > {
> > > > struct mlx5_mprq_buf *buf = opaque;
> > > >
> > > > - if (rte_atomic16_read(&buf->refcnt) == 1) {
> > > > + if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> > > > rte_mempool_put(buf->mp, buf);
> > > > - } else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> > > > - rte_atomic16_set(&buf->refcnt, 1);
> > > > + } else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> > > > + __ATOMIC_RELAXED) == 0)) {
> > > > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > rte_mempool_put(buf->mp, buf);
> > > > }
> > > > }
> > > > @@ -1709,7 +1710,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
> > > >
> > > > if (consumed_strd == strd_n) {
> > > > /* Replace WQE only if the buffer is still in use. */
> > > > - if (rte_atomic16_read(&buf->refcnt) > 1) {
> > > > + if (__atomic_load_n(&buf->refcnt,
> > > > + __ATOMIC_RELAXED) > 1) {
> > > > mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
> > > > /* Release the old buffer. */
> > > > mlx5_mprq_buf_free(buf);
> > > > @@ -1821,9 +1823,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
> > > > void *buf_addr;
> > > >
> > > > /* Increment the refcnt of the whole chunk. */
> > > > - rte_atomic16_add_return(&buf->refcnt, 1);
> > > > - MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
> > > > - strd_n + 1);
> > > > + __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > + MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > + __ATOMIC_RELAXED) <= strd_n + 1);
> > > > buf_addr = RTE_PTR_SUB(addr,
> > > > RTE_PKTMBUF_HEADROOM);
> > > > /*
> > > > * MLX5 device doesn't use iova but it is necessary in a
> > > >
> > > > diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > > > index c02a007..467f31d 100644
> > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > > @@ -68,7 +68,7 @@ struct rxq_zip {
> > > > /* Multi-Packet RQ buffer header. */
> > > > struct mlx5_mprq_buf {
> > > > struct rte_mempool *mp;
> > > > - rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> > > > + uint16_t refcnt; /* Atomically accessed refcnt. */
> > > > uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
> > > > struct rte_mbuf_ext_shared_info shinfos[];
> > > > /*
> > > > --
> > > > 2.7.4
Sure, I'll take the patch. I was waiting for the performance impact testing, and since Slava has just confirmed it, we can take it. Thank you for your contribution.
Kindest regards,
Raslan Darawsheh
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Wednesday, September 30, 2020 3:45 PM
> To: Phil Yang <Phil.Yang@arm.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: nd <nd@arm.com>; Alexander Kozyrev <akozyrev@nvidia.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> nd <nd@arm.com>
> Subject: RE: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ
> buffer refcnt
>
> Looks good to me, and we have verified that there is no performance impact.
> Thank you.
>
> Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
>
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> > Sent: Tuesday, September 29, 2020 18:23
> > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > Cc: nd <nd@arm.com>; Alexander Kozyrev <akozyrev@nvidia.com>;
> > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org;
> > nd <nd@arm.com>
> > Subject: Re: [dpdk-dev] [PATCH v4] net/mlx5: relaxed ordering for multi-
> > packet RQ buffer refcnt
> >
> > Hi Raslan,
> >
> > It seems that there are no more comments for this patch.
> > So shall we proceed further?
> >
> > Thanks,
> > Phil Yang
> >
> > > -----Original Message-----
> > > From: Alexander Kozyrev <akozyrev@nvidia.com>
> > > Sent: Thursday, September 10, 2020 9:37 AM
> > > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Phil Yang
> > > <Phil.Yang@arm.com>; akozyrev@mellanox.com;
> rasland@mellanox.com;
> > > dev@dpdk.org
> > > Cc: Phil Yang <Phil.Yang@arm.com>; matan@mellanox.com; Shahaf
> Shuler
> > > <shahafs@mellanox.com>; viacheslavo@mellanox.com; nd
> > <nd@arm.com>; nd
> > > <nd@arm.com>
> > > Subject: RE: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ
> > > buffer refcnt
> > >
> > > > <snip>
> > > >
> > > > >
> > > > > Use c11 atomics with RELAXED ordering instead of the rte_atomic
> > > > > ops which enforce unnecessary barriers on aarch64.
> > > > >
> > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > Looks good.
> > > >
> > > > Reviewed-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> > >
> > > Acked-by: Alexander Kozyrev <akozyrev@nvidia.com>
> > >
> > > >
> > > > > ---
> > > > > v4:
> > > > > Remove the unnecessary ACQUIRE barrier in rx burst path.
> > > > > (Honnappa)
> > > > >
> > > > > v3:
> > > > > Split from the patchset:
> > > > >
> > > > > http://patchwork.dpdk.org/cover/68159/
> > > > >
> > > > > drivers/net/mlx5/mlx5_rxq.c | 2 +-
> > > > > drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> > > > > drivers/net/mlx5/mlx5_rxtx.h | 2 +-
> > > > > 3 files changed, 11 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/mlx5/mlx5_rxq.c
> > > > > b/drivers/net/mlx5/mlx5_rxq.c index
> > > > > 79eb8f8..40e0239 100644
> > > > > --- a/drivers/net/mlx5/mlx5_rxq.c
> > > > > +++ b/drivers/net/mlx5/mlx5_rxq.c
> > > > > @@ -2012,7 +2012,7 @@ mlx5_mprq_buf_init(struct rte_mempool *mp, void *opaque_arg,
> > > > >
> > > > > memset(_m, 0, sizeof(*buf));
> > > > > buf->mp = mp;
> > > > > - rte_atomic16_set(&buf->refcnt, 1);
> > > > > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > > for (j = 0; j != strd_n; ++j) {
> > > > > shinfo = &buf->shinfos[j];
> > > > > shinfo->free_cb = mlx5_mprq_buf_free_cb;
> > > > >
> > > > > diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> > > > > index 1b71e94..549477b 100644
> > > > > --- a/drivers/net/mlx5/mlx5_rxtx.c
> > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.c
> > > > > @@ -1626,10 +1626,11 @@ mlx5_mprq_buf_free_cb(void *addr __rte_unused,
> > > > > void *opaque)
> > > > > {
> > > > > struct mlx5_mprq_buf *buf = opaque;
> > > > >
> > > > > - if (rte_atomic16_read(&buf->refcnt) == 1) {
> > > > > + if (__atomic_load_n(&buf->refcnt, __ATOMIC_RELAXED) == 1) {
> > > > > rte_mempool_put(buf->mp, buf);
> > > > > - } else if (rte_atomic16_add_return(&buf->refcnt, -1) == 0) {
> > > > > - rte_atomic16_set(&buf->refcnt, 1);
> > > > > + } else if (unlikely(__atomic_sub_fetch(&buf->refcnt, 1,
> > > > > + __ATOMIC_RELAXED) == 0)) {
> > > > > + __atomic_store_n(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > > rte_mempool_put(buf->mp, buf);
> > > > > }
> > > > > }
> > > > > @@ -1709,7 +1710,8 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
> > > > >
> > > > > if (consumed_strd == strd_n) {
> > > > > /* Replace WQE only if the buffer is still in use. */
> > > > > - if (rte_atomic16_read(&buf->refcnt) > 1) {
> > > > > + if (__atomic_load_n(&buf->refcnt,
> > > > > + __ATOMIC_RELAXED) > 1) {
> > > > > mprq_buf_replace(rxq, rq_ci & wq_mask, strd_n);
> > > > > /* Release the old buffer. */
> > > > > mlx5_mprq_buf_free(buf);
> > > > > @@ -1821,9 +1823,9 @@ mlx5_rx_burst_mprq(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
> > > > > void *buf_addr;
> > > > >
> > > > > /* Increment the refcnt of the whole chunk. */
> > > > > - rte_atomic16_add_return(&buf->refcnt, 1);
> > > > > - MLX5_ASSERT((uint16_t)rte_atomic16_read(&buf->refcnt) <=
> > > > > - strd_n + 1);
> > > > > + __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
> > > > > + MLX5_ASSERT(__atomic_load_n(&buf->refcnt,
> > > > > + __ATOMIC_RELAXED) <= strd_n + 1);
> > > > > buf_addr = RTE_PTR_SUB(addr,
> > > > > RTE_PKTMBUF_HEADROOM);
> > > > > /*
> > > > > * MLX5 device doesn't use iova but it is necessary in a
> > > > >
> > > > > diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
> > > > > index c02a007..467f31d 100644
> > > > > --- a/drivers/net/mlx5/mlx5_rxtx.h
> > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.h
> > > > > @@ -68,7 +68,7 @@ struct rxq_zip {
> > > > > /* Multi-Packet RQ buffer header. */
> > > > > struct mlx5_mprq_buf {
> > > > > struct rte_mempool *mp;
> > > > > - rte_atomic16_t refcnt; /* Atomically accessed refcnt. */
> > > > > + uint16_t refcnt; /* Atomically accessed refcnt. */
> > > > > uint8_t pad[RTE_PKTMBUF_HEADROOM]; /* Headroom for the first packet. */
> > > > > struct rte_mbuf_ext_shared_info shinfos[];
> > > > > /*
> > > > > --
> > > > > 2.7.4
Hi,
> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Thursday, September 3, 2020 5:53 AM
> To: akozyrev@mellanox.com; rasland@mellanox.com; dev@dpdk.org
> Cc: Honnappa.Nagarahalli@arm.com; Phil.Yang@arm.com;
> matan@mellanox.com; Shahaf Shuler <shahafs@mellanox.com>;
> viacheslavo@mellanox.com; nd@arm.com
> Subject: [PATCH v4] net/mlx5: relaxed ordering for multi-packet RQ buffer
> refcnt
>
> Use c11 atomics with RELAXED ordering instead of the rte_atomic ops
> which enforce unnecessary barriers on aarch64.
>
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> ---
> v4:
> Remove the unnecessary ACQUIRE barrier in rx burst path. (Honnappa)
>
> v3:
> Split from the patchset:
> http://patchwork.dpdk.org/cover/68159/
>
> drivers/net/mlx5/mlx5_rxq.c | 2 +-
> drivers/net/mlx5/mlx5_rxtx.c | 16 +++++++++-------
> drivers/net/mlx5/mlx5_rxtx.h | 2 +-
> 3 files changed, 11 insertions(+), 9 deletions(-)
>
Patch applied to next-net-mlx,
Kindest regards,
Raslan Darawsheh