DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers
@ 2017-12-27  4:28 Yongseok Koh
  2017-12-27  4:28 ` [dpdk-dev] [PATCH 2/2] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
                   ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: Yongseok Koh @ 2017-12-27  4:28 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh, Thomas Speier

Instead of using system-wide 'dsb' instruction for IO barriers, 'dmb' is
sufficient and could bring better performance. Using 'dmb' with Outer
Shareable Domain option is also consistent with linux kernel.

Cc: Thomas Speier <tspeier@qti.qualcomm.com>

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Thomas Speier <tspeier@qti.qualcomm.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 0b70d6209..8dcce6054 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -58,11 +58,11 @@ extern "C" {
 
 #define rte_smp_rmb() dmb(ishld)
 
-#define rte_io_mb() rte_mb()
+#define rte_io_mb() dmb(osh)
 
-#define rte_io_wmb() rte_wmb()
+#define rte_io_wmb() dmb(oshst)
 
-#define rte_io_rmb() rte_rmb()
+#define rte_io_rmb() dmb(oshld)
 
 #ifdef __cplusplus
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH 2/2] net/mlx5: fix synchonization on polling Rx completions
  2017-12-27  4:28 [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers Yongseok Koh
@ 2017-12-27  4:28 ` Yongseok Koh
  2018-01-04 12:58 ` [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers Jerin Jacob
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2017-12-27  4:28 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh, stable

Polling a new packet is basically sensing the generation bit in a
completion entry. For some processors not having strongly-ordered memory
model, there has to be an IO memory barrier between reading the generation
bit and other fields of the entry in order to guarantee data is not stale.

Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
Cc: stable@dpdk.org

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxtx.c          |  1 +
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 53 ++++++++++++++++++++---------------
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h  |  2 +-
 3 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 28c0ad8ab..ad7545e3c 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1674,6 +1674,7 @@ mlx5_rx_poll_len(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
 			return 0;
 		++rxq->cq_ci;
 		op_own = cqe->op_own;
+		rte_io_rmb();
 		if (MLX5_CQE_FORMAT(op_own) == MLX5_COMPRESSED) {
 			volatile struct mlx5_mini_cqe8 (*mc)[8] =
 				(volatile struct mlx5_mini_cqe8 (*)[8])
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index 77ce0c3e0..39b7b1953 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -798,6 +798,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		uint16x4_t mask;
 		uint16x4_t byte_cnt;
 		uint32x4_t ptype_info, flow_tag;
+		register uint64x2_t c0, c1, c2, c3;
 		uint8_t *p0, *p1, *p2, *p3;
 		uint8_t *e0 = (void *)&elts[pos]->pkt_len;
 		uint8_t *e1 = (void *)&elts[pos + 1]->pkt_len;
@@ -814,6 +815,16 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		p1 = p0 + (pkts_n - pos > 1) * sizeof(struct mlx5_cqe);
 		p2 = p1 + (pkts_n - pos > 2) * sizeof(struct mlx5_cqe);
 		p3 = p2 + (pkts_n - pos > 3) * sizeof(struct mlx5_cqe);
+		/* B.0 (CQE 3) load a block having op_own. */
+		c3 = vld1q_u64((uint64_t *)(p3 + 48));
+		/* B.0 (CQE 2) load a block having op_own. */
+		c2 = vld1q_u64((uint64_t *)(p2 + 48));
+		/* B.0 (CQE 1) load a block having op_own. */
+		c1 = vld1q_u64((uint64_t *)(p1 + 48));
+		/* B.0 (CQE 0) load a block having op_own. */
+		c0 = vld1q_u64((uint64_t *)(p0 + 48));
+		/* Synchronize for loading the rest of blocks. */
+		rte_io_rmb();
 		/* Prefetch next 4 CQEs. */
 		if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) {
 			unsigned int next = pos + MLX5_VPMD_DESCS_PER_LOOP;
@@ -823,50 +834,46 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			rte_prefetch_non_temporal(&cq[next + 3]);
 		}
 		__asm__ volatile (
-		/* B.1 (CQE 3) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p3]] \n\t"
-		"sub %[p3], %[p3], #48 \n\t"
-		/* B.2 (CQE 3) load the rest blocks. */
+		/* B.1 (CQE 3) load the rest of blocks. */
 		"ld1 {v16.16b - v18.16b}, [%[p3]] \n\t"
+		/* B.2 (CQE 3) move the block having op_own. */
+		"mov v19.16b, %[c3].16b \n\t"
 		/* B.3 (CQE 3) extract 16B fields. */
 		"tbl v23.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 2) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
 		/* B.4 (CQE 3) adjust CRC length. */
 		"sub v23.8h, v23.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 2) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p2]] \n\t"
-		"sub %[p2], %[p2], #48 \n\t"
 		/* C.1 (CQE 3) generate final structure for mbuf. */
 		"tbl v15.16b, {v23.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 2) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
+		/* B.2 (CQE 2) move the block having op_own. */
+		"mov v19.16b, %[c2].16b \n\t"
 		/* B.3 (CQE 2) extract 16B fields. */
 		"tbl v22.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 1) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
 		/* B.4 (CQE 2) adjust CRC length. */
 		"sub v22.8h, v22.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 1) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p1]] \n\t"
-		"sub %[p1], %[p1], #48 \n\t"
 		/* C.1 (CQE 2) generate final structure for mbuf. */
 		"tbl v14.16b, {v22.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 1) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
+		/* B.2 (CQE 1) move the block having op_own. */
+		"mov v19.16b, %[c1].16b \n\t"
 		/* B.3 (CQE 1) extract 16B fields. */
 		"tbl v21.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 0) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
 		/* B.4 (CQE 1) adjust CRC length. */
 		"sub v21.8h, v21.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 0) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p0]] \n\t"
-		"sub %[p0], %[p0], #48 \n\t"
 		/* C.1 (CQE 1) generate final structure for mbuf. */
 		"tbl v13.16b, {v21.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 0) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
+		/* B.2 (CQE 0) move the block having op_own. */
+		"mov v19.16b, %[c0].16b \n\t"
+		/* A.1 load mbuf pointers. */
+		"ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
 		/* B.3 (CQE 0) extract 16B fields. */
 		"tbl v20.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
 		/* B.4 (CQE 0) adjust CRC length. */
 		"sub v20.8h, v20.8h, %[crc_adj].8h \n\t"
-		/* A.1 load mbuf pointers. */
-		"ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
 		/* D.1 extract op_own byte. */
 		"tbl %[op_own].8b, {v20.16b - v23.16b}, %[owner_shuf_m].8b \n\t"
 		/* C.2 (CQE 3) adjust flow mark. */
@@ -901,9 +908,9 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		 [byte_cnt]"=&w"(byte_cnt),
 		 [ptype_info]"=&w"(ptype_info),
 		 [flow_tag]"=&w"(flow_tag)
-		:[p3]"r"(p3 + 48), [p2]"r"(p2 + 48),
-		 [p1]"r"(p1 + 48), [p0]"r"(p0 + 48),
+		:[p3]"r"(p3), [p2]"r"(p2), [p1]"r"(p1), [p0]"r"(p0),
 		 [e3]"r"(e3), [e2]"r"(e2), [e1]"r"(e1), [e0]"r"(e0),
+		 [c3]"w"(c3), [c2]"w"(c2), [c1]"w"(c1), [c0]"w"(c0),
 		 [elts_p]"r"(elts_p),
 		 [pkts_p]"r"(pkts_p),
 		 [cqe_shuf_m]"w"(cqe_shuf_m),
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
index f25681184..3b90adffa 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
@@ -821,7 +821,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		/* B.2 copy mbuf pointers. */
 		_mm_storeu_si128((__m128i *)&pkts[pos], mbp1);
 		_mm_storeu_si128((__m128i *)&pkts[pos + 2], mbp2);
-		rte_compiler_barrier();
+		rte_io_rmb();
 		/* C.1 load remained CQE data and extract necessary fields. */
 		cqe_tmp2 = _mm_load_si128((__m128i *)&cq[pos + p3]);
 		cqe_tmp1 = _mm_load_si128((__m128i *)&cq[pos + p2]);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers
  2017-12-27  4:28 [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers Yongseok Koh
  2017-12-27  4:28 ` [dpdk-dev] [PATCH 2/2] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
@ 2018-01-04 12:58 ` Jerin Jacob
  2018-01-08  1:55 ` Jianbo Liu
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
  3 siblings, 0 replies; 49+ messages in thread
From: Jerin Jacob @ 2018-01-04 12:58 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: adrien.mazarguil, nelio.laranjeiro, jianbo.liu, dev, Thomas Speier

-----Original Message-----
> Date: Tue, 26 Dec 2017 20:28:23 -0800
> From: Yongseok Koh <yskoh@mellanox.com>
> To: adrien.mazarguil@6wind.com, nelio.laranjeiro@6wind.com,
>  jerin.jacob@caviumnetworks.com, jianbo.liu@arm.com
> CC: dev@dpdk.org, Yongseok Koh <yskoh@mellanox.com>, Thomas Speier
>  <tspeier@qti.qualcomm.com>
> Subject: [PATCH 1/2] eal/arm64: modify I/O device memory barriers
> X-Mailer: git-send-email 2.11.0
> 
> Instead of using system-wide 'dsb' instruction for IO barriers, 'dmb' is
> sufficient and could bring better performance. Using 'dmb' with Outer
> Shareable Domain option is also consistent with linux kernel.
> 
> Cc: Thomas Speier <tspeier@qti.qualcomm.com>
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> Acked-by: Thomas Speier <tspeier@qti.qualcomm.com>
> Acked-by: Shahaf Shuler <shahafs@mellanox.com>

Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>

> ---
>  lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> index 0b70d6209..8dcce6054 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> @@ -58,11 +58,11 @@ extern "C" {
>  
>  #define rte_smp_rmb() dmb(ishld)
>  
> -#define rte_io_mb() rte_mb()
> +#define rte_io_mb() dmb(osh)
>  
> -#define rte_io_wmb() rte_wmb()
> +#define rte_io_wmb() dmb(oshst)
>  
> -#define rte_io_rmb() rte_rmb()
> +#define rte_io_rmb() dmb(oshld)
>  
>  #ifdef __cplusplus
>  }
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers
  2017-12-27  4:28 [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers Yongseok Koh
  2017-12-27  4:28 ` [dpdk-dev] [PATCH 2/2] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
  2018-01-04 12:58 ` [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers Jerin Jacob
@ 2018-01-08  1:55 ` Jianbo Liu
  2018-01-16  0:42   ` Yongseok Koh
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
  3 siblings, 1 reply; 49+ messages in thread
From: Jianbo Liu @ 2018-01-08  1:55 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, dev, Thomas Speier

The 12/26/2017 20:28, Yongseok Koh wrote:
> Instead of using system-wide 'dsb' instruction for IO barriers, 'dmb' is
> sufficient and could bring better performance. Using 'dmb' with Outer
> Shareable Domain option is also consistent with linux kernel.

But in kernel dsb is used for io barriers.
https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/io.h#L109

Do you consider adding dma_*mb?
https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/barrier.h#L40

>
> Cc: Thomas Speier <tspeier@qti.qualcomm.com>
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> Acked-by: Thomas Speier <tspeier@qti.qualcomm.com>
> Acked-by: Shahaf Shuler <shahafs@mellanox.com>
> ---
>  lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> index 0b70d6209..8dcce6054 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> @@ -58,11 +58,11 @@ extern "C" {
>
>  #define rte_smp_rmb() dmb(ishld)
>
> -#define rte_io_mb() rte_mb()
> +#define rte_io_mb() dmb(osh)
>
> -#define rte_io_wmb() rte_wmb()
> +#define rte_io_wmb() dmb(oshst)
>
> -#define rte_io_rmb() rte_rmb()
> +#define rte_io_rmb() dmb(oshld)
>
>  #ifdef __cplusplus
>  }
> --
> 2.11.0
>

--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers
  2018-01-08  1:55 ` Jianbo Liu
@ 2018-01-16  0:42   ` Yongseok Koh
  0 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  0:42 UTC (permalink / raw)
  To: Jianbo Liu
  Cc: Adrien Mazarguil, Nélio Laranjeiro, jerin.jacob, dev, Thomas Speier


> On Jan 7, 2018, at 5:55 PM, Jianbo Liu <Jianbo.Liu@arm.com> wrote:
> 
> The 12/26/2017 20:28, Yongseok Koh wrote:
>> Instead of using system-wide 'dsb' instruction for IO barriers, 'dmb' is
>> sufficient and could bring better performance. Using 'dmb' with Outer
>> Shareable Domain option is also consistent with linux kernel.
> 
> But in kernel dsb is used for io barriers.
> Do you consider adding dma_*mb?

Right. I'll send out a patchset, which adds rte_dma_rmb/wmb() today.

Thanks
Yongseok

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v2 0/8] introduce DMA memory barriers
  2017-12-27  4:28 [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers Yongseok Koh
                   ` (2 preceding siblings ...)
  2018-01-08  1:55 ` Jianbo Liu
@ 2018-01-16  1:10 ` Yongseok Koh
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 1/8] eal: " Yongseok Koh
                     ` (8 more replies)
  3 siblings, 9 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  1:10 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

This patchset is to introduce DMA memory barriers, which could be more
efficient for coherent memory between I/O device and CPU, especially for
ARMv8.

Yongseok Koh (8):
  eal: introduce DMA memory barriers
  eal/x86: define DMA memory barriers
  eal/ppc64: define DMA device memory barriers
  eal/armv7: define DMA memory barriers
  eal/arm64: define DMA memory barriers
  net/mlx5: remove unnecessary memory barrier
  net/mlx5: replace IO memory barrier with DMA memory barrier
  net/mlx5: fix synchonization on polling Rx completions

 drivers/net/mlx5/mlx5_rxq.c                        |  1 -
 drivers/net/mlx5/mlx5_rxtx.c                       |  5 +-
 drivers/net/mlx5/mlx5_rxtx.h                       |  2 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h                   |  2 +-
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h              | 53 ++++++++++++----------
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h               |  2 +-
 .../common/include/arch/arm/rte_atomic_32.h        |  4 ++
 .../common/include/arch/arm/rte_atomic_64.h        |  4 ++
 .../common/include/arch/ppc_64/rte_atomic.h        |  4 ++
 .../common/include/arch/x86/rte_atomic.h           |  4 ++
 lib/librte_eal/common/include/generic/rte_atomic.h | 18 ++++++++
 11 files changed, 70 insertions(+), 29 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v2 1/8] eal: introduce DMA memory barriers
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
@ 2018-01-16  1:10   ` Yongseok Koh
  2018-01-16  2:47     ` Jianbo Liu
  2018-01-16  7:49     ` Andrew Rybchenko
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 2/8] eal/x86: define " Yongseok Koh
                     ` (7 subsequent siblings)
  8 siblings, 2 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  1:10 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
guarantee the ordering of coherent shared memory between the CPU and a DMA
capable device.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/generic/rte_atomic.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index 16af5ca57..2e0503ce6 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -98,6 +98,24 @@ static inline void rte_io_wmb(void);
  */
 static inline void rte_io_rmb(void);
 
+/**
+ * Write memory barrier for coherent memory between lcore and IO device
+ *
+ * Guarantees that the STORE operations on coherent memory that
+ * precede the rte_dma_wmb() call are visible to I/O device before the
+ * STORE operations that follow it.
+ */
+static inline void rte_dma_wmb(void);
+
+/**
+ * Read memory barrier for coherent memory between lcore and IO device
+ *
+ * Guarantees that the LOAD operations on coherent memory updated by
+ * IO device that precede the rte_dma_rmb() call are visible to CPU
+ * before the LOAD operations that follow it.
+ */
+static inline void rte_dma_rmb(void);
+
 #endif /* __DOXYGEN__ */
 
 /**
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v2 2/8] eal/x86: define DMA memory barriers
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 1/8] eal: " Yongseok Koh
@ 2018-01-16  1:10   ` Yongseok Koh
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 3/8] eal/ppc64: define DMA device " Yongseok Koh
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  1:10 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/arch/x86/rte_atomic.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
index 8469f97e1..4def21d24 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
@@ -38,6 +38,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_compiler_barrier()
 
+#define rte_dma_wmb() rte_compiler_barrier()
+
+#define rte_dma_rmb() rte_compiler_barrier()
+
 /*------------------------- 16 bit atomic operations -------------------------*/
 
 #ifndef RTE_FORCE_INTRINSICS
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v2 3/8] eal/ppc64: define DMA device memory barriers
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 1/8] eal: " Yongseok Koh
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 2/8] eal/x86: define " Yongseok Koh
@ 2018-01-16  1:10   ` Yongseok Koh
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 4/8] eal/armv7: define DMA " Yongseok Koh
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  1:10 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h b/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h
index 150810cdb..46490f2b3 100644
--- a/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h
+++ b/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h
@@ -93,6 +93,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 /*------------------------- 16 bit atomic operations -------------------------*/
 /* To be compatible with Power7, use GCC built-in functions for 16 bit
  * operations */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v2 4/8] eal/armv7: define DMA memory barriers
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
                     ` (2 preceding siblings ...)
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 3/8] eal/ppc64: define DMA device " Yongseok Koh
@ 2018-01-16  1:10   ` Yongseok Koh
  2018-01-16  2:48     ` Jianbo Liu
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 5/8] eal/arm64: " Yongseok Koh
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  1:10 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_32.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
index 14c048640..a45d5aa2a 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
@@ -79,6 +79,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v2 5/8] eal/arm64: define DMA memory barriers
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
                     ` (3 preceding siblings ...)
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 4/8] eal/armv7: define DMA " Yongseok Koh
@ 2018-01-16  1:10   ` Yongseok Koh
  2018-01-16  2:50     ` Jianbo Liu
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 6/8] net/mlx5: remove unnecessary memory barrier Yongseok Koh
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  1:10 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh, Thomas Speier

Cc: Thomas Speier <tspeier@qti.qualcomm.com>

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Thomas Speier <tspeier@qti.qualcomm.com>
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index b6bbd0b32..202abda79 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -36,6 +36,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_wmb() dmb(oshst)
+
+#define rte_dma_rmb() dmb(oshld)
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v2 6/8] net/mlx5: remove unnecessary memory barrier
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
                     ` (4 preceding siblings ...)
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 5/8] eal/arm64: " Yongseok Koh
@ 2018-01-16  1:10   ` Yongseok Koh
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 7/8] net/mlx5: replace IO memory barrier with DMA " Yongseok Koh
                     ` (2 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  1:10 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

As rte_write64() has an IO barrier, there's no need to have a barrier
before the call.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 950472754..11438c86a 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -517,7 +517,6 @@ mlx5_arm_cq(struct mlx5_rxq_data *rxq, int sq_n_rxq)
 	doorbell = (uint64_t)doorbell_hi << 32;
 	doorbell |=  rxq->cqn;
 	rxq->cq_db[MLX5_CQ_ARM_DB] = rte_cpu_to_be_32(doorbell_hi);
-	rte_wmb();
 	rte_write64(rte_cpu_to_be_64(doorbell), cq_db_reg);
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v2 7/8] net/mlx5: replace IO memory barrier with DMA memory barrier
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
                     ` (5 preceding siblings ...)
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 6/8] net/mlx5: remove unnecessary memory barrier Yongseok Koh
@ 2018-01-16  1:10   ` Yongseok Koh
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 8/8] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  1:10 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxtx.c     | 4 ++--
 drivers/net/mlx5/mlx5_rxtx.h     | 2 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 3b8f71c28..99a5f8681 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1898,9 +1898,9 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		return 0;
 	/* Update the consumer index. */
 	rxq->rq_ci = rq_ci >> sges_n;
-	rte_io_wmb();
+	rte_dma_wmb();
 	*rxq->cq_db = rte_cpu_to_be_32(rxq->cq_ci);
-	rte_io_wmb();
+	rte_dma_wmb();
 	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 #ifdef MLX5_PMD_SOFT_COUNTERS
 	/* Increment packets counter. */
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index a239642ac..480653f34 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -598,7 +598,7 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	uint64_t *dst = (uint64_t *)((uintptr_t)txq->bf_reg);
 	volatile uint64_t *src = ((volatile uint64_t *)wqe);
 
-	rte_io_wmb();
+	rte_dma_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
 	rte_wmb();
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index 7d7f016f1..9db1dddbe 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -135,7 +135,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq, uint16_t n)
 	elts_idx = rxq->rq_ci & q_mask;
 	for (i = 0; i < MLX5_VPMD_DESCS_PER_LOOP; ++i)
 		(*rxq->elts)[elts_idx + i] = &rxq->fake_mbuf;
-	rte_io_wmb();
+	rte_dma_wmb();
 	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v2 8/8] net/mlx5: fix synchonization on polling Rx completions
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
                     ` (6 preceding siblings ...)
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 7/8] net/mlx5: replace IO memory barrier with DMA " Yongseok Koh
@ 2018-01-16  1:10   ` Yongseok Koh
  2018-01-16  3:53     ` Jianbo Liu
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
  8 siblings, 1 reply; 49+ messages in thread
From: Yongseok Koh @ 2018-01-16  1:10 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh, stable

Polling a new packet is basically sensing the generation bit in a
completion entry. For some processors not having strongly-ordered memory
model, there has to be an IO memory barrier between reading the generation
bit and other fields of the entry in order to guarantee data is not stale.

Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
Cc: stable@dpdk.org

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
---
 drivers/net/mlx5/mlx5_rxtx.c          |  1 +
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 53 ++++++++++++++++++++---------------
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h  |  2 +-
 3 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 99a5f8681..8065d9d0b 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1669,6 +1669,7 @@ mlx5_rx_poll_len(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
 			return 0;
 		++rxq->cq_ci;
 		op_own = cqe->op_own;
+		rte_dma_rmb();
 		if (MLX5_CQE_FORMAT(op_own) == MLX5_COMPRESSED) {
 			volatile struct mlx5_mini_cqe8 (*mc)[8] =
 				(volatile struct mlx5_mini_cqe8 (*)[8])
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index e11565f69..29ae933e7 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -814,6 +814,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		uint16x4_t mask;
 		uint16x4_t byte_cnt;
 		uint32x4_t ptype_info, flow_tag;
+		register uint64x2_t c0, c1, c2, c3;
 		uint8_t *p0, *p1, *p2, *p3;
 		uint8_t *e0 = (void *)&elts[pos]->pkt_len;
 		uint8_t *e1 = (void *)&elts[pos + 1]->pkt_len;
@@ -830,6 +831,16 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		p1 = p0 + (pkts_n - pos > 1) * sizeof(struct mlx5_cqe);
 		p2 = p1 + (pkts_n - pos > 2) * sizeof(struct mlx5_cqe);
 		p3 = p2 + (pkts_n - pos > 3) * sizeof(struct mlx5_cqe);
+		/* B.0 (CQE 3) load a block having op_own. */
+		c3 = vld1q_u64((uint64_t *)(p3 + 48));
+		/* B.0 (CQE 2) load a block having op_own. */
+		c2 = vld1q_u64((uint64_t *)(p2 + 48));
+		/* B.0 (CQE 1) load a block having op_own. */
+		c1 = vld1q_u64((uint64_t *)(p1 + 48));
+		/* B.0 (CQE 0) load a block having op_own. */
+		c0 = vld1q_u64((uint64_t *)(p0 + 48));
+		/* Synchronize for loading the rest of blocks. */
+		rte_dma_rmb();
 		/* Prefetch next 4 CQEs. */
 		if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) {
 			unsigned int next = pos + MLX5_VPMD_DESCS_PER_LOOP;
@@ -839,50 +850,46 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 			rte_prefetch_non_temporal(&cq[next + 3]);
 		}
 		__asm__ volatile (
-		/* B.1 (CQE 3) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p3]] \n\t"
-		"sub %[p3], %[p3], #48 \n\t"
-		/* B.2 (CQE 3) load the rest blocks. */
+		/* B.1 (CQE 3) load the rest of blocks. */
 		"ld1 {v16.16b - v18.16b}, [%[p3]] \n\t"
+		/* B.2 (CQE 3) move the block having op_own. */
+		"mov v19.16b, %[c3].16b \n\t"
 		/* B.3 (CQE 3) extract 16B fields. */
 		"tbl v23.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 2) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
 		/* B.4 (CQE 3) adjust CRC length. */
 		"sub v23.8h, v23.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 2) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p2]] \n\t"
-		"sub %[p2], %[p2], #48 \n\t"
 		/* C.1 (CQE 3) generate final structure for mbuf. */
 		"tbl v15.16b, {v23.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 2) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
+		/* B.2 (CQE 2) move the block having op_own. */
+		"mov v19.16b, %[c2].16b \n\t"
 		/* B.3 (CQE 2) extract 16B fields. */
 		"tbl v22.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 1) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
 		/* B.4 (CQE 2) adjust CRC length. */
 		"sub v22.8h, v22.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 1) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p1]] \n\t"
-		"sub %[p1], %[p1], #48 \n\t"
 		/* C.1 (CQE 2) generate final structure for mbuf. */
 		"tbl v14.16b, {v22.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 1) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
+		/* B.2 (CQE 1) move the block having op_own. */
+		"mov v19.16b, %[c1].16b \n\t"
 		/* B.3 (CQE 1) extract 16B fields. */
 		"tbl v21.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 0) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
 		/* B.4 (CQE 1) adjust CRC length. */
 		"sub v21.8h, v21.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 0) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p0]] \n\t"
-		"sub %[p0], %[p0], #48 \n\t"
 		/* C.1 (CQE 1) generate final structure for mbuf. */
 		"tbl v13.16b, {v21.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 0) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
+		/* B.2 (CQE 0) move the block having op_own. */
+		"mov v19.16b, %[c0].16b \n\t"
+		/* A.1 load mbuf pointers. */
+		"ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
 		/* B.3 (CQE 0) extract 16B fields. */
 		"tbl v20.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
 		/* B.4 (CQE 0) adjust CRC length. */
 		"sub v20.8h, v20.8h, %[crc_adj].8h \n\t"
-		/* A.1 load mbuf pointers. */
-		"ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
 		/* D.1 extract op_own byte. */
 		"tbl %[op_own].8b, {v20.16b - v23.16b}, %[owner_shuf_m].8b \n\t"
 		/* C.2 (CQE 3) adjust flow mark. */
@@ -917,9 +924,9 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		 [byte_cnt]"=&w"(byte_cnt),
 		 [ptype_info]"=&w"(ptype_info),
 		 [flow_tag]"=&w"(flow_tag)
-		:[p3]"r"(p3 + 48), [p2]"r"(p2 + 48),
-		 [p1]"r"(p1 + 48), [p0]"r"(p0 + 48),
+		:[p3]"r"(p3), [p2]"r"(p2), [p1]"r"(p1), [p0]"r"(p0),
 		 [e3]"r"(e3), [e2]"r"(e2), [e1]"r"(e1), [e0]"r"(e0),
+		 [c3]"w"(c3), [c2]"w"(c2), [c1]"w"(c1), [c0]"w"(c0),
 		 [elts_p]"r"(elts_p),
 		 [pkts_p]"r"(pkts_p),
 		 [cqe_shuf_m]"w"(cqe_shuf_m),
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
index 559b0237e..6c4d1c3d5 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
@@ -833,7 +833,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		/* B.2 copy mbuf pointers. */
 		_mm_storeu_si128((__m128i *)&pkts[pos], mbp1);
 		_mm_storeu_si128((__m128i *)&pkts[pos + 2], mbp2);
-		rte_compiler_barrier();
+		rte_dma_rmb();
 		/* C.1 load remained CQE data and extract necessary fields. */
 		cqe_tmp2 = _mm_load_si128((__m128i *)&cq[pos + p3]);
 		cqe_tmp1 = _mm_load_si128((__m128i *)&cq[pos + p2]);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/8] eal: introduce DMA memory barriers
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 1/8] eal: " Yongseok Koh
@ 2018-01-16  2:47     ` Jianbo Liu
  2018-01-16  7:49     ` Andrew Rybchenko
  1 sibling, 0 replies; 49+ messages in thread
From: Jianbo Liu @ 2018-01-16  2:47 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, dev

The 01/15/2018 17:10, Yongseok Koh wrote:
> This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
> guarantee the ordering of coherent shared memory between the CPU and a DMA
> capable device.
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

Acked-by: Jianbo Liu <jianbo.liu@arm.com>

> ---
>  lib/librte_eal/common/include/generic/rte_atomic.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
> index 16af5ca57..2e0503ce6 100644
> --- a/lib/librte_eal/common/include/generic/rte_atomic.h
> +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> @@ -98,6 +98,24 @@ static inline void rte_io_wmb(void);
>   */
>  static inline void rte_io_rmb(void);
>
> +/**
> + * Write memory barrier for coherent memory between lcore and IO device
> + *
> + * Guarantees that the STORE operations on coherent memory that
> + * precede the rte_dma_wmb() call are visible to I/O device before the
> + * STORE operations that follow it.
> + */
> +static inline void rte_dma_wmb(void);
> +
> +/**
> + * Read memory barrier for coherent memory between lcore and IO device
> + *
> + * Guarantees that the LOAD operations on coherent memory updated by
> + * IO device that precede the rte_dma_rmb() call are visible to CPU
> + * before the LOAD operations that follow it.
> + */
> +static inline void rte_dma_rmb(void);
> +
>  #endif /* __DOXYGEN__ */
>
>  /**
> --
> 2.11.0
>

--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/8] eal/armv7: define DMA memory barriers
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 4/8] eal/armv7: define DMA " Yongseok Koh
@ 2018-01-16  2:48     ` Jianbo Liu
  0 siblings, 0 replies; 49+ messages in thread
From: Jianbo Liu @ 2018-01-16  2:48 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, dev

The 01/15/2018 17:10, Yongseok Koh wrote:
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>  lib/librte_eal/common/include/arch/arm/rte_atomic_32.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
> index 14c048640..a45d5aa2a 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
> @@ -79,6 +79,10 @@ extern "C" {
>
>  #define rte_io_rmb() rte_rmb()
>
> +#define rte_dma_wmb() rte_wmb()
> +
> +#define rte_dma_rmb() rte_rmb()
> +
>  #ifdef __cplusplus
>  }
>  #endif
> --
> 2.11.0
>

Acked-by: Jianbo Liu <jianbo.liu@arm.com>

--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/8] eal/arm64: define DMA memory barriers
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 5/8] eal/arm64: " Yongseok Koh
@ 2018-01-16  2:50     ` Jianbo Liu
  0 siblings, 0 replies; 49+ messages in thread
From: Jianbo Liu @ 2018-01-16  2:50 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, dev, Thomas Speier

The 01/15/2018 17:10, Yongseok Koh wrote:
> Cc: Thomas Speier <tspeier@qti.qualcomm.com>
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> Acked-by: Thomas Speier <tspeier@qti.qualcomm.com>

Acked-by: Jianbo Liu <jianbo.liu@arm.com>

> ---
>  lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> index b6bbd0b32..202abda79 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
> @@ -36,6 +36,10 @@ extern "C" {
>
>  #define rte_io_rmb() rte_rmb()
>
> +#define rte_dma_wmb() dmb(oshst)
> +
> +#define rte_dma_rmb() dmb(oshld)
> +
>  #ifdef __cplusplus
>  }
>  #endif
> --
> 2.11.0
>

--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 8/8] net/mlx5: fix synchonization on polling Rx completions
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 8/8] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
@ 2018-01-16  3:53     ` Jianbo Liu
  0 siblings, 0 replies; 49+ messages in thread
From: Jianbo Liu @ 2018-01-16  3:53 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: adrien.mazarguil, nelio.laranjeiro, jerin.jacob, dev, stable

The 01/15/2018 17:10, Yongseok Koh wrote:
> Polling a new packet is basically sensing the generation bit in a
> completion entry. For some processors not having strongly-ordered memory
> model, there has to be an IO memory barrier between reading the generation
> bit and other fields of the entry in order to guarantee data is not stale.
>
> Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
> Cc: stable@dpdk.org
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> Acked-by: Shahaf Shuler <shahafs@mellanox.com>
> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>

Acked-by: Jianbo Liu <jianbo.liu@arm.com>

> ---
>  drivers/net/mlx5/mlx5_rxtx.c          |  1 +
>  drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 53 ++++++++++++++++++++---------------
>  drivers/net/mlx5/mlx5_rxtx_vec_sse.h  |  2 +-
>  3 files changed, 32 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
> index 99a5f8681..8065d9d0b 100644
> --- a/drivers/net/mlx5/mlx5_rxtx.c
> +++ b/drivers/net/mlx5/mlx5_rxtx.c
> @@ -1669,6 +1669,7 @@ mlx5_rx_poll_len(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
>                       return 0;
>               ++rxq->cq_ci;
>               op_own = cqe->op_own;
> +             rte_dma_rmb();
>               if (MLX5_CQE_FORMAT(op_own) == MLX5_COMPRESSED) {
>                       volatile struct mlx5_mini_cqe8 (*mc)[8] =
>                               (volatile struct mlx5_mini_cqe8 (*)[8])
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> index e11565f69..29ae933e7 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> @@ -814,6 +814,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
>               uint16x4_t mask;
>               uint16x4_t byte_cnt;
>               uint32x4_t ptype_info, flow_tag;
> +             register uint64x2_t c0, c1, c2, c3;
>               uint8_t *p0, *p1, *p2, *p3;
>               uint8_t *e0 = (void *)&elts[pos]->pkt_len;
>               uint8_t *e1 = (void *)&elts[pos + 1]->pkt_len;
> @@ -830,6 +831,16 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
>               p1 = p0 + (pkts_n - pos > 1) * sizeof(struct mlx5_cqe);
>               p2 = p1 + (pkts_n - pos > 2) * sizeof(struct mlx5_cqe);
>               p3 = p2 + (pkts_n - pos > 3) * sizeof(struct mlx5_cqe);
> +             /* B.0 (CQE 3) load a block having op_own. */
> +             c3 = vld1q_u64((uint64_t *)(p3 + 48));
> +             /* B.0 (CQE 2) load a block having op_own. */
> +             c2 = vld1q_u64((uint64_t *)(p2 + 48));
> +             /* B.0 (CQE 1) load a block having op_own. */
> +             c1 = vld1q_u64((uint64_t *)(p1 + 48));
> +             /* B.0 (CQE 0) load a block having op_own. */
> +             c0 = vld1q_u64((uint64_t *)(p0 + 48));
> +             /* Synchronize for loading the rest of blocks. */
> +             rte_dma_rmb();
>               /* Prefetch next 4 CQEs. */
>               if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) {
>                       unsigned int next = pos + MLX5_VPMD_DESCS_PER_LOOP;
> @@ -839,50 +850,46 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
>                       rte_prefetch_non_temporal(&cq[next + 3]);
>               }
>               __asm__ volatile (
> -             /* B.1 (CQE 3) load a block having op_own. */
> -             "ld1 {v19.16b}, [%[p3]] \n\t"
> -             "sub %[p3], %[p3], #48 \n\t"
> -             /* B.2 (CQE 3) load the rest blocks. */
> +             /* B.1 (CQE 3) load the rest of blocks. */
>               "ld1 {v16.16b - v18.16b}, [%[p3]] \n\t"
> +             /* B.2 (CQE 3) move the block having op_own. */
> +             "mov v19.16b, %[c3].16b \n\t"
>               /* B.3 (CQE 3) extract 16B fields. */
>               "tbl v23.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
> +             /* B.1 (CQE 2) load the rest of blocks. */
> +             "ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
>               /* B.4 (CQE 3) adjust CRC length. */
>               "sub v23.8h, v23.8h, %[crc_adj].8h \n\t"
> -             /* B.1 (CQE 2) load a block having op_own. */
> -             "ld1 {v19.16b}, [%[p2]] \n\t"
> -             "sub %[p2], %[p2], #48 \n\t"
>               /* C.1 (CQE 3) generate final structure for mbuf. */
>               "tbl v15.16b, {v23.16b}, %[mb_shuf_m].16b \n\t"
> -             /* B.2 (CQE 2) load the rest blocks. */
> -             "ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
> +             /* B.2 (CQE 2) move the block having op_own. */
> +             "mov v19.16b, %[c2].16b \n\t"
>               /* B.3 (CQE 2) extract 16B fields. */
>               "tbl v22.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
> +             /* B.1 (CQE 1) load the rest of blocks. */
> +             "ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
>               /* B.4 (CQE 2) adjust CRC length. */
>               "sub v22.8h, v22.8h, %[crc_adj].8h \n\t"
> -             /* B.1 (CQE 1) load a block having op_own. */
> -             "ld1 {v19.16b}, [%[p1]] \n\t"
> -             "sub %[p1], %[p1], #48 \n\t"
>               /* C.1 (CQE 2) generate final structure for mbuf. */
>               "tbl v14.16b, {v22.16b}, %[mb_shuf_m].16b \n\t"
> -             /* B.2 (CQE 1) load the rest blocks. */
> -             "ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
> +             /* B.2 (CQE 1) move the block having op_own. */
> +             "mov v19.16b, %[c1].16b \n\t"
>               /* B.3 (CQE 1) extract 16B fields. */
>               "tbl v21.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
> +             /* B.1 (CQE 0) load the rest of blocks. */
> +             "ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
>               /* B.4 (CQE 1) adjust CRC length. */
>               "sub v21.8h, v21.8h, %[crc_adj].8h \n\t"
> -             /* B.1 (CQE 0) load a block having op_own. */
> -             "ld1 {v19.16b}, [%[p0]] \n\t"
> -             "sub %[p0], %[p0], #48 \n\t"
>               /* C.1 (CQE 1) generate final structure for mbuf. */
>               "tbl v13.16b, {v21.16b}, %[mb_shuf_m].16b \n\t"
> -             /* B.2 (CQE 0) load the rest blocks. */
> -             "ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
> +             /* B.2 (CQE 0) move the block having op_own. */
> +             "mov v19.16b, %[c0].16b \n\t"
> +             /* A.1 load mbuf pointers. */
> +             "ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
>               /* B.3 (CQE 0) extract 16B fields. */
>               "tbl v20.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
>               /* B.4 (CQE 0) adjust CRC length. */
>               "sub v20.8h, v20.8h, %[crc_adj].8h \n\t"
> -             /* A.1 load mbuf pointers. */
> -             "ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
>               /* D.1 extract op_own byte. */
>               "tbl %[op_own].8b, {v20.16b - v23.16b}, %[owner_shuf_m].8b \n\t"
>               /* C.2 (CQE 3) adjust flow mark. */
> @@ -917,9 +924,9 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
>                [byte_cnt]"=&w"(byte_cnt),
>                [ptype_info]"=&w"(ptype_info),
>                [flow_tag]"=&w"(flow_tag)
> -             :[p3]"r"(p3 + 48), [p2]"r"(p2 + 48),
> -              [p1]"r"(p1 + 48), [p0]"r"(p0 + 48),
> +             :[p3]"r"(p3), [p2]"r"(p2), [p1]"r"(p1), [p0]"r"(p0),
>                [e3]"r"(e3), [e2]"r"(e2), [e1]"r"(e1), [e0]"r"(e0),
> +              [c3]"w"(c3), [c2]"w"(c2), [c1]"w"(c1), [c0]"w"(c0),
>                [elts_p]"r"(elts_p),
>                [pkts_p]"r"(pkts_p),
>                [cqe_shuf_m]"w"(cqe_shuf_m),
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> index 559b0237e..6c4d1c3d5 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
> @@ -833,7 +833,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
>               /* B.2 copy mbuf pointers. */
>               _mm_storeu_si128((__m128i *)&pkts[pos], mbp1);
>               _mm_storeu_si128((__m128i *)&pkts[pos + 2], mbp2);
> -             rte_compiler_barrier();
> +             rte_dma_rmb();
>               /* C.1 load remained CQE data and extract necessary fields. */
>               cqe_tmp2 = _mm_load_si128((__m128i *)&cq[pos + p3]);
>               cqe_tmp1 = _mm_load_si128((__m128i *)&cq[pos + p2]);
> --
> 2.11.0
>

--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/8] eal: introduce DMA memory barriers
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 1/8] eal: " Yongseok Koh
  2018-01-16  2:47     ` Jianbo Liu
@ 2018-01-16  7:49     ` Andrew Rybchenko
  2018-01-16  9:10       ` Jianbo Liu
  1 sibling, 1 reply; 49+ messages in thread
From: Andrew Rybchenko @ 2018-01-16  7:49 UTC (permalink / raw)
  To: Yongseok Koh, adrien.mazarguil, nelio.laranjeiro, jerin.jacob,
	jianbo.liu
  Cc: dev

On 01/16/2018 04:10 AM, Yongseok Koh wrote:
> This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
> guarantee the ordering of coherent shared memory between the CPU and a DMA
> capable device.
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>   lib/librte_eal/common/include/generic/rte_atomic.h | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
>
> diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
> index 16af5ca57..2e0503ce6 100644
> --- a/lib/librte_eal/common/include/generic/rte_atomic.h
> +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> @@ -98,6 +98,24 @@ static inline void rte_io_wmb(void);
>    */
>   static inline void rte_io_rmb(void);
>   
> +/**
> + * Write memory barrier for coherent memory between lcore and IO device
> + *
> + * Guarantees that the STORE operations on coherent memory that
> + * precede the rte_dma_wmb() call are visible to I/O device before the
> + * STORE operations that follow it.
> + */
> +static inline void rte_dma_wmb(void);
> +
> +/**
> + * Read memory barrier for coherent memory between lcore and IO device
> + *
> + * Guarantees that the LOAD operations on coherent memory updated by
> + * IO device that precede the rte_dma_rmb() call are visible to CPU
> + * before the LOAD operations that follow it.
> + */
> +static inline void rte_dma_rmb(void);
> +
>   #endif /* __DOXYGEN__ */
>   
>   /**

I'm not an ARMv8 expert so, my comments could be a bit ignorant.
I'd like to understand the difference between io and added here dma 
barriers.
The difference should be clearly explained. Otherwise we'll constantly hit
on incorrect choice of barrier type.
Also I don't understand why "dma" name is chosen taking into account
that definition is bound to coherent memory between lcore and IO device.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/8] eal: introduce DMA memory barriers
  2018-01-16  7:49     ` Andrew Rybchenko
@ 2018-01-16  9:10       ` Jianbo Liu
  2018-01-17 13:46         ` Thomas Monjalon
  0 siblings, 1 reply; 49+ messages in thread
From: Jianbo Liu @ 2018-01-16  9:10 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: Yongseok Koh, adrien.mazarguil, nelio.laranjeiro, jerin.jacob, dev

The 01/16/2018 10:49, Andrew Rybchenko wrote:
> On 01/16/2018 04:10 AM, Yongseok Koh wrote:
> >This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
> >guarantee the ordering of coherent shared memory between the CPU and a DMA
> >capable device.
> >
> >Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> >---
> >  lib/librte_eal/common/include/generic/rte_atomic.h | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> >diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
> >index 16af5ca57..2e0503ce6 100644
> >--- a/lib/librte_eal/common/include/generic/rte_atomic.h
> >+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> >@@ -98,6 +98,24 @@ static inline void rte_io_wmb(void);
> >   */
> >  static inline void rte_io_rmb(void);
> >+/**
> >+ * Write memory barrier for coherent memory between lcore and IO device
> >+ *
> >+ * Guarantees that the STORE operations on coherent memory that
> >+ * precede the rte_dma_wmb() call are visible to I/O device before the
> >+ * STORE operations that follow it.
> >+ */
> >+static inline void rte_dma_wmb(void);
> >+
> >+/**
> >+ * Read memory barrier for coherent memory between lcore and IO device
> >+ *
> >+ * Guarantees that the LOAD operations on coherent memory updated by
> >+ * IO device that precede the rte_dma_rmb() call are visible to CPU
> >+ * before the LOAD operations that follow it.
> >+ */
> >+static inline void rte_dma_rmb(void);
> >+
> >  #endif /* __DOXYGEN__ */
> >  /**
>
> I'm not an ARMv8 expert so, my comments could be a bit ignorant.
> I'd like to understand the difference between io and added here dma
> barriers.
> The difference should be clearly explained. Otherwise we'll constantly hit
> on incorrect choice of barrier type.
> Also I don't understand why "dma" name is chosen taking into account
> that definition is bound to coherent memory between lcore and IO device.

A good explanation can be found here.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1077fa36f23e259858caf6f269a47393a5aff523

--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/8] eal: introduce DMA memory barriers
  2018-01-16  9:10       ` Jianbo Liu
@ 2018-01-17 13:46         ` Thomas Monjalon
  2018-01-17 18:39           ` Yongseok Koh
  0 siblings, 1 reply; 49+ messages in thread
From: Thomas Monjalon @ 2018-01-17 13:46 UTC (permalink / raw)
  To: Jianbo Liu, Andrew Rybchenko, Yongseok Koh
  Cc: dev, adrien.mazarguil, nelio.laranjeiro, jerin.jacob,
	konstantin.ananyev, bruce.richardson, Chao Zhu

16/01/2018 10:10, Jianbo Liu:
> The 01/16/2018 10:49, Andrew Rybchenko wrote:
> > On 01/16/2018 04:10 AM, Yongseok Koh wrote:
> > >This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
> > >guarantee the ordering of coherent shared memory between the CPU and a DMA
> > >capable device.
> > >
> > >Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> > >---
> > >  lib/librte_eal/common/include/generic/rte_atomic.h | 18 ++++++++++++++++++
> > >  1 file changed, 18 insertions(+)
> > >
> > >diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
> > >index 16af5ca57..2e0503ce6 100644
> > >--- a/lib/librte_eal/common/include/generic/rte_atomic.h
> > >+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
> > >@@ -98,6 +98,24 @@ static inline void rte_io_wmb(void);
> > >   */
> > >  static inline void rte_io_rmb(void);
> > >+/**
> > >+ * Write memory barrier for coherent memory between lcore and IO device
> > >+ *
> > >+ * Guarantees that the STORE operations on coherent memory that
> > >+ * precede the rte_dma_wmb() call are visible to I/O device before the
> > >+ * STORE operations that follow it.
> > >+ */
> > >+static inline void rte_dma_wmb(void);
> > >+
> > >+/**
> > >+ * Read memory barrier for coherent memory between lcore and IO device
> > >+ *
> > >+ * Guarantees that the LOAD operations on coherent memory updated by
> > >+ * IO device that precede the rte_dma_rmb() call are visible to CPU
> > >+ * before the LOAD operations that follow it.
> > >+ */
> > >+static inline void rte_dma_rmb(void);
> > >+
> > >  #endif /* __DOXYGEN__ */
> > >  /**
> >
> > I'm not an ARMv8 expert so, my comments could be a bit ignorant.
> > I'd like to understand the difference between io and added here dma
> > barriers.
> > The difference should be clearly explained. Otherwise we'll constantly hit
> > on incorrect choice of barrier type.
> > Also I don't understand why "dma" name is chosen taking into account
> > that definition is bound to coherent memory between lcore and IO device.
> 
> A good explanation can be found here.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1077fa36f23e259858caf6f269a47393a5aff523

I agree that something more is needed to explain when to use rte_io_*.
The only difference between rte_io_* and rte_dma_* is "on coherent memory".

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/8] eal: introduce DMA memory barriers
  2018-01-17 13:46         ` Thomas Monjalon
@ 2018-01-17 18:39           ` Yongseok Koh
  2018-01-18 11:56             ` Andrew Rybchenko
  0 siblings, 1 reply; 49+ messages in thread
From: Yongseok Koh @ 2018-01-17 18:39 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Jianbo Liu, Andrew Rybchenko, dev, Adrien Mazarguil,
	Nélio Laranjeiro, jerin.jacob, konstantin.ananyev,
	bruce.richardson, Chao Zhu


> On Jan 17, 2018, at 5:46 AM, Thomas Monjalon <thomas@monjalon.net> wrote:
> 
> 16/01/2018 10:10, Jianbo Liu:
>> The 01/16/2018 10:49, Andrew Rybchenko wrote:
>>> On 01/16/2018 04:10 AM, Yongseok Koh wrote:
>>>> This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
>>>> guarantee the ordering of coherent shared memory between the CPU and a DMA
>>>> capable device.
>>>> 
>>>> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
>>>> ---
>>>> lib/librte_eal/common/include/generic/rte_atomic.h | 18 ++++++++++++++++++
>>>> 1 file changed, 18 insertions(+)
>>>> 
>>>> diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
>>>> index 16af5ca57..2e0503ce6 100644
>>>> --- a/lib/librte_eal/common/include/generic/rte_atomic.h
>>>> +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
>>>> @@ -98,6 +98,24 @@ static inline void rte_io_wmb(void);
>>>>  */
>>>> static inline void rte_io_rmb(void);
>>>> +/**
>>>> + * Write memory barrier for coherent memory between lcore and IO device
>>>> + *
>>>> + * Guarantees that the STORE operations on coherent memory that
>>>> + * precede the rte_dma_wmb() call are visible to I/O device before the
>>>> + * STORE operations that follow it.
>>>> + */
>>>> +static inline void rte_dma_wmb(void);
>>>> +
>>>> +/**
>>>> + * Read memory barrier for coherent memory between lcore and IO device
>>>> + *
>>>> + * Guarantees that the LOAD operations on coherent memory updated by
>>>> + * IO device that precede the rte_dma_rmb() call are visible to CPU
>>>> + * before the LOAD operations that follow it.
>>>> + */
>>>> +static inline void rte_dma_rmb(void);
>>>> +
>>>> #endif /* __DOXYGEN__ */
>>>> /**
>>> 
>>> I'm not an ARMv8 expert so, my comments could be a bit ignorant.
>>> I'd like to understand the difference between io and added here dma
>>> barriers.
>>> The difference should be clearly explained. Otherwise we'll constantly hit
>>> on incorrect choice of barrier type.
>>> Also I don't understand why "dma" name is chosen taking into account
>>> that definition is bound to coherent memory between lcore and IO device.
>> 
>> A good explanation can be found here.
>> 
>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D1077fa36f23e259858caf6f269a47393a5aff523&data=02%7C01%7Cyskoh%40mellanox.com%7C7b526265cbf1449f3db208d55db0c55d%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636517936183877836&sdata=2%2Fi8Gs2n%2Fnbe9%2FJ3GWr22ndPmQVmvM2Xh12r3j1ZWlg%3D&reserved=0
> 
> I agree that something more is needed to explain when to use rte_io_*.
> The only difference between rte_io_* and rte_dma_* is "on coherent memory".

Okay will add more explanation and send out v3 soon. But, please note that
there's no concrete theory when to use which barrier. Actually, it is mostly
for ARMv8 because it provides more options for barriers. For other archs, as you
can see in the patches, there's no difference from IO barriers.

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/8] eal: introduce DMA memory barriers
  2018-01-17 18:39           ` Yongseok Koh
@ 2018-01-18 11:56             ` Andrew Rybchenko
  2018-01-18 18:14               ` Yongseok Koh
  0 siblings, 1 reply; 49+ messages in thread
From: Andrew Rybchenko @ 2018-01-18 11:56 UTC (permalink / raw)
  To: Yongseok Koh, Thomas Monjalon
  Cc: Jianbo Liu, dev, Adrien Mazarguil, Nélio Laranjeiro,
	jerin.jacob, konstantin.ananyev, bruce.richardson, Chao Zhu

On 01/17/2018 09:39 PM, Yongseok Koh wrote:
>> On Jan 17, 2018, at 5:46 AM, Thomas Monjalon <thomas@monjalon.net> wrote:
>>
>> 16/01/2018 10:10, Jianbo Liu:
>>> The 01/16/2018 10:49, Andrew Rybchenko wrote:
>>>> On 01/16/2018 04:10 AM, Yongseok Koh wrote:
>>>>> This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
>>>>> guarantee the ordering of coherent shared memory between the CPU and a DMA
>>>>> capable device.
>>>>>
>>>>> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
>>>>> ---
>>>>> lib/librte_eal/common/include/generic/rte_atomic.h | 18 ++++++++++++++++++
>>>>> 1 file changed, 18 insertions(+)
>>>>>
>>>>> diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
>>>>> index 16af5ca57..2e0503ce6 100644
>>>>> --- a/lib/librte_eal/common/include/generic/rte_atomic.h
>>>>> +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
>>>>> @@ -98,6 +98,24 @@ static inline void rte_io_wmb(void);
>>>>>   */
>>>>> static inline void rte_io_rmb(void);
>>>>> +/**
>>>>> + * Write memory barrier for coherent memory between lcore and IO device
>>>>> + *
>>>>> + * Guarantees that the STORE operations on coherent memory that
>>>>> + * precede the rte_dma_wmb() call are visible to I/O device before the
>>>>> + * STORE operations that follow it.
>>>>> + */
>>>>> +static inline void rte_dma_wmb(void);
>>>>> +
>>>>> +/**
>>>>> + * Read memory barrier for coherent memory between lcore and IO device
>>>>> + *
>>>>> + * Guarantees that the LOAD operations on coherent memory updated by
>>>>> + * IO device that precede the rte_dma_rmb() call are visible to CPU
>>>>> + * before the LOAD operations that follow it.
>>>>> + */
>>>>> +static inline void rte_dma_rmb(void);
>>>>> +
>>>>> #endif /* __DOXYGEN__ */
>>>>> /**
>>>> I'm not an ARMv8 expert so, my comments could be a bit ignorant.
>>>> I'd like to understand the difference between io and added here dma
>>>> barriers.
>>>> The difference should be clearly explained. Otherwise we'll constantly hit
>>>> on incorrect choice of barrier type.
>>>> Also I don't understand why "dma" name is chosen taking into account
>>>> that definition is bound to coherent memory between lcore and IO device.
>>> A good explanation can be found here.
>>>
>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D1077fa36f23e259858caf6f269a47393a5aff523&data=02%7C01%7Cyskoh%40mellanox.com%7C7b526265cbf1449f3db208d55db0c55d%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636517936183877836&sdata=2%2Fi8Gs2n%2Fnbe9%2FJ3GWr22ndPmQVmvM2Xh12r3j1ZWlg%3D&reserved=0
>> I agree that something more is needed to explain when to use rte_io_*.
>> The only difference between rte_io_* and rte_dma_* is "on coherent memory".
> Okay will add more explanation and send out v3 soon. But, please note that
> there's no concrete theory when to use which barrier. Actually, it is mostly
> for ARMv8 because it provides more options for barriers. For other archs, as you
> can see in the patches, there's no difference from IO barriers.

Absence of concrete theory does not make choice of the memory barrier 
easier.
I would say it complicates it significantly. I think it is a minimal 
requirement for
the patchset to explain why a new type should be defined instead of just
fixing of the rte_io_* barriers on ARMv8. What's the different? Which 
criteria
should be checked/taken into account to make the right choice?

As far as I can see igb_uio and uio_pci_generic do coherent DMA mapping.
It is not that easy with VFIO since in theory it could be non-coherent if
snooping is not supported by IOMMU. Don't know if it is real.
If so, it makes barrier choice UIO driver-dependent. Sounds bad.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/8] eal: introduce DMA memory barriers
  2018-01-18 11:56             ` Andrew Rybchenko
@ 2018-01-18 18:14               ` Yongseok Koh
  0 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-18 18:14 UTC (permalink / raw)
  To: Andrew Rybchenko
  Cc: Thomas Monjalon, Jianbo Liu, dev, Adrien Mazarguil,
	Nélio Laranjeiro, jerin.jacob, konstantin.ananyev,
	bruce.richardson, Chao Zhu


> On Jan 18, 2018, at 3:56 AM, Andrew Rybchenko <arybchenko@solarflare.com> wrote:
> 
> On 01/17/2018 09:39 PM, Yongseok Koh wrote:
>>> On Jan 17, 2018, at 5:46 AM, Thomas Monjalon <thomas@monjalon.net> wrote:
>>> 
>>> 16/01/2018 10:10, Jianbo Liu:
>>>> The 01/16/2018 10:49, Andrew Rybchenko wrote:
>>>>> On 01/16/2018 04:10 AM, Yongseok Koh wrote:
>>>>>> This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
>>>>>> guarantee the ordering of coherent shared memory between the CPU and a DMA
>>>>>> capable device.
>>>>>> 
>>>>>> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
>>>>>> ---
>>>>>> lib/librte_eal/common/include/generic/rte_atomic.h | 18 ++++++++++++++++++
>>>>>> 1 file changed, 18 insertions(+)
>>>>>> 
>>>>>> diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
>>>>>> index 16af5ca57..2e0503ce6 100644
>>>>>> --- a/lib/librte_eal/common/include/generic/rte_atomic.h
>>>>>> +++ b/lib/librte_eal/common/include/generic/rte_atomic.h
>>>>>> @@ -98,6 +98,24 @@ static inline void rte_io_wmb(void);
>>>>>>  */
>>>>>> static inline void rte_io_rmb(void);
>>>>>> +/**
>>>>>> + * Write memory barrier for coherent memory between lcore and IO device
>>>>>> + *
>>>>>> + * Guarantees that the STORE operations on coherent memory that
>>>>>> + * precede the rte_dma_wmb() call are visible to I/O device before the
>>>>>> + * STORE operations that follow it.
>>>>>> + */
>>>>>> +static inline void rte_dma_wmb(void);
>>>>>> +
>>>>>> +/**
>>>>>> + * Read memory barrier for coherent memory between lcore and IO device
>>>>>> + *
>>>>>> + * Guarantees that the LOAD operations on coherent memory updated by
>>>>>> + * IO device that precede the rte_dma_rmb() call are visible to CPU
>>>>>> + * before the LOAD operations that follow it.
>>>>>> + */
>>>>>> +static inline void rte_dma_rmb(void);
>>>>>> +
>>>>>> #endif /* __DOXYGEN__ */
>>>>>> /**
>>>>> I'm not an ARMv8 expert so, my comments could be a bit ignorant.
>>>>> I'd like to understand the difference between io and added here dma
>>>>> barriers.
>>>>> The difference should be clearly explained. Otherwise we'll constantly hit
>>>>> on incorrect choice of barrier type.
>>>>> Also I don't understand why "dma" name is chosen taking into account
>>>>> that definition is bound to coherent memory between lcore and IO device.
>>>> A good explanation can be found here.
>>>> 
>>>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D1077fa36f23e259858caf6f269a47393a5aff523&data=02%7C01%7Cyskoh%40mellanox.com%7C7b526265cbf1449f3db208d55db0c55d%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C636517936183877836&sdata=2%2Fi8Gs2n%2Fnbe9%2FJ3GWr22ndPmQVmvM2Xh12r3j1ZWlg%3D&reserved=0
>>> I agree that something more is needed to explain when to use rte_io_*.
>>> The only difference between rte_io_* and rte_dma_* is "on coherent memory".
>> Okay will add more explanation and send out v3 soon. But, please note that
>> there's no concrete theory when to use which barrier. Actually, it is mostly
>> for ARMv8 because it provides more options for barriers. For other archs, as you
>> can see in the patches, there's no difference from IO barriers.
> 
> Absence of concrete theory does not make choice of the memory barrier easier.

I didn't say that? I just explained it can't be super clear.

> I would say it complicates it significantly. I think it is a minimal requirement for
> the patchset to explain why a new type should be defined instead of just
> fixing of the rte_io_* barriers on ARMv8. What's the different? Which criteria
> should be checked/taken into account to make the right choice?

I already said I'll send out v3.

> As far as I can see igb_uio and uio_pci_generic do coherent DMA mapping.
> It is not that easy with VFIO since in theory it could be non-coherent if
> snooping is not supported by IOMMU. Don't know if it is real.
> If so, it makes barrier choice UIO driver-dependent. Sounds bad.

I'm not sure I understand this comment but it is because of relaxed memory
ordering model of some processors. Actually, x86 is the only processor which
has the strong memory ordering model. And the link Jianbo shared could be the
best article to understand it and I'll summarize it in my v3.

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers
  2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
                     ` (7 preceding siblings ...)
  2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 8/8] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
@ 2018-01-19  0:44   ` Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 1/8] eal: " Yongseok Koh
                       ` (8 more replies)
  8 siblings, 9 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-19  0:44 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

This patchset is to introduce DMA memory barriers, which could be more
efficient for coherent memory between I/O device and CPU, especially for
ARMv8.

v3:
* add more detailed comments about the new memory barriers.

v2:
* introduce DMA memory barriers.

Yongseok Koh (8):
  eal: introduce DMA memory barriers
  eal/x86: define DMA memory barriers
  eal/ppc64: define DMA memory barriers
  eal/armv7: define DMA memory barriers
  eal/arm64: define DMA memory barriers
  net/mlx5: remove unnecessary memory barrier
  net/mlx5: replace IO memory barrier with DMA memory barrier
  net/mlx5: fix synchonization on polling Rx completions

 drivers/net/mlx5/mlx5_rxq.c                        |  1 -
 drivers/net/mlx5/mlx5_rxtx.c                       |  5 +-
 drivers/net/mlx5/mlx5_rxtx.h                       |  2 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h                   |  2 +-
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h              | 53 ++++++++++++----------
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h               |  2 +-
 .../common/include/arch/arm/rte_atomic_32.h        |  4 ++
 .../common/include/arch/arm/rte_atomic_64.h        |  4 ++
 .../common/include/arch/ppc_64/rte_atomic.h        |  4 ++
 .../common/include/arch/x86/rte_atomic.h           |  4 ++
 lib/librte_eal/common/include/generic/rte_atomic.h | 52 +++++++++++++++++++++
 11 files changed, 104 insertions(+), 29 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v3 1/8] eal: introduce DMA memory barriers
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
@ 2018-01-19  0:44     ` Yongseok Koh
  2018-01-19  7:16       ` Andrew Rybchenko
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 2/8] eal/x86: define " Yongseok Koh
                       ` (7 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Yongseok Koh @ 2018-01-19  0:44 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
guarantee the ordering of coherent shared memory between the CPU and a DMA
capable device.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/generic/rte_atomic.h | 52 ++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index 3ba7245a3..1ffa51e31 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -98,6 +98,58 @@ static inline void rte_io_wmb(void);
  */
 static inline void rte_io_rmb(void);
 
+/**
+ * Write memory barrier for coherent memory between lcore and IO device
+ *
+ * Guarantees that the STORE operations on coherent memory that
+ * precede the rte_dma_wmb() call are visible to I/O device before the
+ * STORE operations that follow it.
+ *
+ * DMA memory barrier is a lightweight version of I/O device barriers
+ * which are system-wide data synchronization barriers. This is for
+ * only coherent memory domain between lcore and IO device but it is
+ * same as the I/O device barriers in most of architectures. However,
+ * some architecture provides even lighter barriers which are
+ * somewhere in between I/O device barriers and SMP barriers. For
+ * example, in case of ARMv8, data memory barrier can have different
+ * shareability domains - inner-shareable and outer-shareable. And
+ * inner-shareable data memory barrier fits for SMP barriers and
+ * outer-shareable one for DMA barriers, which acts on coherent
+ * memory.
+ *
+ * In most cases, I/O device barriers are safer but if operations are
+ * on coherent memory instead of incoherent MMIO region of a device,
+ * then DMA barriers can be used and this could bring performance gain
+ * depending on architectures.
+ */
+static inline void rte_dma_wmb(void);
+
+/**
+ * Read memory barrier for coherent memory between lcore and IO device
+ *
+ * Guarantees that the LOAD operations on coherent memory updated by
+ * IO device that precede the rte_dma_rmb() call are visible to CPU
+ * before the LOAD operations that follow it.
+ *
+ * DMA memory barrier is a lightweight version of I/O device barriers
+ * which are system-wide data synchronization barriers. This is for
+ * only coherent memory domain between lcore and IO device but it is
+ * same as the I/O device barriers in most of architectures. However,
+ * some architecture provides even lighter barriers which are
+ * somewhere in between I/O device barriers and SMP barriers. For
+ * example, in case of ARMv8, data memory barrier can have different
+ * shareability domains - inner-shareable and outer-shareable. And
+ * inner-shareable data memory barrier fits for SMP barriers and
+ * outer-shareable one for DMA barriers, which acts on coherent
+ * memory.
+ *
+ * In most cases, I/O device barriers are safer but if operations are
+ * on coherent memory instead of incoherent MMIO region of a device,
+ * then DMA barriers can be used and this could bring performance gain
+ * depending on architectures.
+ */
+static inline void rte_dma_rmb(void);
+
 #endif /* __DOXYGEN__ */
 
 /**
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v3 2/8] eal/x86: define DMA memory barriers
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 1/8] eal: " Yongseok Koh
@ 2018-01-19  0:44     ` Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 3/8] eal/ppc64: " Yongseok Koh
                       ` (6 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-19  0:44 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/arch/x86/rte_atomic.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
index 36cfabc38..ae41e615f 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
@@ -39,6 +39,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_compiler_barrier()
 
+#define rte_dma_wmb() rte_compiler_barrier()
+
+#define rte_dma_rmb() rte_compiler_barrier()
+
 /*------------------------- 16 bit atomic operations -------------------------*/
 
 #ifndef RTE_FORCE_INTRINSICS
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v3 3/8] eal/ppc64: define DMA memory barriers
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 1/8] eal: " Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 2/8] eal/x86: define " Yongseok Koh
@ 2018-01-19  0:44     ` Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 4/8] eal/armv7: " Yongseok Koh
                       ` (5 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-19  0:44 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h b/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h
index 150810cdb..46490f2b3 100644
--- a/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h
+++ b/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h
@@ -93,6 +93,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 /*------------------------- 16 bit atomic operations -------------------------*/
 /* To be compatible with Power7, use GCC built-in functions for 16 bit
  * operations */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v3 4/8] eal/armv7: define DMA memory barriers
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
                       ` (2 preceding siblings ...)
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 3/8] eal/ppc64: " Yongseok Koh
@ 2018-01-19  0:44     ` Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 5/8] eal/arm64: " Yongseok Koh
                       ` (4 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-19  0:44 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Jianbo Liu <jianbo.liu@arm.com>
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_32.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
index 14c048640..a45d5aa2a 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
@@ -79,6 +79,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_wmb() rte_wmb()
+
+#define rte_dma_rmb() rte_rmb()
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v3 5/8] eal/arm64: define DMA memory barriers
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
                       ` (3 preceding siblings ...)
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 4/8] eal/armv7: " Yongseok Koh
@ 2018-01-19  0:44     ` Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 6/8] net/mlx5: remove unnecessary memory barrier Yongseok Koh
                       ` (3 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-19  0:44 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh, Thomas Speier

Cc: Thomas Speier <tspeier@qti.qualcomm.com>

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Thomas Speier <tspeier@qti.qualcomm.com>
Acked-by: Jianbo Liu <jianbo.liu@arm.com>
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index b6bbd0b32..202abda79 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -36,6 +36,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_dma_wmb() dmb(oshst)
+
+#define rte_dma_rmb() dmb(oshld)
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v3 6/8] net/mlx5: remove unnecessary memory barrier
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
                       ` (4 preceding siblings ...)
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 5/8] eal/arm64: " Yongseok Koh
@ 2018-01-19  0:44     ` Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 7/8] net/mlx5: replace IO memory barrier with DMA " Yongseok Koh
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-19  0:44 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

As rte_write64() has an IO barrier, there's no need to have a barrier
before the call.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 950472754..11438c86a 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -517,7 +517,6 @@ mlx5_arm_cq(struct mlx5_rxq_data *rxq, int sq_n_rxq)
 	doorbell = (uint64_t)doorbell_hi << 32;
 	doorbell |=  rxq->cqn;
 	rxq->cq_db[MLX5_CQ_ARM_DB] = rte_cpu_to_be_32(doorbell_hi);
-	rte_wmb();
 	rte_write64(rte_cpu_to_be_64(doorbell), cq_db_reg);
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v3 7/8] net/mlx5: replace IO memory barrier with DMA memory barrier
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
                       ` (5 preceding siblings ...)
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 6/8] net/mlx5: remove unnecessary memory barrier Yongseok Koh
@ 2018-01-19  0:44     ` Yongseok Koh
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 8/8] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-19  0:44 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxtx.c     | 4 ++--
 drivers/net/mlx5/mlx5_rxtx.h     | 2 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 3b8f71c28..99a5f8681 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1898,9 +1898,9 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		return 0;
 	/* Update the consumer index. */
 	rxq->rq_ci = rq_ci >> sges_n;
-	rte_io_wmb();
+	rte_dma_wmb();
 	*rxq->cq_db = rte_cpu_to_be_32(rxq->cq_ci);
-	rte_io_wmb();
+	rte_dma_wmb();
 	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 #ifdef MLX5_PMD_SOFT_COUNTERS
 	/* Increment packets counter. */
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index a239642ac..480653f34 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -598,7 +598,7 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	uint64_t *dst = (uint64_t *)((uintptr_t)txq->bf_reg);
 	volatile uint64_t *src = ((volatile uint64_t *)wqe);
 
-	rte_io_wmb();
+	rte_dma_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
 	rte_wmb();
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index 7d7f016f1..9db1dddbe 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -135,7 +135,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq, uint16_t n)
 	elts_idx = rxq->rq_ci & q_mask;
 	for (i = 0; i < MLX5_VPMD_DESCS_PER_LOOP; ++i)
 		(*rxq->elts)[elts_idx + i] = &rxq->fake_mbuf;
-	rte_io_wmb();
+	rte_dma_wmb();
 	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v3 8/8] net/mlx5: fix synchonization on polling Rx completions
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
                       ` (6 preceding siblings ...)
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 7/8] net/mlx5: replace IO memory barrier with DMA " Yongseok Koh
@ 2018-01-19  0:44     ` Yongseok Koh
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
  8 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-19  0:44 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: dev, Yongseok Koh, stable

Polling a new packet is basically sensing the generation bit in a
completion entry. For some processors not having strongly-ordered memory
model, there has to be an IO memory barrier between reading the generation
bit and other fields of the entry in order to guarantee data is not stale.

Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
Cc: stable@dpdk.org

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
---
 drivers/net/mlx5/mlx5_rxtx.c          |  1 +
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 53 ++++++++++++++++++++---------------
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h  |  2 +-
 3 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 99a5f8681..8065d9d0b 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1669,6 +1669,7 @@ mlx5_rx_poll_len(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
 			return 0;
 		++rxq->cq_ci;
 		op_own = cqe->op_own;
+		rte_dma_rmb();
 		if (MLX5_CQE_FORMAT(op_own) == MLX5_COMPRESSED) {
 			volatile struct mlx5_mini_cqe8 (*mc)[8] =
 				(volatile struct mlx5_mini_cqe8 (*)[8])
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index e11565f69..29ae933e7 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -814,6 +814,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		uint16x4_t mask;
 		uint16x4_t byte_cnt;
 		uint32x4_t ptype_info, flow_tag;
+		register uint64x2_t c0, c1, c2, c3;
 		uint8_t *p0, *p1, *p2, *p3;
 		uint8_t *e0 = (void *)&elts[pos]->pkt_len;
 		uint8_t *e1 = (void *)&elts[pos + 1]->pkt_len;
@@ -830,6 +831,16 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		p1 = p0 + (pkts_n - pos > 1) * sizeof(struct mlx5_cqe);
 		p2 = p1 + (pkts_n - pos > 2) * sizeof(struct mlx5_cqe);
 		p3 = p2 + (pkts_n - pos > 3) * sizeof(struct mlx5_cqe);
+		/* B.0 (CQE 3) load a block having op_own. */
+		c3 = vld1q_u64((uint64_t *)(p3 + 48));
+		/* B.0 (CQE 2) load a block having op_own. */
+		c2 = vld1q_u64((uint64_t *)(p2 + 48));
+		/* B.0 (CQE 1) load a block having op_own. */
+		c1 = vld1q_u64((uint64_t *)(p1 + 48));
+		/* B.0 (CQE 0) load a block having op_own. */
+		c0 = vld1q_u64((uint64_t *)(p0 + 48));
+		/* Synchronize for loading the rest of blocks. */
+		rte_dma_rmb();
 		/* Prefetch next 4 CQEs. */
 		if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) {
 			unsigned int next = pos + MLX5_VPMD_DESCS_PER_LOOP;
@@ -839,50 +850,46 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 			rte_prefetch_non_temporal(&cq[next + 3]);
 		}
 		__asm__ volatile (
-		/* B.1 (CQE 3) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p3]] \n\t"
-		"sub %[p3], %[p3], #48 \n\t"
-		/* B.2 (CQE 3) load the rest blocks. */
+		/* B.1 (CQE 3) load the rest of blocks. */
 		"ld1 {v16.16b - v18.16b}, [%[p3]] \n\t"
+		/* B.2 (CQE 3) move the block having op_own. */
+		"mov v19.16b, %[c3].16b \n\t"
 		/* B.3 (CQE 3) extract 16B fields. */
 		"tbl v23.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 2) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
 		/* B.4 (CQE 3) adjust CRC length. */
 		"sub v23.8h, v23.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 2) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p2]] \n\t"
-		"sub %[p2], %[p2], #48 \n\t"
 		/* C.1 (CQE 3) generate final structure for mbuf. */
 		"tbl v15.16b, {v23.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 2) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
+		/* B.2 (CQE 2) move the block having op_own. */
+		"mov v19.16b, %[c2].16b \n\t"
 		/* B.3 (CQE 2) extract 16B fields. */
 		"tbl v22.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 1) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
 		/* B.4 (CQE 2) adjust CRC length. */
 		"sub v22.8h, v22.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 1) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p1]] \n\t"
-		"sub %[p1], %[p1], #48 \n\t"
 		/* C.1 (CQE 2) generate final structure for mbuf. */
 		"tbl v14.16b, {v22.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 1) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
+		/* B.2 (CQE 1) move the block having op_own. */
+		"mov v19.16b, %[c1].16b \n\t"
 		/* B.3 (CQE 1) extract 16B fields. */
 		"tbl v21.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 0) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
 		/* B.4 (CQE 1) adjust CRC length. */
 		"sub v21.8h, v21.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 0) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p0]] \n\t"
-		"sub %[p0], %[p0], #48 \n\t"
 		/* C.1 (CQE 1) generate final structure for mbuf. */
 		"tbl v13.16b, {v21.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 0) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
+		/* B.2 (CQE 0) move the block having op_own. */
+		"mov v19.16b, %[c0].16b \n\t"
+		/* A.1 load mbuf pointers. */
+		"ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
 		/* B.3 (CQE 0) extract 16B fields. */
 		"tbl v20.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
 		/* B.4 (CQE 0) adjust CRC length. */
 		"sub v20.8h, v20.8h, %[crc_adj].8h \n\t"
-		/* A.1 load mbuf pointers. */
-		"ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
 		/* D.1 extract op_own byte. */
 		"tbl %[op_own].8b, {v20.16b - v23.16b}, %[owner_shuf_m].8b \n\t"
 		/* C.2 (CQE 3) adjust flow mark. */
@@ -917,9 +924,9 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		 [byte_cnt]"=&w"(byte_cnt),
 		 [ptype_info]"=&w"(ptype_info),
 		 [flow_tag]"=&w"(flow_tag)
-		:[p3]"r"(p3 + 48), [p2]"r"(p2 + 48),
-		 [p1]"r"(p1 + 48), [p0]"r"(p0 + 48),
+		:[p3]"r"(p3), [p2]"r"(p2), [p1]"r"(p1), [p0]"r"(p0),
 		 [e3]"r"(e3), [e2]"r"(e2), [e1]"r"(e1), [e0]"r"(e0),
+		 [c3]"w"(c3), [c2]"w"(c2), [c1]"w"(c1), [c0]"w"(c0),
 		 [elts_p]"r"(elts_p),
 		 [pkts_p]"r"(pkts_p),
 		 [cqe_shuf_m]"w"(cqe_shuf_m),
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
index 559b0237e..6c4d1c3d5 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
@@ -833,7 +833,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		/* B.2 copy mbuf pointers. */
 		_mm_storeu_si128((__m128i *)&pkts[pos], mbp1);
 		_mm_storeu_si128((__m128i *)&pkts[pos + 2], mbp2);
-		rte_compiler_barrier();
+		rte_dma_rmb();
 		/* C.1 load remained CQE data and extract necessary fields. */
 		cqe_tmp2 = _mm_load_si128((__m128i *)&cq[pos + p3]);
 		cqe_tmp1 = _mm_load_si128((__m128i *)&cq[pos + p2]);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/8] eal: introduce DMA memory barriers
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 1/8] eal: " Yongseok Koh
@ 2018-01-19  7:16       ` Andrew Rybchenko
  2018-01-22 18:29         ` Yongseok Koh
  0 siblings, 1 reply; 49+ messages in thread
From: Andrew Rybchenko @ 2018-01-19  7:16 UTC (permalink / raw)
  To: Yongseok Koh, adrien.mazarguil, nelio.laranjeiro,
	bruce.richardson, konstantin.ananyev, chaozhu, jerin.jacob,
	jianbo.liu
  Cc: dev

On 01/19/2018 03:44 AM, Yongseok Koh wrote:
> This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
> guarantee the ordering of coherent shared memory between the CPU and a DMA
> capable device.
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>

It is already really good. Many thanks.

Maybe it would be useful to:
  - avoid duplication of so long explanations (put in in one place and 
add reference?)
  - explain why it is bound to DMA or call it in a different way, since 
right now it is bound
    to coherent-mapped IO (rte_cio_rmb() ?). Yes, I see benefits to 
follow Linux
    terminology, but may be DPDK can do better :) I just add my 
concerns, but let
    EAL code maintainers to decide
  - as I understand right now there is no control over DMA mapping since
    mapping is done by UIO/VFIO drivers. Should documentation be updated 
that
    DMA mapping is assumed to be coherent?
  - when it is applied, may be it makes sense to send HEADS UP to dev@,
    it definitely deserves to be mentioned in the release notes

<...>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/8] eal: introduce DMA memory barriers
  2018-01-19  7:16       ` Andrew Rybchenko
@ 2018-01-22 18:29         ` Yongseok Koh
  2018-01-22 20:59           ` Thomas Monjalon
  2018-01-23  4:35           ` Jerin Jacob
  0 siblings, 2 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-22 18:29 UTC (permalink / raw)
  To: Andrew Rybchenko, Thomas Monjalon, jianbo.liu, Jerin Jacob
  Cc: Adrien Mazarguil, Nélio Laranjeiro, bruce.richardson,
	Ananyev, Konstantin, Chao Zhu, dev


> On Jan 18, 2018, at 11:16 PM, Andrew Rybchenko <arybchenko@solarflare.com> wrote:
> 
> On 01/19/2018 03:44 AM, Yongseok Koh wrote:
>> This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
>> guarantee the ordering of coherent shared memory between the CPU and a DMA
>> capable device.
>> 
>> Signed-off-by: Yongseok Koh 
>> <yskoh@mellanox.com>
> 
> Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
> 
> It is already really good. Many thanks.

Thank you!

> Maybe it would be useful to:
>  - avoid duplication of so long explanations (put in in one place and add reference?)

May have to ask Thomas how to do this. Thomas?

>  - explain why it is bound to DMA or call it in a different way, since right now it is bound
>    to coherent-mapped IO (rte_cio_rmb() ?). Yes, I see benefits to follow Linux
>    terminology, but may be DPDK can do better :) I just add my concerns, but let
>    EAL code maintainers to decide

Good idea. Like to hear from other people. But, following linux terms sometime
could be good to welcome developers from kernel community to DPDK world. :-)

To people in the cc list, any other concerns?
Especially ARM users - Jianbo and Jerin?

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/8] eal: introduce DMA memory barriers
  2018-01-22 18:29         ` Yongseok Koh
@ 2018-01-22 20:59           ` Thomas Monjalon
  2018-01-23  4:35           ` Jerin Jacob
  1 sibling, 0 replies; 49+ messages in thread
From: Thomas Monjalon @ 2018-01-22 20:59 UTC (permalink / raw)
  To: Yongseok Koh, Andrew Rybchenko
  Cc: dev, jianbo.liu, Jerin Jacob, Adrien Mazarguil,
	Nélio Laranjeiro, bruce.richardson, Ananyev, Konstantin,
	Chao Zhu

22/01/2018 19:29, Yongseok Koh:
> > On Jan 18, 2018, at 11:16 PM, Andrew Rybchenko <arybchenko@solarflare.com> wrote:
> > Maybe it would be useful to:
> >  - avoid duplication of so long explanations (put in in one place and add reference?)
> 
> May have to ask Thomas how to do this. Thomas?

You can group barriers by type (SMP, IO, DMA) with this doxygen syntax:
	https://www.stack.nl/~dimitri/doxygen/manual/grouping.html#memgroup
So you can have a common description of the group, plus a description
of each function.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/8] eal: introduce DMA memory barriers
  2018-01-22 18:29         ` Yongseok Koh
  2018-01-22 20:59           ` Thomas Monjalon
@ 2018-01-23  4:35           ` Jerin Jacob
  2018-01-25 19:08             ` Yongseok Koh
  1 sibling, 1 reply; 49+ messages in thread
From: Jerin Jacob @ 2018-01-23  4:35 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: Andrew Rybchenko, Thomas Monjalon, jianbo.liu, Adrien Mazarguil,
	Nélio Laranjeiro, bruce.richardson, Ananyev, Konstantin,
	Chao Zhu, dev

-----Original Message-----
> Date: Mon, 22 Jan 2018 18:29:31 +0000
> From: Yongseok Koh <yskoh@mellanox.com>
> To: Andrew Rybchenko <arybchenko@solarflare.com>, Thomas Monjalon
>  <thomas@monjalon.net>, "jianbo.liu@arm.com" <jianbo.liu@arm.com>, Jerin
>  Jacob <jerin.jacob@caviumnetworks.com>
> CC: Adrien Mazarguil <adrien.mazarguil@6wind.com>, Nélio Laranjeiro
>  <nelio.laranjeiro@6wind.com>, "bruce.richardson@intel.com"
>  <bruce.richardson@intel.com>, "Ananyev, Konstantin"
>  <konstantin.ananyev@intel.com>, Chao Zhu <chaozhu@linux.vnet.ibm.com>,
>  "dev@dpdk.org" <dev@dpdk.org>
> Subject: Re: [dpdk-dev] [PATCH v3 1/8] eal: introduce DMA memory barriers
> 
> 
> > On Jan 18, 2018, at 11:16 PM, Andrew Rybchenko <arybchenko@solarflare.com> wrote:
> > 
> > On 01/19/2018 03:44 AM, Yongseok Koh wrote:
> >> This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
> >> guarantee the ordering of coherent shared memory between the CPU and a DMA
> >> capable device.
> >> 
> >> Signed-off-by: Yongseok Koh 
> >> <yskoh@mellanox.com>
> > 
> > Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
> > 
> > It is already really good. Many thanks.
> 
> Thank you!
> 
> > Maybe it would be useful to:
> >  - avoid duplication of so long explanations (put in in one place and add reference?)
> 
> May have to ask Thomas how to do this. Thomas?
> 
> >  - explain why it is bound to DMA or call it in a different way, since right now it is bound
> >    to coherent-mapped IO (rte_cio_rmb() ?). Yes, I see benefits to follow Linux
> >    terminology, but may be DPDK can do better :) I just add my concerns, but let
> >    EAL code maintainers to decide
> 
> Good idea. Like to hear from other people. But, following linux terms sometime
> could be good to welcome developers from kernel community to DPDK world. :-)
> 
> To people in the cc list, any other concerns?
> Especially ARM users - Jianbo and Jerin?

I like Andrew's suggestion. IMO, rte_cio_?mb() makes more sense.

> 
> Thanks,
> Yongseok

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/8] eal: introduce DMA memory barriers
  2018-01-23  4:35           ` Jerin Jacob
@ 2018-01-25 19:08             ` Yongseok Koh
  0 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 19:08 UTC (permalink / raw)
  To: Jerin Jacob, Andrew Rybchenko
  Cc: Thomas Monjalon, jianbo.liu, Adrien Mazarguil,
	Nélio Laranjeiro, bruce.richardson, Ananyev, Konstantin,
	Chao Zhu, dev


> On Jan 22, 2018, at 8:35 PM, Jerin Jacob <jerin.jacob@caviumnetworks.com> wrote:
> 
> -----Original Message-----
>> Date: Mon, 22 Jan 2018 18:29:31 +0000
>> From: Yongseok Koh <yskoh@mellanox.com>
>> To: Andrew Rybchenko <arybchenko@solarflare.com>, Thomas Monjalon
>> <thomas@monjalon.net>, "jianbo.liu@arm.com" <jianbo.liu@arm.com>, Jerin
>> Jacob <jerin.jacob@caviumnetworks.com>
>> CC: Adrien Mazarguil <adrien.mazarguil@6wind.com>, Nélio Laranjeiro
>> <nelio.laranjeiro@6wind.com>, "bruce.richardson@intel.com"
>> <bruce.richardson@intel.com>, "Ananyev, Konstantin"
>> <konstantin.ananyev@intel.com>, Chao Zhu <chaozhu@linux.vnet.ibm.com>,
>> "dev@dpdk.org" <dev@dpdk.org>
>> Subject: Re: [dpdk-dev] [PATCH v3 1/8] eal: introduce DMA memory barriers
>> 
>> 
>>> On Jan 18, 2018, at 11:16 PM, Andrew Rybchenko <arybchenko@solarflare.com> wrote:
>>> 
>>> On 01/19/2018 03:44 AM, Yongseok Koh wrote:
>>>> This commit introduces rte_dma_wmb() and rte_dma_rmb(), in order to
>>>> guarantee the ordering of coherent shared memory between the CPU and a DMA
>>>> capable device.
>>>> 
>>>> Signed-off-by: Yongseok Koh 
>>>> <yskoh@mellanox.com>
>>> 
>>> Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
>>> 
>>> It is already really good. Many thanks.
>> 
>> Thank you!
>> 
>>> Maybe it would be useful to:
>>> - avoid duplication of so long explanations (put in in one place and add reference?)
>> 
>> May have to ask Thomas how to do this. Thomas?
>> 
>>> - explain why it is bound to DMA or call it in a different way, since right now it is bound
>>>   to coherent-mapped IO (rte_cio_rmb() ?). Yes, I see benefits to follow Linux
>>>   terminology, but may be DPDK can do better :) I just add my concerns, but let
>>>   EAL code maintainers to decide
>> 
>> Good idea. Like to hear from other people. But, following linux terms sometime
>> could be good to welcome developers from kernel community to DPDK world. :-)
>> 
>> To people in the cc list, any other concerns?
>> Especially ARM users - Jianbo and Jerin?
> 
> I like Andrew's suggestion. IMO, rte_cio_?mb() makes more sense.

If there's no more suggestion or objection, will send out v3 with the changes requested here by Andrew.

Thank you for reviews and comments.

Yongseok


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers
  2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
                       ` (7 preceding siblings ...)
  2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 8/8] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
@ 2018-01-25 21:02     ` Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 1/9] eal: add Doxygen grouping for " Yongseok Koh
                         ` (9 more replies)
  8 siblings, 10 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh

This patchset is to introduce coherent I/O memory barriers, which could be more
efficient for coherent memory between I/O device and CPU, especially for ARMv8.

v4:
* rename barriers to "coherent I/O memory barrier".
* Make groups for various barriers in Doxygen doc.

v3:
* add more detailed comments about the new memory barriers.

v2:
* introduce DMA memory barriers.

Yongseok Koh (9):
  eal: add Doxygen grouping for memory barriers
  eal: introduce coherent I/O memory barriers
  eal/x86: define coherent I/O memory barriers
  eal/ppc64: define coherent I/O memory barriers
  eal/armv7: define coherent I/O memory barriers
  eal/arm64: define coherent I/O memory barriers
  net/mlx5: remove unnecessary memory barrier
  net/mlx5: replace I/O memory barrier with coherent version
  net/mlx5: fix synchronization on polling Rx completions

 drivers/net/mlx5/mlx5_rxq.c                        |  1 -
 drivers/net/mlx5/mlx5_rxtx.c                       |  5 +-
 drivers/net/mlx5/mlx5_rxtx.h                       |  2 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h                   |  2 +-
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h              | 53 ++++++++++++----------
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h               |  2 +-
 .../common/include/arch/arm/rte_atomic_32.h        |  4 ++
 .../common/include/arch/arm/rte_atomic_64.h        |  4 ++
 .../common/include/arch/ppc_64/rte_atomic.h        |  4 ++
 .../common/include/arch/x86/rte_atomic.h           |  4 ++
 lib/librte_eal/common/include/generic/rte_atomic.h | 51 +++++++++++++++++++++
 11 files changed, 103 insertions(+), 29 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 1/9] eal: add Doxygen grouping for memory barriers
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
@ 2018-01-25 21:02       ` Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 2/9] eal: introduce coherent I/O " Yongseok Koh
                         ` (8 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/generic/rte_atomic.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index 3ba7245a3..58c40489b 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -17,6 +17,9 @@
 
 #ifdef __DOXYGEN__
 
+/** @name Memory Barrier
+ */
+///@{
 /**
  * General memory barrier.
  *
@@ -43,7 +46,11 @@ static inline void rte_wmb(void);
  * This function is architecture dependent.
  */
 static inline void rte_rmb(void);
+///@}
 
+/** @name SMP Memory Barrier
+ */
+///@{
 /**
  * General memory barrier between lcores
  *
@@ -70,7 +77,11 @@ static inline void rte_smp_wmb(void);
  * before the LOAD operations that follows it.
  */
 static inline void rte_smp_rmb(void);
+///@}
 
+/** @name I/O Memory Barrier
+ */
+///@{
 /**
  * General memory barrier for I/O device
  *
@@ -97,6 +108,7 @@ static inline void rte_io_wmb(void);
  * operations that follow it.
  */
 static inline void rte_io_rmb(void);
+///@}
 
 #endif /* __DOXYGEN__ */
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 2/9] eal: introduce coherent I/O memory barriers
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 1/9] eal: add Doxygen grouping for " Yongseok Koh
@ 2018-01-25 21:02       ` Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 3/9] eal/x86: define " Yongseok Koh
                         ` (7 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh

This commit introduces rte_cio_wmb() and rte_cio_rmb(), in order to
guarantee the ordering of coherent shared memory between the CPU and a DMA
capable device.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
---
 lib/librte_eal/common/include/generic/rte_atomic.h | 39 ++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index 58c40489b..50e1b8a4d 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -110,6 +110,45 @@ static inline void rte_io_wmb(void);
 static inline void rte_io_rmb(void);
 ///@}
 
+/** @name Coherent I/O Memory Barrier
+ *
+ * Coherent I/O memory barrier is a lightweight version of I/O memory
+ * barriers which are system-wide data synchronization barriers. This
+ * is for only coherent memory domain between lcore and I/O device but
+ * it is same as the I/O memory barriers in most of architectures.
+ * However, some architecture provides even lighter barriers which are
+ * somewhere in between I/O memory barriers and SMP memory barriers.
+ * For example, in case of ARMv8, DMB(data memory barrier) instruction
+ * can have different shareability domains - inner-shareable and
+ * outer-shareable. And inner-shareable DMB fits for SMP memory
+ * barriers and outer-shareable DMB for coherent I/O memory barriers,
+ * which acts on coherent memory.
+ *
+ * In most cases, I/O memory barriers are safer but if operations are
+ * on coherent memory instead of incoherent MMIO region of a device,
+ * then coherent I/O memory barriers can be used and this could bring
+ * performance gain depending on architectures.
+ */
+///@{
+/**
+ * Write memory barrier for coherent memory between lcore and I/O device
+ *
+ * Guarantees that the STORE operations on coherent memory that
+ * precede the rte_cio_wmb() call are visible to I/O device before the
+ * STORE operations that follow it.
+ */
+static inline void rte_cio_wmb(void);
+
+/**
+ * Read memory barrier for coherent memory between lcore and I/O device
+ *
+ * Guarantees that the LOAD operations on coherent memory updated by
+ * I/O device that precede the rte_cio_rmb() call are visible to CPU
+ * before the LOAD operations that follow it.
+ */
+static inline void rte_cio_rmb(void);
+///@}
+
 #endif /* __DOXYGEN__ */
 
 /**
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 3/9] eal/x86: define coherent I/O memory barriers
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 1/9] eal: add Doxygen grouping for " Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 2/9] eal: introduce coherent I/O " Yongseok Koh
@ 2018-01-25 21:02       ` Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 4/9] eal/ppc64: " Yongseok Koh
                         ` (6 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/arch/x86/rte_atomic.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic.h b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
index 36cfabc38..8fb796c63 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic.h
@@ -39,6 +39,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_compiler_barrier()
 
+#define rte_cio_wmb() rte_compiler_barrier()
+
+#define rte_cio_rmb() rte_compiler_barrier()
+
 /*------------------------- 16 bit atomic operations -------------------------*/
 
 #ifndef RTE_FORCE_INTRINSICS
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 4/9] eal/ppc64: define coherent I/O memory barriers
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
                         ` (2 preceding siblings ...)
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 3/9] eal/x86: define " Yongseok Koh
@ 2018-01-25 21:02       ` Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 5/9] eal/armv7: " Yongseok Koh
                         ` (5 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h b/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h
index 150810cdb..f38618f90 100644
--- a/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h
+++ b/lib/librte_eal/common/include/arch/ppc_64/rte_atomic.h
@@ -93,6 +93,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_cio_wmb() rte_wmb()
+
+#define rte_cio_rmb() rte_rmb()
+
 /*------------------------- 16 bit atomic operations -------------------------*/
 /* To be compatible with Power7, use GCC built-in functions for 16 bit
  * operations */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 5/9] eal/armv7: define coherent I/O memory barriers
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
                         ` (3 preceding siblings ...)
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 4/9] eal/ppc64: " Yongseok Koh
@ 2018-01-25 21:02       ` Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 6/9] eal/arm64: " Yongseok Koh
                         ` (4 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Jianbo Liu <jianbo.liu@arm.com>
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_32.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
index 14c048640..d2b7fa20f 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_32.h
@@ -79,6 +79,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_cio_wmb() rte_wmb()
+
+#define rte_cio_rmb() rte_rmb()
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 6/9] eal/arm64: define coherent I/O memory barriers
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
                         ` (4 preceding siblings ...)
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 5/9] eal/armv7: " Yongseok Koh
@ 2018-01-25 21:02       ` Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 7/9] net/mlx5: remove unnecessary memory barrier Yongseok Koh
                         ` (3 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh, Thomas Speier

Cc: Thomas Speier <tspeier@qti.qualcomm.com>

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Thomas Speier <tspeier@qti.qualcomm.com>
Acked-by: Jianbo Liu <jianbo.liu@arm.com>
---
 lib/librte_eal/common/include/arch/arm/rte_atomic_64.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index b6bbd0b32..ee0d0d15a 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -36,6 +36,10 @@ extern "C" {
 
 #define rte_io_rmb() rte_rmb()
 
+#define rte_cio_wmb() dmb(oshst)
+
+#define rte_cio_rmb() dmb(oshld)
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 7/9] net/mlx5: remove unnecessary memory barrier
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
                         ` (5 preceding siblings ...)
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 6/9] eal/arm64: " Yongseok Koh
@ 2018-01-25 21:02       ` Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 8/9] net/mlx5: replace I/O memory barrier with coherent version Yongseok Koh
                         ` (2 subsequent siblings)
  9 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh

As rte_write64() has an IO barrier, there's no need to have a barrier
before the call.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 3c716b960..ca6cd3ae6 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -517,7 +517,6 @@ mlx5_arm_cq(struct mlx5_rxq_data *rxq, int sq_n_rxq)
 	doorbell = (uint64_t)doorbell_hi << 32;
 	doorbell |=  rxq->cqn;
 	rxq->cq_db[MLX5_CQ_ARM_DB] = rte_cpu_to_be_32(doorbell_hi);
-	rte_wmb();
 	rte_write64(rte_cpu_to_be_64(doorbell), cq_db_reg);
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 8/9] net/mlx5: replace I/O memory barrier with coherent version
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
                         ` (6 preceding siblings ...)
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 7/9] net/mlx5: remove unnecessary memory barrier Yongseok Koh
@ 2018-01-25 21:02       ` Yongseok Koh
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 9/9] net/mlx5: fix synchronization on polling Rx completions Yongseok Koh
  2018-01-28  7:32       ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Thomas Monjalon
  9 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---
 drivers/net/mlx5/mlx5_rxtx.c     | 4 ++--
 drivers/net/mlx5/mlx5_rxtx.h     | 2 +-
 drivers/net/mlx5/mlx5_rxtx_vec.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 3b8f71c28..7a24d671d 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1898,9 +1898,9 @@ mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		return 0;
 	/* Update the consumer index. */
 	rxq->rq_ci = rq_ci >> sges_n;
-	rte_io_wmb();
+	rte_cio_wmb();
 	*rxq->cq_db = rte_cpu_to_be_32(rxq->cq_ci);
-	rte_io_wmb();
+	rte_cio_wmb();
 	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 #ifdef MLX5_PMD_SOFT_COUNTERS
 	/* Increment packets counter. */
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index d4738b14c..30dfaf359 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -603,7 +603,7 @@ mlx5_tx_dbrec_cond_wmb(struct mlx5_txq_data *txq, volatile struct mlx5_wqe *wqe,
 	uint64_t *dst = (uint64_t *)((uintptr_t)txq->bf_reg);
 	volatile uint64_t *src = ((volatile uint64_t *)wqe);
 
-	rte_io_wmb();
+	rte_cio_wmb();
 	*txq->qp_db = rte_cpu_to_be_32(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
 	rte_wmb();
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.h b/drivers/net/mlx5/mlx5_rxtx_vec.h
index 7d7f016f1..be133a481 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.h
@@ -135,7 +135,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq, uint16_t n)
 	elts_idx = rxq->rq_ci & q_mask;
 	for (i = 0; i < MLX5_VPMD_DESCS_PER_LOOP; ++i)
 		(*rxq->elts)[elts_idx + i] = &rxq->fake_mbuf;
-	rte_io_wmb();
+	rte_cio_wmb();
 	*rxq->rq_db = rte_cpu_to_be_32(rxq->rq_ci);
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [dpdk-dev] [PATCH v4 9/9] net/mlx5: fix synchronization on polling Rx completions
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
                         ` (7 preceding siblings ...)
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 8/9] net/mlx5: replace I/O memory barrier with coherent version Yongseok Koh
@ 2018-01-25 21:02       ` Yongseok Koh
  2018-01-28  7:32       ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Thomas Monjalon
  9 siblings, 0 replies; 49+ messages in thread
From: Yongseok Koh @ 2018-01-25 21:02 UTC (permalink / raw)
  To: adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu
  Cc: arybchenko, dev, Yongseok Koh, stable

Polling a new packet is basically sensing the generation bit in a
completion entry. For some processors not having strongly-ordered memory
model, there has to be a memory barrier between reading the generation bit
and other fields of the entry in order to guarantee data is not stale.

Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
Cc: stable@dpdk.org

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Shahaf Shuler <shahafs@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
---
 drivers/net/mlx5/mlx5_rxtx.c          |  1 +
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 53 ++++++++++++++++++++---------------
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h  |  2 +-
 3 files changed, 32 insertions(+), 24 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 7a24d671d..8e46361d7 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -1669,6 +1669,7 @@ mlx5_rx_poll_len(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
 			return 0;
 		++rxq->cq_ci;
 		op_own = cqe->op_own;
+		rte_cio_rmb();
 		if (MLX5_CQE_FORMAT(op_own) == MLX5_COMPRESSED) {
 			volatile struct mlx5_mini_cqe8 (*mc)[8] =
 				(volatile struct mlx5_mini_cqe8 (*)[8])
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index e11565f69..29ecedada 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -814,6 +814,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		uint16x4_t mask;
 		uint16x4_t byte_cnt;
 		uint32x4_t ptype_info, flow_tag;
+		register uint64x2_t c0, c1, c2, c3;
 		uint8_t *p0, *p1, *p2, *p3;
 		uint8_t *e0 = (void *)&elts[pos]->pkt_len;
 		uint8_t *e1 = (void *)&elts[pos + 1]->pkt_len;
@@ -830,6 +831,16 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		p1 = p0 + (pkts_n - pos > 1) * sizeof(struct mlx5_cqe);
 		p2 = p1 + (pkts_n - pos > 2) * sizeof(struct mlx5_cqe);
 		p3 = p2 + (pkts_n - pos > 3) * sizeof(struct mlx5_cqe);
+		/* B.0 (CQE 3) load a block having op_own. */
+		c3 = vld1q_u64((uint64_t *)(p3 + 48));
+		/* B.0 (CQE 2) load a block having op_own. */
+		c2 = vld1q_u64((uint64_t *)(p2 + 48));
+		/* B.0 (CQE 1) load a block having op_own. */
+		c1 = vld1q_u64((uint64_t *)(p1 + 48));
+		/* B.0 (CQE 0) load a block having op_own. */
+		c0 = vld1q_u64((uint64_t *)(p0 + 48));
+		/* Synchronize for loading the rest of blocks. */
+		rte_cio_rmb();
 		/* Prefetch next 4 CQEs. */
 		if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) {
 			unsigned int next = pos + MLX5_VPMD_DESCS_PER_LOOP;
@@ -839,50 +850,46 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 			rte_prefetch_non_temporal(&cq[next + 3]);
 		}
 		__asm__ volatile (
-		/* B.1 (CQE 3) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p3]] \n\t"
-		"sub %[p3], %[p3], #48 \n\t"
-		/* B.2 (CQE 3) load the rest blocks. */
+		/* B.1 (CQE 3) load the rest of blocks. */
 		"ld1 {v16.16b - v18.16b}, [%[p3]] \n\t"
+		/* B.2 (CQE 3) move the block having op_own. */
+		"mov v19.16b, %[c3].16b \n\t"
 		/* B.3 (CQE 3) extract 16B fields. */
 		"tbl v23.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 2) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
 		/* B.4 (CQE 3) adjust CRC length. */
 		"sub v23.8h, v23.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 2) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p2]] \n\t"
-		"sub %[p2], %[p2], #48 \n\t"
 		/* C.1 (CQE 3) generate final structure for mbuf. */
 		"tbl v15.16b, {v23.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 2) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p2]] \n\t"
+		/* B.2 (CQE 2) move the block having op_own. */
+		"mov v19.16b, %[c2].16b \n\t"
 		/* B.3 (CQE 2) extract 16B fields. */
 		"tbl v22.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 1) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
 		/* B.4 (CQE 2) adjust CRC length. */
 		"sub v22.8h, v22.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 1) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p1]] \n\t"
-		"sub %[p1], %[p1], #48 \n\t"
 		/* C.1 (CQE 2) generate final structure for mbuf. */
 		"tbl v14.16b, {v22.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 1) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p1]] \n\t"
+		/* B.2 (CQE 1) move the block having op_own. */
+		"mov v19.16b, %[c1].16b \n\t"
 		/* B.3 (CQE 1) extract 16B fields. */
 		"tbl v21.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
+		/* B.1 (CQE 0) load the rest of blocks. */
+		"ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
 		/* B.4 (CQE 1) adjust CRC length. */
 		"sub v21.8h, v21.8h, %[crc_adj].8h \n\t"
-		/* B.1 (CQE 0) load a block having op_own. */
-		"ld1 {v19.16b}, [%[p0]] \n\t"
-		"sub %[p0], %[p0], #48 \n\t"
 		/* C.1 (CQE 1) generate final structure for mbuf. */
 		"tbl v13.16b, {v21.16b}, %[mb_shuf_m].16b \n\t"
-		/* B.2 (CQE 0) load the rest blocks. */
-		"ld1 {v16.16b - v18.16b}, [%[p0]] \n\t"
+		/* B.2 (CQE 0) move the block having op_own. */
+		"mov v19.16b, %[c0].16b \n\t"
+		/* A.1 load mbuf pointers. */
+		"ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
 		/* B.3 (CQE 0) extract 16B fields. */
 		"tbl v20.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b \n\t"
 		/* B.4 (CQE 0) adjust CRC length. */
 		"sub v20.8h, v20.8h, %[crc_adj].8h \n\t"
-		/* A.1 load mbuf pointers. */
-		"ld1 {v24.2d - v25.2d}, [%[elts_p]] \n\t"
 		/* D.1 extract op_own byte. */
 		"tbl %[op_own].8b, {v20.16b - v23.16b}, %[owner_shuf_m].8b \n\t"
 		/* C.2 (CQE 3) adjust flow mark. */
@@ -917,9 +924,9 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		 [byte_cnt]"=&w"(byte_cnt),
 		 [ptype_info]"=&w"(ptype_info),
 		 [flow_tag]"=&w"(flow_tag)
-		:[p3]"r"(p3 + 48), [p2]"r"(p2 + 48),
-		 [p1]"r"(p1 + 48), [p0]"r"(p0 + 48),
+		:[p3]"r"(p3), [p2]"r"(p2), [p1]"r"(p1), [p0]"r"(p0),
 		 [e3]"r"(e3), [e2]"r"(e2), [e1]"r"(e1), [e0]"r"(e0),
+		 [c3]"w"(c3), [c2]"w"(c2), [c1]"w"(c1), [c0]"w"(c0),
 		 [elts_p]"r"(elts_p),
 		 [pkts_p]"r"(pkts_p),
 		 [cqe_shuf_m]"w"(cqe_shuf_m),
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
index 559b0237e..df66c2fbd 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
@@ -833,7 +833,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 		/* B.2 copy mbuf pointers. */
 		_mm_storeu_si128((__m128i *)&pkts[pos], mbp1);
 		_mm_storeu_si128((__m128i *)&pkts[pos + 2], mbp2);
-		rte_compiler_barrier();
+		rte_cio_rmb();
 		/* C.1 load remained CQE data and extract necessary fields. */
 		cqe_tmp2 = _mm_load_si128((__m128i *)&cq[pos + p3]);
 		cqe_tmp1 = _mm_load_si128((__m128i *)&cq[pos + p2]);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers
  2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
                         ` (8 preceding siblings ...)
  2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 9/9] net/mlx5: fix synchronization on polling Rx completions Yongseok Koh
@ 2018-01-28  7:32       ` Thomas Monjalon
  9 siblings, 0 replies; 49+ messages in thread
From: Thomas Monjalon @ 2018-01-28  7:32 UTC (permalink / raw)
  To: Yongseok Koh
  Cc: dev, adrien.mazarguil, nelio.laranjeiro, bruce.richardson,
	konstantin.ananyev, chaozhu, jerin.jacob, jianbo.liu, arybchenko,
	shahafs

25/01/2018 22:02, Yongseok Koh:
> This patchset is to introduce coherent I/O memory barriers, which could be more
> efficient for coherent memory between I/O device and CPU, especially for ARMv8.
> 
> v4:
> * rename barriers to "coherent I/O memory barrier".
> * Make groups for various barriers in Doxygen doc.
> 
> v3:
> * add more detailed comments about the new memory barriers.
> 
> v2:
> * introduce DMA memory barriers.
> 
> Yongseok Koh (9):
>   eal: add Doxygen grouping for memory barriers
>   eal: introduce coherent I/O memory barriers
>   eal/x86: define coherent I/O memory barriers
>   eal/ppc64: define coherent I/O memory barriers
>   eal/armv7: define coherent I/O memory barriers
>   eal/arm64: define coherent I/O memory barriers
>   net/mlx5: remove unnecessary memory barrier
>   net/mlx5: replace I/O memory barrier with coherent version
>   net/mlx5: fix synchronization on polling Rx completions

Applied, thanks

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2018-01-28  7:33 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-27  4:28 [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers Yongseok Koh
2017-12-27  4:28 ` [dpdk-dev] [PATCH 2/2] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
2018-01-04 12:58 ` [dpdk-dev] [PATCH 1/2] eal/arm64: modify I/O device memory barriers Jerin Jacob
2018-01-08  1:55 ` Jianbo Liu
2018-01-16  0:42   ` Yongseok Koh
2018-01-16  1:10 ` [dpdk-dev] [PATCH v2 0/8] introduce DMA " Yongseok Koh
2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 1/8] eal: " Yongseok Koh
2018-01-16  2:47     ` Jianbo Liu
2018-01-16  7:49     ` Andrew Rybchenko
2018-01-16  9:10       ` Jianbo Liu
2018-01-17 13:46         ` Thomas Monjalon
2018-01-17 18:39           ` Yongseok Koh
2018-01-18 11:56             ` Andrew Rybchenko
2018-01-18 18:14               ` Yongseok Koh
2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 2/8] eal/x86: define " Yongseok Koh
2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 3/8] eal/ppc64: define DMA device " Yongseok Koh
2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 4/8] eal/armv7: define DMA " Yongseok Koh
2018-01-16  2:48     ` Jianbo Liu
2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 5/8] eal/arm64: " Yongseok Koh
2018-01-16  2:50     ` Jianbo Liu
2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 6/8] net/mlx5: remove unnecessary memory barrier Yongseok Koh
2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 7/8] net/mlx5: replace IO memory barrier with DMA " Yongseok Koh
2018-01-16  1:10   ` [dpdk-dev] [PATCH v2 8/8] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
2018-01-16  3:53     ` Jianbo Liu
2018-01-19  0:44   ` [dpdk-dev] [PATCH v3 0/8] introduce DMA memory barriers Yongseok Koh
2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 1/8] eal: " Yongseok Koh
2018-01-19  7:16       ` Andrew Rybchenko
2018-01-22 18:29         ` Yongseok Koh
2018-01-22 20:59           ` Thomas Monjalon
2018-01-23  4:35           ` Jerin Jacob
2018-01-25 19:08             ` Yongseok Koh
2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 2/8] eal/x86: define " Yongseok Koh
2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 3/8] eal/ppc64: " Yongseok Koh
2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 4/8] eal/armv7: " Yongseok Koh
2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 5/8] eal/arm64: " Yongseok Koh
2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 6/8] net/mlx5: remove unnecessary memory barrier Yongseok Koh
2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 7/8] net/mlx5: replace IO memory barrier with DMA " Yongseok Koh
2018-01-19  0:44     ` [dpdk-dev] [PATCH v3 8/8] net/mlx5: fix synchonization on polling Rx completions Yongseok Koh
2018-01-25 21:02     ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Yongseok Koh
2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 1/9] eal: add Doxygen grouping for " Yongseok Koh
2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 2/9] eal: introduce coherent I/O " Yongseok Koh
2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 3/9] eal/x86: define " Yongseok Koh
2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 4/9] eal/ppc64: " Yongseok Koh
2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 5/9] eal/armv7: " Yongseok Koh
2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 6/9] eal/arm64: " Yongseok Koh
2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 7/9] net/mlx5: remove unnecessary memory barrier Yongseok Koh
2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 8/9] net/mlx5: replace I/O memory barrier with coherent version Yongseok Koh
2018-01-25 21:02       ` [dpdk-dev] [PATCH v4 9/9] net/mlx5: fix synchronization on polling Rx completions Yongseok Koh
2018-01-28  7:32       ` [dpdk-dev] [PATCH v4 0/9] introduce coherent I/O memory barriers Thomas Monjalon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).