DPDK patches and discussions
* [dpdk-dev] [PATCH] atomic: clarify use of memory barriers
@ 2014-05-20  9:36 Olivier Matz
  2014-05-20 10:05 ` Ananyev, Konstantin
  0 siblings, 1 reply; 7+ messages in thread
From: Olivier Matz @ 2014-05-20  9:36 UTC (permalink / raw)
  To: dev

This commit introduces rte_smp_mb(), rte_smp_wmb() and rte_smp_rmb(), in
order to differentiate memory barriers used between lcores from memory
barriers used between a lcore and a device. The patch does not provide
any functional change; the goal is to make the calls more explicit and
the code more readable.

Note that on x86 CPUs, memory barriers between different cores can be
guaranteed by a simple compiler barrier. If the memory is also accessed
by a device, a real memory barrier instruction has to be used.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test/test_mbuf.c                       |  2 +-
 lib/librte_eal/bsdapp/eal/eal_thread.c     |  2 +-
 lib/librte_eal/common/eal_common_launch.c  |  2 +-
 lib/librte_eal/common/include/rte_atomic.h | 24 ++++++++++++++++++++++++
 lib/librte_eal/linuxapp/eal/eal_thread.c   |  2 +-
 lib/librte_pmd_virtio/virtqueue.h          |  6 +++---
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c   |  4 ++--
 lib/librte_pmd_xenvirt/virtqueue.h         |  2 +-
 lib/librte_ring/rte_ring.h                 |  8 ++++----
 lib/librte_timer/rte_timer.c               |  8 ++++----
 10 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index ba2f2d9..7945069 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -603,7 +603,7 @@ test_refcnt_master(void)
 		test_refcnt_iter(lcore, i);
 
 	refcnt_stop_slaves = 1;
-	rte_wmb();
+	rte_smp_wmb();
 
 	printf("%s finished at lcore %u\n", __func__, lcore);
 	return (0);
diff --git a/lib/librte_eal/bsdapp/eal/eal_thread.c b/lib/librte_eal/bsdapp/eal/eal_thread.c
index d2bec2e..00e1647 100644
--- a/lib/librte_eal/bsdapp/eal/eal_thread.c
+++ b/lib/librte_eal/bsdapp/eal/eal_thread.c
@@ -223,7 +223,7 @@ eal_thread_loop(__attribute__((unused)) void *arg)
 		fct_arg = lcore_config[lcore_id].arg;
 		ret = lcore_config[lcore_id].f(fct_arg);
 		lcore_config[lcore_id].ret = ret;
-		rte_wmb();
+		rte_smp_wmb();
 		lcore_config[lcore_id].state = FINISHED;
 	}
 
diff --git a/lib/librte_eal/common/eal_common_launch.c b/lib/librte_eal/common/eal_common_launch.c
index 6f7c696..a3cc282 100644
--- a/lib/librte_eal/common/eal_common_launch.c
+++ b/lib/librte_eal/common/eal_common_launch.c
@@ -57,7 +57,7 @@ rte_eal_wait_lcore(unsigned slave_id)
 	while (lcore_config[slave_id].state != WAIT &&
 	       lcore_config[slave_id].state != FINISHED);
 
-	rte_rmb();
+	rte_smp_rmb();
 
 	/* we are in finished state, go to wait state */
 	lcore_config[slave_id].state = WAIT;
diff --git a/lib/librte_eal/common/include/rte_atomic.h b/lib/librte_eal/common/include/rte_atomic.h
index ef95160..4056e53 100644
--- a/lib/librte_eal/common/include/rte_atomic.h
+++ b/lib/librte_eal/common/include/rte_atomic.h
@@ -82,6 +82,30 @@ extern "C" {
 #define	rte_rmb() _mm_lfence()
 
 /**
+ * General memory barrier between 2 lcores.
+ *
+ * On Intel CPUs, it's already guaranteed by the hardware so only a
+ * compiler barrier is needed.
+ */
+#define	rte_smp_mb() rte_compiler_barrier()
+
+/**
+ * Write memory barrier between 2 lcores.
+ *
+ * On Intel CPUs, it's already guaranteed by the hardware so only a
+ * compiler barrier is needed.
+ */
+#define	rte_smp_wmb() rte_compiler_barrier()
+
+/**
+ * Read memory barrier between 2 lcores.
+ *
+ * On Intel CPUs, it's already guaranteed by the hardware so only a
+ * compiler barrier is needed.
+ */
+#define	rte_smp_rmb() rte_compiler_barrier()
+
+/**
  * Compiler barrier.
  *
  * Guarantees that operation reordering does not occur at compile time 
diff --git a/lib/librte_eal/linuxapp/eal/eal_thread.c b/lib/librte_eal/linuxapp/eal/eal_thread.c
index bf77873..2d839e6 100644
--- a/lib/librte_eal/linuxapp/eal/eal_thread.c
+++ b/lib/librte_eal/linuxapp/eal/eal_thread.c
@@ -223,7 +223,7 @@ eal_thread_loop(__attribute__((unused)) void *arg)
 		fct_arg = lcore_config[lcore_id].arg;
 		ret = lcore_config[lcore_id].f(fct_arg);
 		lcore_config[lcore_id].ret = ret;
-		rte_wmb();
+		rte_smp_wmb();
 		lcore_config[lcore_id].state = FINISHED;
 	}
 
diff --git a/lib/librte_pmd_virtio/virtqueue.h b/lib/librte_pmd_virtio/virtqueue.h
index 11f908c..f45dd47 100644
--- a/lib/librte_pmd_virtio/virtqueue.h
+++ b/lib/librte_pmd_virtio/virtqueue.h
@@ -46,9 +46,9 @@
 #include "virtio_ring.h"
 #include "virtio_logs.h"
 
-#define mb()  rte_mb()
-#define wmb() rte_wmb()
-#define rmb() rte_rmb()
+#define mb()  rte_smp_mb()
+#define wmb() rte_smp_wmb()
+#define rmb() rte_smp_rmb()
 
 #ifdef RTE_PMD_PACKET_PREFETCH
 #define rte_packet_prefetch(p)  rte_prefetch1(p)
diff --git a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
index 4f8b712..7ebde3b 100644
--- a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
+++ b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
@@ -98,7 +98,7 @@ eth_xenvirt_rx(void *q, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 
 	nb_used = VIRTQUEUE_NUSED(rxvq);
 
-	rte_compiler_barrier(); /* rmb */
+	rte_smp_rmb();
 	num = (uint16_t)(likely(nb_used <= nb_pkts) ? nb_used : nb_pkts);
 	num = (uint16_t)(likely(num <= VIRTIO_MBUF_BURST_SZ) ? num : VIRTIO_MBUF_BURST_SZ);
 	if (unlikely(num == 0)) return 0;
@@ -149,7 +149,7 @@ eth_xenvirt_tx(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
 	nb_used = VIRTQUEUE_NUSED(txvq);
 
-	rte_compiler_barrier();   /* rmb */
+	rte_smp_rmb();
 
 	num = (uint16_t)(likely(nb_used <= VIRTIO_MBUF_BURST_SZ) ? nb_used : VIRTIO_MBUF_BURST_SZ);
 	num = virtqueue_dequeue_burst(txvq, snd_pkts, len, num);
diff --git a/lib/librte_pmd_xenvirt/virtqueue.h b/lib/librte_pmd_xenvirt/virtqueue.h
index f36030f..59b4469 100644
--- a/lib/librte_pmd_xenvirt/virtqueue.h
+++ b/lib/librte_pmd_xenvirt/virtqueue.h
@@ -150,7 +150,7 @@ vq_ring_update_avail(struct virtqueue *vq, uint16_t desc_idx)
 	 */
 	avail_idx = (uint16_t)(vq->vq_ring.avail->idx & (vq->vq_nentries - 1));
 	vq->vq_ring.avail->ring[avail_idx] = desc_idx;
-	rte_compiler_barrier();  /* wmb , for IA memory model barrier is enough*/
+	rte_smp_wmb();
 	vq->vq_ring.avail->idx++;
 }
 
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index da54e34..1cbf8a9 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -451,7 +451,7 @@ __rte_ring_mp_do_enqueue(struct rte_ring *r, void * const *obj_table,
 
 	/* write entries in ring */
 	ENQUEUE_PTRS();
-	rte_compiler_barrier();
+	rte_smp_wmb();
 
 	/* if we exceed the watermark */
 	if (unlikely(((mask + 1) - free_entries + n) > r->prod.watermark)) {
@@ -537,7 +537,7 @@ __rte_ring_sp_do_enqueue(struct rte_ring *r, void * const *obj_table,
 
 	/* write entries in ring */
 	ENQUEUE_PTRS();
-	rte_compiler_barrier();
+	rte_smp_wmb();
 
 	/* if we exceed the watermark */
 	if (unlikely(((mask + 1) - free_entries + n) > r->prod.watermark)) {
@@ -628,7 +628,7 @@ __rte_ring_mc_do_dequeue(struct rte_ring *r, void **obj_table,
 
 	/* copy in table */
 	DEQUEUE_PTRS();
-	rte_compiler_barrier();
+	rte_smp_rmb();
 
 	/*
 	 * If there are other dequeues in progress that preceded us,
@@ -703,7 +703,7 @@ __rte_ring_sc_do_dequeue(struct rte_ring *r, void **obj_table,
 
 	/* copy in table */
 	DEQUEUE_PTRS();
-	rte_compiler_barrier();
+	rte_smp_rmb();
 
 	__RING_STAT_ADD(r, deq_success, n);
 	r->cons.tail = cons_next;
diff --git a/lib/librte_timer/rte_timer.c b/lib/librte_timer/rte_timer.c
index 884ee0e..520360e 100755
--- a/lib/librte_timer/rte_timer.c
+++ b/lib/librte_timer/rte_timer.c
@@ -396,7 +396,7 @@ __rte_timer_reset(struct rte_timer *tim, uint64_t expire,
 
 	/* update state: as we are in CONFIG state, only us can modify
 	 * the state so we don't need to use cmpset() here */
-	rte_wmb();
+	rte_smp_wmb();
 	status.state = RTE_TIMER_PENDING;
 	status.owner = (int16_t)tim_lcore;
 	tim->status.u32 = status.u32;
@@ -462,7 +462,7 @@ rte_timer_stop(struct rte_timer *tim)
 	}
 
 	/* mark timer as stopped */
-	rte_wmb();
+	rte_smp_wmb();
 	status.state = RTE_TIMER_STOP;
 	status.owner = RTE_TIMER_NO_OWNER;
 	tim->status.u32 = status.u32;
@@ -558,14 +558,14 @@ void rte_timer_manage(void)
 			__TIMER_STAT_ADD(pending, -1);
 			status.state = RTE_TIMER_STOP;
 			status.owner = RTE_TIMER_NO_OWNER;
-			rte_wmb();
+			rte_smp_wmb();
 			tim->status.u32 = status.u32;
 		}
 		else {
 			/* keep it in list and mark timer as pending */
 			status.state = RTE_TIMER_PENDING;
 			status.owner = (int16_t)lcore_id;
-			rte_wmb();
+			rte_smp_wmb();
 			tim->status.u32 = status.u32;
 			__rte_timer_reset(tim, cur_time + tim->period,
 					tim->period, lcore_id, tim->f, tim->arg, 1);
-- 
1.9.2

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] atomic: clarify use of memory barriers
  2014-05-20  9:36 [dpdk-dev] [PATCH] atomic: clarify use of memory barriers Olivier Matz
@ 2014-05-20 10:05 ` Ananyev, Konstantin
  2014-05-20 12:12   ` Olivier MATZ
  2014-05-23 14:10   ` Olivier MATZ
  0 siblings, 2 replies; 7+ messages in thread
From: Ananyev, Konstantin @ 2014-05-20 10:05 UTC (permalink / raw)
  To: Olivier Matz, dev

Hi Olivier,

> Note that on x86 CPUs, memory barriers between different cores can be guaranteed by a simple compiler barrier.

I don't think this is totally correct.
Yes, for Intel CPUs a memory barrier can be avoided in many cases due to the nearly strict memory ordering.
Though there are a few cases where reordering is possible and where fence instructions would be needed.
 For me:
+#define	rte_smp_rmb() rte_compiler_barrier()

Seems a bit misleading, as there is no real fence.
So I suggest we keep rte_compiler_barrier() naming and usage.

Thanks
Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] atomic: clarify use of memory barriers
  2014-05-20 10:05 ` Ananyev, Konstantin
@ 2014-05-20 12:12   ` Olivier MATZ
  2014-05-20 16:35     ` Ananyev, Konstantin
  2014-05-23 14:10   ` Olivier MATZ
  1 sibling, 1 reply; 7+ messages in thread
From: Olivier MATZ @ 2014-05-20 12:12 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev

Hi Konstantin,

Thank you for your review and feedback.

On 05/20/2014 12:05 PM, Ananyev, Konstantin wrote:
>> Note that on x86 CPUs, memory barriers between different cores can be guaranteed by a simple compiler barrier.
>
> I don't think this is totally correct.
> Yes, for Intel CPUs a memory barrier can be avoided in many cases due to the nearly strict memory ordering.
> Though there are a few cases where reordering is possible and where fence instructions would be needed.

I tried to mimic the behavior of Linux, which differentiates *mb() from
smp_*mb(), but I did it too fast. In Linux, we have [1]:

   smp_mb() = mb() = asm volatile("mfence":::"memory")
   smp_rmb() = compiler_barrier()
   smp_wmb() = compiler_barrier()

At least this should be fixed in the patch. By the way, just for reference,
the idea of the patch came from a discussion we had on the list [2].
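
Concretely, a fixed x86 version of the new macros would then look
roughly like this (just a sketch following the Linux mapping above,
assuming rte_mb() stays defined as _mm_mfence()):

   #define rte_smp_mb()  rte_mb()                 /* real mfence on x86 */
   #define rte_smp_wmb() rte_compiler_barrier()   /* x86 does not reorder stores */
   #define rte_smp_rmb() rte_compiler_barrier()   /* x86 does not reorder loads */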

> For me:
> +#define	rte_smp_rmb() rte_compiler_barrier()
> Seems a bit misleading, as there is no real fence.
> So I suggest we keep rte_compiler_barrier() naming and usage.

The objectives of the patch (which was probably not explained very
clearly in the commit log) were:
- make the code more readable to distinguish between the 2 kinds of
   memory barrier.
- optimize some code to avoid a real memory barrier when not required
   (timers, virtio, ...)

Having a compiler barrier in place of a memory barrier in the code
does not really help to understand what the developer wanted to do.
In the current code we can see that the use of rte_compiler_barrier()
is ambiguous, as it needs a comment to clarify the situation:

	rte_compiler_barrier();   /* rmb */

Don't you think we could fix the patch but keep its logic?

Regards,
Olivier

[1] http://lxr.free-electrons.com/source/arch/x86/include/asm/barrier.h#L81
[2] http://dpdk.org/ml/archives/dev/2014-March/001741.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] atomic: clarify use of memory barriers
  2014-05-20 12:12   ` Olivier MATZ
@ 2014-05-20 16:35     ` Ananyev, Konstantin
  0 siblings, 0 replies; 7+ messages in thread
From: Ananyev, Konstantin @ 2014-05-20 16:35 UTC (permalink / raw)
  To: Olivier MATZ, dev

Hi Olivier,

>- optimize some code to avoid a real memory barrier when not required    (timers, virtio, ...)

That seems like a good thing to me.

> - make the code more readable to distinguish between the 2 kinds of memory barrier.

That part seems a bit misleading to me.
rte_compiler_barrier() is a barrier just for the compiler, not for the real CPU.
It only guarantees that the compiler won't reorder instructions across it while emitting the code.
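
(For reference, the x86 implementation of rte_compiler_barrier() is
essentially an empty asm statement with a "memory" clobber, along the
lines of:

   #define rte_compiler_barrier() do {            \
           asm volatile ("" : : : "memory");      \
   } while(0)

so nothing at all is emitted for the CPU.)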

Looking at Intel Memory Ordering rules (Intel System PG, section 8.2):

1) Reads may be reordered with older writes to different locations but not with older writes to the same location.

So with the following fragment of code:
int a;
extern int *x, *y;
L0:
*y = 0;
rte_compiler_barrier();
L1:
 a = *x;

There is no guarantee that store at L0 will always be finished before load at L1.
Which means to me that rte_smp_mb() can't be identical to compiler_barrier, but should be a real 'mfence' instruction instead.

2) Writes to memory are not reordered with other writes, with the following exceptions:
   ...
   streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); 
   ...

So with the following fragment of code:
extern int *x;
extern __m128i a, *p;
L0: 
_mm_stream_si128( p, a);
rte_compiler_barrier();
L1:
*x = 0;

There is no guarantee that store at L0 will always be finished before store at L1.
Which means to me that rte_smp_wmb() can't be identical to compiler_barrier, but should be a real 'sfence' instruction instead.

The only replacement that seems safe to me is:
#define	rte_smp_rmb() rte_compiler_barrier()

But now, there seems to be a confusion: everyone has to remember that smp_mb() and smp_wmb() are 'real' fences, while smp_rmb() is not.
That's why my suggestion was to simply keep using compiler_barrier() for all cases when we don't need a real fence.

Thanks
Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] atomic: clarify use of memory barriers
  2014-05-20 10:05 ` Ananyev, Konstantin
  2014-05-20 12:12   ` Olivier MATZ
@ 2014-05-23 14:10   ` Olivier MATZ
  2014-05-26 13:57     ` Ananyev, Konstantin
  1 sibling, 1 reply; 7+ messages in thread
From: Olivier MATZ @ 2014-05-23 14:10 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev

Hi Konstantin,

Thanks for these code examples and explanations.

On 05/20/2014 06:35 PM, Ananyev, Konstantin wrote:
 > So with the following fragment of code:
 > extern int *x;
 > extern __m128i a, *p;
 > L0:
 > _mm_stream_si128( p, a);
 > rte_compiler_barrier();
 > L1:
 > *x = 0;
 >
 > There is no guarantee that store at L0 will always be finished
 > before store at L1.

This code fragment looks very similar to what is done in
__rte_ring_sp_do_enqueue():

     [...]
     ENQUEUE_PTRS(); /* I expect it is converted to an SSE store */
     rte_compiler_barrier();
     [...]
     r->prod.tail = prod_next;

So, according to your previous explanation, I understand that
this code would require a write memory barrier in place of the
compiler barrier. Am I wrong?

If it's correct, we are back to the initial question: in this kind
of code, the programmer wants all stores to be issued before the
value of r->prod.tail is set. This is the definition of the write
memory barrier. So wouldn't it be better for him to explicitly call
rte_smp_wmb() instead of adding a compiler barrier, just because he
knows that the latter is sufficient on all currently supported CPUs?
Can we be sure that future Intel CPU generations will behave that
way?

Moreover, if I understand correctly, a real wmb() is needed only if
an SSE store is issued. But the programmer may not control that;
it's the job of the compiler.

 > But now, there seems to be a confusion: everyone has to remember that
 > smp_mb() and smp_wmb() are 'real' fences, while smp_rmb() is not.
 > That's why my suggestion was to simply keep using compiler_barrier()
 > for all cases when we don't need a real fence.

I'm not sure the programmer has to know which smp_*mb() is a real fence
or not. He just expects that it generates the proper CPU instructions
that guarantee the effectiveness of the memory barrier.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] atomic: clarify use of memory barriers
  2014-05-23 14:10   ` Olivier MATZ
@ 2014-05-26 13:57     ` Ananyev, Konstantin
  2014-05-26 14:20       ` Olivier MATZ
  0 siblings, 1 reply; 7+ messages in thread
From: Ananyev, Konstantin @ 2014-05-26 13:57 UTC (permalink / raw)
  To: Olivier MATZ, dev


Hi Olivier,

>> So with the following fragment of code:
>> extern int *x;
>> extern __m128i a, *p;
>> L0:
>> _mm_stream_si128( p, a);
>> rte_compiler_barrier();
>> L1:
>> *x = 0;
>>
>> There is no guarantee that store at L0 will always be finished
>> before store at L1.

>This code fragment looks very similar to what is done in
>__rte_ring_sp_do_enqueue():
>
>    [...]
>     ENQUEUE_PTRS(); /* I expect it is converted to an SSE store */
>     rte_compiler_barrier();
>     [...]
>     r->prod.tail = prod_next;
>So, according to your previous explanation, I understand that
>this code would require a write memory barrier in place of the
>compiler barrier. Am I wrong?

No, right now a compiler barrier is enough here.
ENQUEUE_PTRS() doesn't use non-temporal stores (MOVNT*), so the write order should be guaranteed.
Though, if in the future we change ENQUEUE_PTRS() to use non-temporal stores, we'll have to use sfence (or mfence).
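
For illustration only, a hypothetical NT-store variant of the publish
sequence would then have to look roughly like this (the names here are
made up, this is not the actual ring code):

   #include <emmintrin.h>
   /* copy object pointers into the ring with non-temporal stores */
   _mm_stream_si128((__m128i *)&r->ring[idx], objs);
   _mm_sfence();              /* drain the NT stores ...              */
   r->prod.tail = prod_next;  /* ... before publishing the new tail   */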

>Moreover, if I understand correctly, a real wmb() is needed only if
>an SSE store is issued. But the programmer may not control that;
>it's the job of the compiler.

'Normal' SIMD writes are not reordered.
So it is ok for the compiler to use them if appropriate.  

> > But now, there seems to be a confusion: everyone has to remember that
>> smp_mb() and smp_wmb() are 'real' fences, while smp_rmb() is not.
>> That's why my suggestion was to simply keep using compiler_barrier()
>> for all cases when we don't need a real fence.

>I'm not sure the programmer has to know which smp_*mb() is a real fence
>or not. He just expects that it generates the proper CPU instructions
>that guarantees the effectiveness of the memory barrier.

In most cases just a compiler barrier is enough, but there are a few exceptions.
Always using fence instructions means introducing an unnecessary slowdown in cases where the order is guaranteed.
Not using fences in cases where they are needed means introducing a race window and possible data corruption.
That's why right now people can use either rte_compiler_barrier() or mb/rmb/wmb - whichever is appropriate for the particular case.

Konstantin

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [dpdk-dev] [PATCH] atomic: clarify use of memory barriers
  2014-05-26 13:57     ` Ananyev, Konstantin
@ 2014-05-26 14:20       ` Olivier MATZ
  0 siblings, 0 replies; 7+ messages in thread
From: Olivier MATZ @ 2014-05-26 14:20 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev

Hi Konstantin,

On 05/26/2014 03:57 PM, Ananyev, Konstantin wrote:
> In most cases just a compiler barrier is enough, but there are a few exceptions.
> Always using fence instructions means introducing an unnecessary slowdown in cases where the order is guaranteed.
> Not using fences in cases where they are needed means introducing a race window and possible data corruption.
> That's why right now people can use either rte_compiler_barrier() or mb/rmb/wmb - whichever is appropriate for the particular case.

OK, so let's drop this patch for now.
Thank you for reviewing and commenting.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-05-26 14:19 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-20  9:36 [dpdk-dev] [PATCH] atomic: clarify use of memory barriers Olivier Matz
2014-05-20 10:05 ` Ananyev, Konstantin
2014-05-20 12:12   ` Olivier MATZ
2014-05-20 16:35     ` Ananyev, Konstantin
2014-05-23 14:10   ` Olivier MATZ
2014-05-26 13:57     ` Ananyev, Konstantin
2014-05-26 14:20       ` Olivier MATZ
