DPDK patches and discussions
* [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal
@ 2020-03-10 17:49 Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 01/10] doc: add generic atomic deprecation section Phil Yang
                   ` (10 more replies)
  0 siblings, 11 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

DPDK provides generic rte_atomic APIs for several atomic operations.
These APIs use the deprecated __sync built-ins and enforce full
memory barriers on aarch64. However, full barriers are not necessary
in many use cases. To address such use cases, the C language offers
C11 atomic APIs, which provide finer memory barrier control through
a memory ordering parameter supplied by the user.
Various patches submitted in the past [1] and the patches in this series
indicate significant performance gains on multiple aarch64 CPUs and no
performance loss on x86.
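
The difference between the two families of built-ins can be sketched as
follows. This is a minimal illustration, not code from the series: the
counter and the function names are hypothetical. Both variants perform
the same atomic addition; only the C11-style built-in lets the caller
choose the memory ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical statistics counter; not a DPDK structure. */
static uint64_t pkt_count;

/* Legacy style: the __sync built-in takes no ordering argument and
 * is documented as a full barrier.
 */
static inline uint64_t
count_full_barrier(uint64_t n)
{
	return __sync_add_and_fetch(&pkt_count, n);
}

/* C11-style built-in: the caller picks the ordering. RELAXED keeps the
 * update atomic but imposes no ordering on surrounding accesses.
 */
static inline uint64_t
count_relaxed(uint64_t n)
{
	return __atomic_add_fetch(&pkt_count, n, __ATOMIC_RELAXED);
}

/* Both variants give the same arithmetic result; only ordering differs. */
uint64_t
counter_demo(void)
{
	uint64_t a, b;

	pkt_count = 0;
	a = count_full_barrier(2);
	b = count_relaxed(3);
	assert(a == 2 && b == 5);
	return pkt_count;
}
```

Calling counter_demo() from any test harness exercises both paths.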

But the existing rte_atomic API implementations cannot be changed as the
APIs do not take the memory ordering parameter. The only choice available
is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
order to make this change, the following steps are proposed:

[1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
APIs (a script is added to flag the usages).
[2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

This patchset contains:
1) the checkpatch script changes to flag rte_atomic API usage in patches.
2) changes to the programmer's guide describing how to write efficient code for aarch64.
3) changes to various libraries to make use of C11 atomic APIs.

We are planning to replicate this idea across all the other libraries,
drivers, examples, test applications. In the next phase, we will add
changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

Honnappa Nagarahalli (2):
  service: avoid race condition for MT unsafe service
  service: identify service running on another core correctly

Phil Yang (8):
  doc: add generic atomic deprecation section
  devtools: prevent use of rte atomic APIs in future patches
  vhost: optimize broadcast rarp sync with c11 atomic
  ipsec: optimize with c11 atomic for sa outbound sqn update
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 one-way barrier
  service: relax barriers with C11 atomic operations

 devtools/checkpatches.sh                         |   9 ++
 doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
 lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
 lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
 lib/librte_ipsec/sa.h                            |   2 +-
 lib/librte_vhost/vhost.h                         |   2 +-
 lib/librte_vhost/vhost_user.c                    |   7 +-
 lib/librte_vhost/virtio_net.c                    |  16 ++-
 8 files changed, 175 insertions(+), 99 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH 01/10] doc: add generic atomic deprecation section
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Add a section on deprecating the generic rte_atomic_xx APIs in favor of
C11 atomic built-ins, with guidance and examples.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 60 +++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..b278bc6 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,58 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone; the ordering of the memory operations
+does not matter. One example is the packet statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+It only updates the number of transmitted packets, and no subsequent logic depends
+on these counters. So the RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+        ...
+        ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
+            ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move into the
+critical section, but the memory operations in the critical section cannot move
+above the lock. In this case, the full memory barrier in the CAS operation can
+be replaced with ACQUIRE. On the other hand, the memory operations after the
+`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move into the critical section, but the memory operations in the
+critical section cannot move below the unlock. So the full barrier in the STORE
+operation can be replaced with RELEASE.
+
 Coding Considerations
 ---------------------
 
-- 
2.7.4



* [dpdk-dev] [PATCH 02/10] devtools: prevent use of rte atomic APIs in future patches
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 01/10] doc: add generic atomic deprecation section Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

In order to deprecate the rte_atomic APIs, prevent new patches
from using them.
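
The kind of match the check is meant to perform can be sketched in C
with POSIX regular expressions. This is only an illustration under the
assumption that the expression effectively means "rte_atomic, two
digits, an underscore, then an opening parenthesis later on the line";
the helper names are hypothetical and the real check runs in awk.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 when the line looks like a 16/32/64-bit rte_atomic call,
 * 0 when it does not, and -1 if the pattern fails to compile.
 */
int
uses_rte_atomic(const char *line)
{
	regex_t re;
	int hit;

	if (regcomp(&re, "rte_atomic[0-9][0-9]_.*\\(",
			REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	hit = (regexec(&re, line, 0, NULL, 0) == 0);
	regfree(&re);
	return hit;
}

/* A call is flagged; a C11 built-in and a bare type name are not. */
int
regex_demo(void)
{
	int legacy = uses_rte_atomic("+\trte_atomic32_inc(&s->num_mapped_cores);");
	int c11 = uses_rte_atomic("+\t__atomic_add_fetch(&cnt, 1, __ATOMIC_RELAXED);");
	int type_only = uses_rte_atomic("+\trte_atomic16_t broadcast_rarp;");

	assert(legacy == 1 && c11 == 0 && type_only == 0);
	return legacy;
}
```

Because the pattern requires a parenthesis, a declaration that only
names an rte_atomic type is not flagged by this sketch.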

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 devtools/checkpatches.sh | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/devtools/checkpatches.sh b/devtools/checkpatches.sh
index 1794468..493f48e 100755
--- a/devtools/checkpatches.sh
+++ b/devtools/checkpatches.sh
@@ -61,6 +61,15 @@ check_forbidden_additions() { # <patch>
 		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
 		"$1" || res=1
 
+	# refrain from new additions of 16/32/64 bits rte_atomic_xxx()
+	# multiple folders and expressions are separated by spaces
+	awk -v FOLDERS="lib drivers app examples" \
+		-v EXPRESSIONS="rte_atomic[0-9][0-9]_.*\\\(" \
+		-v RET_ON_FAIL=1 \
+		-v MESSAGE='Using c11 atomic built-ins instead of rte_atomic' \
+		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
+		"$1" || res=1
+
 	# svg figures must be included with wildcard extension
 	# because of png conversion for pdf docs
 	awk -v FOLDERS='doc' \
-- 
2.7.4



* [dpdk-dev] [PATCH 03/10] vhost: optimize broadcast rarp sync with c11 atomic
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 01/10] doc: add generic atomic deprecation section Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The RARP packet broadcast flag is synchronized with rte_atomic_XX APIs,
which impose a full barrier (DMB) on aarch64. This patch optimizes it
with a C11 atomic one-way barrier.
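
The publish/consume pattern described above can be sketched outside of
vhost as follows. The structure and function names are hypothetical
stand-ins, and the ACQUIRE ordering chosen for the successful
compare-and-exchange is an assumption of this sketch rather than what
the diff below uses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the device state; names are illustrative. */
struct demo_dev {
	uint8_t mac[6];
	int16_t broadcast_flag;
};

/* Writer: copy the payload first, then publish the flag with RELEASE
 * so the MAC is visible before the flag is seen as set.
 */
void
demo_request_broadcast(struct demo_dev *dev, const uint8_t mac[6])
{
	memcpy(dev->mac, mac, sizeof(dev->mac));
	__atomic_store_n(&dev->broadcast_flag, 1, __ATOMIC_RELEASE);
}

/* Reader: a cheap load first, then claim the flag with a CAS.
 * Returns 1 and copies the MAC out when the flag was claimed.
 */
int
demo_claim_broadcast(struct demo_dev *dev, uint8_t mac_out[6])
{
	int16_t expected = 1;

	if (__atomic_load_n(&dev->broadcast_flag, __ATOMIC_ACQUIRE) == 0)
		return 0;
	if (!__atomic_compare_exchange_n(&dev->broadcast_flag, &expected, 0,
			0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return 0;
	memcpy(mac_out, dev->mac, sizeof(dev->mac));
	return 1;
}

/* Single-threaded walk through the three states of the flag. */
int
rarp_flag_demo(void)
{
	struct demo_dev dev = { .broadcast_flag = 0 };
	const uint8_t mac[6] = { 0x02, 0, 0, 0, 0, 0x01 };
	uint8_t out[6] = { 0 };
	int before, first, second;

	before = demo_claim_broadcast(&dev, out); /* nothing requested yet */
	demo_request_broadcast(&dev, mac);
	first = demo_claim_broadcast(&dev, out);  /* claims the flag */
	second = demo_claim_broadcast(&dev, out); /* already consumed */
	assert(before == 0 && first == 1 && second == 0);
	return memcmp(mac, out, sizeof(out)) == 0;
}
```

Note that the expected-value variable has the same 16-bit type as the
flag, which the built-in requires.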

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
---
 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/vhost_user.c |  7 +++----
 lib/librte_vhost/virtio_net.c | 16 +++++++++-------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2087d14..0e22125 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,7 +350,7 @@ struct virtio_net {
 	uint32_t		flags;
 	uint16_t		vhost_hlen;
 	/* to tell if we need broadcast rarp packet */
-	rte_atomic16_t		broadcast_rarp;
+	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
 	int			extbuf;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index bd1be01..857187d 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -2145,11 +2145,10 @@ vhost_user_send_rarp(struct virtio_net **pdev, struct VhostUserMsg *msg,
 	 * Set the flag to inject a RARP broadcast packet at
 	 * rte_vhost_dequeue_burst().
 	 *
-	 * rte_smp_wmb() is for making sure the mac is copied
-	 * before the flag is set.
+	 * __ATOMIC_RELEASE ordering is for making sure the mac is
+	 * copied before the flag is set.
 	 */
-	rte_smp_wmb();
-	rte_atomic16_set(&dev->broadcast_rarp, 1);
+	__atomic_store_n(&dev->broadcast_rarp, 1, __ATOMIC_RELEASE);
 	did = dev->vdpa_dev_id;
 	vdpa_dev = rte_vdpa_get_device(did);
 	if (vdpa_dev && vdpa_dev->ops->migration_done)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 37c47c7..d20f60c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2203,6 +2203,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct virtio_net *dev;
 	struct rte_mbuf *rarp_mbuf = NULL;
 	struct vhost_virtqueue *vq;
+	int16_t success = 1;
 
 	dev = get_device(vid);
 	if (!dev)
@@ -2249,16 +2250,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	 *
 	 * broadcast_rarp shares a cacheline in the virtio_net structure
 	 * with some fields that are accessed during enqueue and
-	 * rte_atomic16_cmpset() causes a write if using cmpxchg. This could
-	 * result in false sharing between enqueue and dequeue.
+	 * __atomic_compare_exchange_n causes a write if it performs the
+	 * compare and exchange. This could result in false sharing between
+	 * enqueue and dequeue.
 	 *
 	 * Prevent unnecessary false sharing by reading broadcast_rarp first
-	 * and only performing cmpset if the read indicates it is likely to
-	 * be set.
+	 * and only performing compare and exchange if the read indicates it
+	 * is likely to be set.
 	 */
-	if (unlikely(rte_atomic16_read(&dev->broadcast_rarp) &&
-			rte_atomic16_cmpset((volatile uint16_t *)
-				&dev->broadcast_rarp.cnt, 1, 0))) {
+	if (unlikely(__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_ACQUIRE) &&
+			__atomic_compare_exchange_n(&dev->broadcast_rarp,
+			&success, 0, 0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))) {
 
 		rarp_mbuf = rte_net_make_rarp_packet(mbuf_pool, &dev->mac);
 		if (rarp_mbuf == NULL) {
-- 
2.7.4



* [dpdk-dev] [PATCH 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (2 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 05/10] service: remove rte prefix from static functions Phil Yang
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

For SA outbound packets, rte_atomic64_add_return is used to generate
the SQN atomically. On aarch64 this introduces an unnecessary full
barrier, because the rte_atomic_XX API is implemented with the '__sync'
built-in. This patch replaces it with a C11 atomic operation and
eliminates the expensive barrier on aarch64.
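
A minimal sketch of the idea, assuming a plain 64-bit counter in place
of the SA field (the names below are hypothetical and the overflow
handling of the real function is left out): a RELAXED fetch-add still
hands out non-overlapping sequence number ranges, because atomicity
does not depend on the ordering argument.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outbound sequence counter; stands in for the SA field. */
static uint64_t outb_sqn;

/* Reserve n consecutive sequence numbers and return the first one. */
uint64_t
sqn_reserve(uint32_t n)
{
	uint64_t last = __atomic_add_fetch(&outb_sqn, n, __ATOMIC_RELAXED);

	return last - n + 1;
}

/* Two reservations get disjoint ranges: 1..4 and 5..6. */
uint64_t
sqn_demo(void)
{
	uint64_t first, second;

	outb_sqn = 0;
	first = sqn_reserve(4);
	second = sqn_reserve(2);
	assert(first == 1 && second == 5);
	return outb_sqn;
}
```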

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_ipsec/ipsec_sqn.h | 3 ++-
 lib/librte_ipsec/sa.h        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
index 0c2f76a..e884af7 100644
--- a/lib/librte_ipsec/ipsec_sqn.h
+++ b/lib/librte_ipsec/ipsec_sqn.h
@@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
 
 	n = *num;
 	if (SQN_ATOMIC(sa))
-		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
+		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
+			__ATOMIC_RELAXED);
 	else {
 		sqn = sa->sqn.outb.raw + n;
 		sa->sqn.outb.raw = sqn;
diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
index d22451b..cab9a2e 100644
--- a/lib/librte_ipsec/sa.h
+++ b/lib/librte_ipsec/sa.h
@@ -120,7 +120,7 @@ struct rte_ipsec_sa {
 	 */
 	union {
 		union {
-			rte_atomic64_t atom;
+			uint64_t atom;
 			uint64_t raw;
 		} outb;
 		struct {
-- 
2.7.4



* [dpdk-dev] [PATCH 05/10] service: remove rte prefix from static functions
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (3 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 06/10] service: remove redundant code Phil Yang
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, stable

Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 7e537b8..a691f5d 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -333,7 +333,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -376,10 +376,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -433,7 +433,7 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
@@ -703,7 +703,7 @@ rte_service_lcore_start(uint32_t lcore)
 	 */
 	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -782,7 +782,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
+service_dump_one(FILE *f, struct rte_service_spec_impl *s,
 		     uint64_t all_cycles, uint32_t reset)
 {
 	/* avoid divide by zero */
@@ -815,7 +815,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, 0, reset);
 	return 0;
 }
 
@@ -873,7 +873,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, total_cycles, reset);
 		return 0;
 	}
 
@@ -883,7 +883,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], total_cycles, reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.7.4



* [dpdk-dev] [PATCH 06/10] service: remove redundant code
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (4 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 05/10] service: remove rte prefix from static functions Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 07/10] service: avoid race condition for MT unsafe service Phil Yang
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Stable

The service id is already validated in the calling function, so remove
the redundant check inside the service_update function.

Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: Stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index a691f5d..6990dc2 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -549,21 +549,10 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
+	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
 	if (!lcore_states[lcore].is_service_core)
@@ -595,19 +584,23 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.7.4



* [dpdk-dev] [PATCH 07/10] service: avoid race condition for MT unsafe service
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (5 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 06/10] service: remove redundant code Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 08/10] service: identify service running on another core correctly Phil Yang
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

It is possible that an MT unsafe service might get configured to
run on another core while the service is currently running. This
might result in the MT unsafe service running on multiple cores
simultaneously. Always use 'execute_lock' when the service is
MT unsafe.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/common/rte_service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6990dc2..b37fc56 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -367,12 +371,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-- 
2.7.4



* [dpdk-dev] [PATCH 08/10] service: identify service running on another core correctly
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (6 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 07/10] service: avoid race condition for MT unsafe service Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 09/10] service: optimize with c11 one-way barrier Phil Yang
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters and reduce the number of
instructions for all cases.

Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b37fc56..0186024 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -357,7 +357,7 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -371,7 +371,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -409,24 +409,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to indicate that the service is
+	 * is running on a core.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -446,7 +436,7 @@ service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.7.4



* [dpdk-dev] [PATCH 09/10] service: optimize with c11 one-way barrier
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (7 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 08/10] service: identify service running on another core correctly Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 10/10] service: relax barriers with C11 atomic operations Phil Yang
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The num_mapped_cores and execute_lock fields are synchronized with
rte_atomic_XX APIs, which impose a full barrier (DMB) on aarch64. This
patch optimizes them with C11 atomic one-way barriers.
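
The lock semantics this relies on can be sketched as a small try-lock.
The variable and function names are hypothetical, not the ones in
rte_service.c: the compare-and-exchange takes the lock with ACQUIRE so
the callback cannot be hoisted above it, and the store drops the lock
with RELEASE so the callback cannot sink below it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

static uint32_t demo_lock; /* 0 = free, 1 = held */
static int demo_calls;     /* how many times the callback ran */

static void
demo_callback(void)
{
	demo_calls++;
}

/* Run cb() under the try-lock; return -EBUSY if the lock is held. */
int
run_serialized(void (*cb)(void))
{
	uint32_t expected = 0;

	if (!__atomic_compare_exchange_n(&demo_lock, &expected, 1,
			0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return -EBUSY;

	cb();
	__atomic_store_n(&demo_lock, 0, __ATOMIC_RELEASE);
	return 0;
}

/* One successful run, then one attempt while the lock is held. */
int
lock_demo(void)
{
	int ok, busy;

	demo_lock = 0;
	demo_calls = 0;
	ok = run_serialized(demo_callback);
	demo_lock = 1; /* simulate another core holding the lock */
	busy = run_serialized(demo_callback);
	demo_lock = 0;
	assert(ok == 0 && busy == -EBUSY);
	return demo_calls;
}
```

The failure ordering is RELAXED because nothing is read or written
under the lock when the attempt fails.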

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 0186024..efb3c9f 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -42,7 +42,7 @@ struct rte_service_spec_impl {
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	uint32_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +54,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	int32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -329,7 +329,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -372,11 +373,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		uint32_t expected = 0;
+		/* ACQUIRE ordering here is to prevent the callback
+		 * function from hoisting up before the execute_lock
+		 * setting.
+		 */
+		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
+			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		/* RELEASE ordering here is used to pair with ACQUIRE
+		 * above to achieve lock semantic.
+		 */
+		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -412,11 +422,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to indicate that the service is
 	 * is running on a core.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
 
 	return ret;
 }
@@ -549,24 +559,32 @@ service_update(uint32_t sid, uint32_t lcore,
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		/* When multiple threads try to update the same lcore
+		 * service concurrently, e.g. set lcore map followed
+		 * by clear lcore map, the unsynchronized service_mask
+		 * values have issues on the num_mapped_cores value
+		 * consistency. So we use ACQUIRE ordering to pair with
+		 * the RELEASE ordering to synchronize the service_mask.
+		 */
+		uint64_t lcore_mapped = __atomic_load_n(
+					&lcore_states[lcore].service_mask,
+					__ATOMIC_ACQUIRE) & sid_mask;
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -622,7 +640,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -705,7 +724,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
-- 
2.7.4


* [dpdk-dev] [PATCH 10/10] service: relax barriers with C11 atomic operations
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (8 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 09/10] service: optimize with c11 one-way barrier Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The service library uses many rte_smp_rmb/rte_smp_wmb barriers to
guarantee inter-thread visibility of shared data. This patch relaxes
these barriers by using C11 atomic one-way barrier operations.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
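Note: replacing "plain store + rte_smp_wmb()" with a RELEASE store, and
"plain load + rte_smp_rmb()" with an ACQUIRE load, follows the usual
publish/consume pattern. The sketch below assumes a simplified,
hypothetical demo_core_state structure; it is not the core_state layout
from rte_service.c.

```c
#include <stdint.h>

enum { DEMO_STOPPED = 0, DEMO_RUNNING = 1 };

/* Hypothetical, simplified per-core state. */
struct demo_core_state {
	uint64_t service_mask;	/* plain data, published through runstate */
	uint8_t runstate;	/* written with RELEASE, read with ACQUIRE */
};

/* Writer: fill in the data, then publish it by setting the flag. */
static void
demo_publish(struct demo_core_state *cs, uint64_t mask)
{
	cs->service_mask = mask;
	/* RELEASE: the service_mask write is visible to any thread that
	 * observes DEMO_RUNNING through an ACQUIRE load.
	 */
	__atomic_store_n(&cs->runstate, DEMO_RUNNING, __ATOMIC_RELEASE);
}

/* Reader: returns 1 and copies the mask if the state was published. */
static int
demo_consume(struct demo_core_state *cs, uint64_t *mask_out)
{
	/* ACQUIRE: the service_mask read below cannot be reordered
	 * before this load.
	 */
	if (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) != DEMO_RUNNING)
		return 0;

	*mask_out = cs->service_mask;
	return 1;
}
```

The ordering is attached to the flag access itself, so no standalone
barrier is needed on either side.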
 lib/librte_eal/common/rte_service.c | 45 ++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index efb3c9f..68542b0 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -176,9 +176,11 @@ rte_service_set_stats_enable(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_STATS_ENABLED;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_STATS_ENABLED,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_STATS_ENABLED);
+		__atomic_and_fetch(&s->internal_flags,
+			~(SERVICE_F_STATS_ENABLED), __ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -190,9 +192,11 @@ rte_service_set_runstate_mapped_check(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_START_CHECK;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_START_CHECK,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_START_CHECK);
+		__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_START_CHECK),
+			__ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -261,8 +265,8 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
-	rte_service_count++;
+	/* make sure the counter is updated after the state change. */
+	__atomic_add_fetch(&rte_service_count, 1, __ATOMIC_RELEASE);
 
 	if (id_ptr)
 		*id_ptr = free_slot;
@@ -278,9 +282,10 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
-	s->internal_flags &= ~(SERVICE_F_REGISTERED);
+	/* make sure the counter is updated before the state change. */
+	__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_REGISTERED),
+			   __ATOMIC_RELEASE);
 
 	/* clear the run-bit in all cores */
 	for (i = 0; i < RTE_MAX_LCORE; i++)
@@ -298,11 +303,12 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -313,11 +319,12 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -439,7 +446,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (__atomic_load_n(&cs->runstate,
+		    __ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -450,8 +458,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -660,9 +666,8 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+			__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
-- 
2.7.4


* [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (9 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 10/10] service: relax barriers with C11 atomic operations Phil Yang
@ 2020-03-12  7:44 ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 01/10] doc: add generic atomic deprecation section Phil Yang
                     ` (10 more replies)
  10 siblings, 11 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

DPDK provides generic rte_atomic APIs to do several atomic operations.
These APIs are using the deprecated __sync built-ins and enforce full
memory barriers on aarch64. However, full barriers are not necessary
in many use cases. In order to address such use cases, C language offers
C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
by making use of the memory ordering parameter provided by the user.
Various patches submitted in the past [1] and the patches in this series
indicate significant performance gains on multiple aarch64 CPUs and no
performance loss on x86.
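
As a rough illustration of the difference, the sketch below places a
__sync-style increment next to its __atomic counterpart. The demo_*
function names are hypothetical; only the two built-ins are real. The
first form has no way to express an ordering and always implies a full
barrier, while the second lets the caller choose one.

```c
#include <stdint.h>

/* Legacy style: the __sync built-in takes no ordering argument and
 * always acts as a full barrier.
 */
static int32_t
demo_inc_full_barrier(int32_t *v)
{
	return __sync_add_and_fetch(v, 1);
}

/* C11 style: the caller picks the ordering. RELAXED keeps the
 * increment atomic but imposes no ordering on surrounding accesses.
 */
static int32_t
demo_inc_relaxed(int32_t *v)
{
	return __atomic_add_fetch(v, 1, __ATOMIC_RELAXED);
}
```

Both functions return the incremented value; they differ only in the
ordering they impose on neighbouring memory accesses.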

But the existing rte_atomic API implementations cannot be changed as the
APIs do not take the memory ordering parameter. The only choice available
is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
order to make this change, the following steps are proposed:

[1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
APIs (a script is added to flag the usages).
[2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

This patchset contains:
1) the checkpatch script changes to flag rte_atomic API usage in patches.
2) changes to programmer guide describing writing efficient code for aarch64.
3) changes to various libraries to make use of c11 atomic APIs.

We are planning to replicate this idea across all the other libraries,
drivers, examples, test applications. In the next phase, we will add
changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

v2:
1. fix Clang '-Wincompatible-pointer-types' WARNING.
2. fix typos.

Honnappa Nagarahalli (2):
  service: avoid race condition for MT unsafe service
  service: identify service running on another core correctly

Phil Yang (8):
  doc: add generic atomic deprecation section
  devtools: prevent use of rte atomic APIs in future patches
  vhost: optimize broadcast rarp sync with c11 atomic
  ipsec: optimize with c11 atomic for sa outbound sqn update
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 one-way barrier
  service: relax barriers with C11 atomic operations

 devtools/checkpatches.sh                         |   9 ++
 doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
 lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
 lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
 lib/librte_ipsec/sa.h                            |   2 +-
 lib/librte_vhost/vhost.h                         |   2 +-
 lib/librte_vhost/vhost_user.c                    |   7 +-
 lib/librte_vhost/virtio_net.c                    |  16 ++-
 8 files changed, 175 insertions(+), 99 deletions(-)

-- 
2.7.4


* [dpdk-dev] [PATCH v2 01/10] doc: add generic atomic deprecation section
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Add a section to the programmer's guide that deprecates the generic
rte_atomic_xx APIs in favor of C11 atomic built-ins, with examples.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 60 +++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..b278bc6 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,58 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone; the ordering of the memory operations
+does not matter. For example, the packet statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+It just updates the number of transmitted packets; no subsequent logic depends
+on these counters. So the RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+        ...
+        ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
+            ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
+critical section, but the memory operations in the critical section cannot move
+above the lock. In this case, the full memory barrier in the CAS operation can
+be replaced with ACQUIRE. On the other hand, the memory operations after the
+`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
+critical section cannot move below the unlock. So the full barrier in the STORE
+operation can be replaced with RELEASE.
+
 Coding Considerations
 ---------------------
 
-- 
2.7.4


* [dpdk-dev] [PATCH v2 02/10] devtools: prevent use of rte atomic APIs in future patches
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 01/10] doc: add generic atomic deprecation section Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

To deprecate the rte_atomic APIs, flag any new use of them in
submitted patches.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 devtools/checkpatches.sh | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/devtools/checkpatches.sh b/devtools/checkpatches.sh
index 1794468..493f48e 100755
--- a/devtools/checkpatches.sh
+++ b/devtools/checkpatches.sh
@@ -61,6 +61,15 @@ check_forbidden_additions() { # <patch>
 		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
 		"$1" || res=1
 
+	# refrain from new additions of 16/32/64 bits rte_atomic_xxx()
+	# multiple folders and expressions are separated by spaces
+	awk -v FOLDERS="lib drivers app examples" \
+		-v EXPRESSIONS="rte_atomic[0-9][0-9]_.*\\\(" \
+		-v RET_ON_FAIL=1 \
+		-v MESSAGE='Using c11 atomic built-ins instead of rte_atomic' \
+		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
+		"$1" || res=1
+
 	# svg figures must be included with wildcard extension
 	# because of png conversion for pdf docs
 	awk -v FOLDERS='doc' \
-- 
2.7.4


* [dpdk-dev] [PATCH v2 03/10] vhost: optimize broadcast rarp sync with c11 atomic
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 01/10] doc: add generic atomic deprecation section Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The RARP packet broadcast flag is synchronized with the rte_atomic_XX
APIs, which enforce a full barrier (DMB) on aarch64. This patch replaces
them with C11 atomic one-way barriers.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
---
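Note: the dequeue path reads the flag first and only attempts the
compare-and-exchange when the flag looks set, so the common case does not
write to the shared cache line. The sketch below shows that
"test, then compare-and-exchange" shape with hypothetical demo_* names. It
copies the memory orderings used in the diff and is an illustration only,
not the vhost code.

```c
#include <stdint.h>

/* Hypothetical stand-in for the broadcast_rarp flag. */
static int16_t demo_flag;

/* Producer: publish the request. RELEASE orders earlier writes
 * (the MAC copy in the patch) before the flag becomes visible.
 */
static void
demo_set_flag(int16_t *flag)
{
	__atomic_store_n(flag, 1, __ATOMIC_RELEASE);
}

/* Consumer: returns 1 if this caller cleared a set flag, else 0. */
static int
demo_test_and_clear(int16_t *flag)
{
	int16_t expected = 1;

	/* Cheap read first: when the flag is clear, return without
	 * writing to the cache line.
	 */
	if (__atomic_load_n(flag, __ATOMIC_ACQUIRE) == 0)
		return 0;

	/* Only one caller can win the 1 -> 0 transition. */
	return __atomic_compare_exchange_n(flag, &expected, 0, 0,
			__ATOMIC_RELEASE, __ATOMIC_RELAXED);
}
```

If several threads race on a set flag, only one compare-and-exchange
succeeds, so the flag is consumed once.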
 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/vhost_user.c |  7 +++----
 lib/librte_vhost/virtio_net.c | 16 +++++++++-------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2087d14..0e22125 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,7 +350,7 @@ struct virtio_net {
 	uint32_t		flags;
 	uint16_t		vhost_hlen;
 	/* to tell if we need broadcast rarp packet */
-	rte_atomic16_t		broadcast_rarp;
+	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
 	int			extbuf;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index bd1be01..857187d 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -2145,11 +2145,10 @@ vhost_user_send_rarp(struct virtio_net **pdev, struct VhostUserMsg *msg,
 	 * Set the flag to inject a RARP broadcast packet at
 	 * rte_vhost_dequeue_burst().
 	 *
-	 * rte_smp_wmb() is for making sure the mac is copied
-	 * before the flag is set.
+	 * __ATOMIC_RELEASE ordering is for making sure the mac is
+	 * copied before the flag is set.
 	 */
-	rte_smp_wmb();
-	rte_atomic16_set(&dev->broadcast_rarp, 1);
+	__atomic_store_n(&dev->broadcast_rarp, 1, __ATOMIC_RELEASE);
 	did = dev->vdpa_dev_id;
 	vdpa_dev = rte_vdpa_get_device(did);
 	if (vdpa_dev && vdpa_dev->ops->migration_done)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 37c47c7..fa10deb 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2203,6 +2203,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct virtio_net *dev;
 	struct rte_mbuf *rarp_mbuf = NULL;
 	struct vhost_virtqueue *vq;
+	int16_t success = 1;
 
 	dev = get_device(vid);
 	if (!dev)
@@ -2249,16 +2250,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	 *
 	 * broadcast_rarp shares a cacheline in the virtio_net structure
 	 * with some fields that are accessed during enqueue and
-	 * rte_atomic16_cmpset() causes a write if using cmpxchg. This could
-	 * result in false sharing between enqueue and dequeue.
+	 * __atomic_compare_exchange_n causes a write if the compare
+	 * and exchange is performed. This could result in false sharing
+	 * between enqueue and dequeue.
 	 *
 	 * Prevent unnecessary false sharing by reading broadcast_rarp first
-	 * and only performing cmpset if the read indicates it is likely to
-	 * be set.
+	 * and only performing compare and exchange if the read indicates it
+	 * is likely to be set.
 	 */
-	if (unlikely(rte_atomic16_read(&dev->broadcast_rarp) &&
-			rte_atomic16_cmpset((volatile uint16_t *)
-				&dev->broadcast_rarp.cnt, 1, 0))) {
+	if (unlikely(__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_ACQUIRE) &&
+			__atomic_compare_exchange_n(&dev->broadcast_rarp,
+			&success, 0, 0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))) {
 
 		rarp_mbuf = rte_net_make_rarp_packet(mbuf_pool, &dev->mac);
 		if (rarp_mbuf == NULL) {
-- 
2.7.4


* [dpdk-dev] [PATCH v2 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (2 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 05/10] service: remove rte prefix from static functions Phil Yang
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

For SA outbound packets, rte_atomic64_add_return is used to generate
the SQN atomically. On aarch64 this introduces an unnecessary full
barrier, because the rte_atomic_XX API is implemented with the '__sync'
built-ins. This patch uses a C11 atomic operation instead and eliminates
the expensive barrier on aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
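Note: the SQN update only needs each caller to receive a distinct range of
sequence numbers. No other memory access depends on the ordering of the
counter update, so RELAXED is enough. The sketch below is a minimal
illustration with a hypothetical demo_sqn_reserve() helper, not the
esn_outb_update_sqn() code.

```c
#include <stdint.h>

/* Reserve n consecutive sequence numbers and return the last one.
 * The add is atomic, so concurrent callers get disjoint ranges;
 * RELAXED imposes no ordering on surrounding memory accesses.
 */
static uint64_t
demo_sqn_reserve(uint64_t *sqn, uint32_t n)
{
	return __atomic_add_fetch(sqn, n, __ATOMIC_RELAXED);
}
```

A caller that reserves n numbers and gets back value v owns the range
v - n + 1 through v.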
 lib/librte_ipsec/ipsec_sqn.h | 3 ++-
 lib/librte_ipsec/sa.h        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
index 0c2f76a..e884af7 100644
--- a/lib/librte_ipsec/ipsec_sqn.h
+++ b/lib/librte_ipsec/ipsec_sqn.h
@@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
 
 	n = *num;
 	if (SQN_ATOMIC(sa))
-		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
+		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
+			__ATOMIC_RELAXED);
 	else {
 		sqn = sa->sqn.outb.raw + n;
 		sa->sqn.outb.raw = sqn;
diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
index d22451b..cab9a2e 100644
--- a/lib/librte_ipsec/sa.h
+++ b/lib/librte_ipsec/sa.h
@@ -120,7 +120,7 @@ struct rte_ipsec_sa {
 	 */
 	union {
 		union {
-			rte_atomic64_t atom;
+			uint64_t atom;
 			uint64_t raw;
 		} outb;
 		struct {
-- 
2.7.4


* [dpdk-dev] [PATCH v2 05/10] service: remove rte prefix from static functions
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (3 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 06/10] service: remove redundant code Phil Yang
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, stable

Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 7e537b8..a691f5d 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -333,7 +333,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -376,10 +376,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -433,7 +433,7 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
@@ -703,7 +703,7 @@ rte_service_lcore_start(uint32_t lcore)
 	 */
 	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -782,7 +782,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
+service_dump_one(FILE *f, struct rte_service_spec_impl *s,
 		     uint64_t all_cycles, uint32_t reset)
 {
 	/* avoid divide by zero */
@@ -815,7 +815,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, 0, reset);
 	return 0;
 }
 
@@ -873,7 +873,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, total_cycles, reset);
 		return 0;
 	}
 
@@ -883,7 +883,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], total_cycles, reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.7.4


* [dpdk-dev] [PATCH v2 06/10] service: remove redundant code
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (4 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 05/10] service: remove rte prefix from static functions Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 07/10] service: avoid race condition for MT unsafe service Phil Yang
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Stable

The service ID is already validated in the calling functions, so remove
the redundant lookup inside the service_update function.

Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: Stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index a691f5d..6990dc2 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -549,21 +549,10 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
+	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
 	if (!lcore_states[lcore].is_service_core)
@@ -595,19 +584,23 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.7.4



* [dpdk-dev] [PATCH v2 07/10] service: avoid race condition for MT unsafe service
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (5 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 06/10] service: remove redundant code Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 08/10] service: identify service running on another core correctly Phil Yang
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

It is possible that an MT unsafe service gets configured to run on
another core while the service is currently running. This might
result in the MT unsafe service running on multiple cores
simultaneously. Always use 'execute_lock' when the service is
MT unsafe.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6990dc2..b37fc56 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -367,12 +371,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-- 
2.7.4



* [dpdk-dev] [PATCH v2 08/10] service: identify service running on another core correctly
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (6 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 07/10] service: avoid race condition for MT unsafe service Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 09/10] service: optimize with c11 one-way barrier Phil Yang
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters, and reduce the number of
instructions for all cases.

Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b37fc56..670f5a9 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -357,7 +357,7 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -371,7 +371,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -409,24 +409,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to indicate that the service
+	 * is running on a core.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -446,7 +436,7 @@ service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.7.4



* [dpdk-dev] [PATCH v2 09/10] service: optimize with c11 one-way barrier
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (7 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 08/10] service: identify service running on another core correctly Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 10/10] service: relax barriers with C11 atomic operations Phil Yang
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The num_mapped_cores and execute_lock fields are synchronized with the
rte_atomic_XX APIs, which imply a full barrier (DMB) on aarch64. This
patch replaces them with C11 atomic built-ins using one-way barriers.
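
For readers who want to see the pattern in isolation, here is a minimal
sketch of a try-lock built from an ACQUIRE compare-exchange and a RELEASE
store. The struct and function names (demo_service, demo_service_run) are
made up for illustration and are not part of rte_service.c; only the
__atomic built-ins are real.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields guarded by execute_lock. */
struct demo_service {
	uint32_t execute_lock; /* 0 = free, 1 = held */
	uint64_t calls;        /* only touched while the lock is held */
};

/* Run one iteration, or return -EBUSY if the lock is already held. */
static int
demo_service_run(struct demo_service *s)
{
	uint32_t expected = 0;

	/* ACQUIRE: accesses in the critical section cannot be
	 * reordered before the lock is taken.
	 */
	if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
			0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return -EBUSY;

	s->calls++; /* critical section */

	/* RELEASE: accesses in the critical section cannot be
	 * reordered after the unlock.
	 */
	__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
	return 0;
}
```

Operations before the lock and after the unlock are still free to move
into the critical section, which is the ordering a lock needs and no more.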

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 670f5a9..96a59b6 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -42,7 +42,7 @@ struct rte_service_spec_impl {
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	uint32_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +54,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	int32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -329,7 +329,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -372,11 +373,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		uint32_t expected = 0;
+		/* ACQUIRE ordering here is to prevent the callback
+		 * function from hoisting up before the execute_lock
+		 * setting.
+		 */
+		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
+			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		/* RELEASE ordering here is used to pair with ACQUIRE
+		 * above to achieve lock semantic.
+		 */
+		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -412,11 +422,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to indicate that the service
 	 * is running on a core.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
 
 	return ret;
 }
@@ -549,24 +559,32 @@ service_update(uint32_t sid, uint32_t lcore,
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		/* When multiple threads try to update the same lcore
+		 * service concurrently, e.g. set lcore map followed
+		 * by clear lcore map, the unsynchronized service_mask
+		 * values have issues on the num_mapped_cores value
+		 * consistency. So we use ACQUIRE ordering to pair with
+		 * the RELEASE ordering to synchronize the service_mask.
+		 */
+		uint64_t lcore_mapped = __atomic_load_n(
+					&lcore_states[lcore].service_mask,
+					__ATOMIC_ACQUIRE) & sid_mask;
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -622,7 +640,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -705,7 +724,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
-- 
2.7.4



* [dpdk-dev] [PATCH v2 10/10] service: relax barriers with C11 atomic operations
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (8 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 09/10] service: optimize with c11 one-way barrier Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
  10 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The service library uses many rte_smp_rmb/rte_smp_wmb barriers to
guarantee inter-thread visibility of shared data. This patch relaxes
these barriers by using C11 atomic operations with one-way barriers.
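
As a rough illustration of the idea (not code from this patch), a plain
store followed by rte_smp_wmb() can often be expressed as a single RELEASE
store, paired with an ACQUIRE load on the reader side. The names below
(demo_core, demo_core_start, demo_core_poll) are hypothetical.

```c
#include <stdint.h>

#define DEMO_STOPPED 0
#define DEMO_RUNNING 1

struct demo_core {
	uint64_t service_mask; /* written before runstate is published */
	int8_t runstate;
};

/* Writer: set up the mask, then publish the state with RELEASE so the
 * mask write is visible to any thread that observes the new state.
 */
static void
demo_core_start(struct demo_core *c, uint64_t mask)
{
	c->service_mask = mask;
	__atomic_store_n(&c->runstate, DEMO_RUNNING, __ATOMIC_RELEASE);
}

/* Reader: the ACQUIRE load pairs with the RELEASE store above, so a
 * reader that sees DEMO_RUNNING also sees the mask written before it.
 */
static uint64_t
demo_core_poll(struct demo_core *c)
{
	if (__atomic_load_n(&c->runstate, __ATOMIC_ACQUIRE) != DEMO_RUNNING)
		return 0;
	return c->service_mask;
}
```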

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 45 ++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 96a59b6..9c02c24 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -176,9 +176,11 @@ rte_service_set_stats_enable(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_STATS_ENABLED;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_STATS_ENABLED,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_STATS_ENABLED);
+		__atomic_and_fetch(&s->internal_flags,
+			~(SERVICE_F_STATS_ENABLED), __ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -190,9 +192,11 @@ rte_service_set_runstate_mapped_check(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_START_CHECK;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_START_CHECK,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_START_CHECK);
+		__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_START_CHECK),
+			__ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -261,8 +265,8 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
-	rte_service_count++;
+	/* make sure the counter update after the state change. */
+	__atomic_add_fetch(&rte_service_count, 1, __ATOMIC_RELEASE);
 
 	if (id_ptr)
 		*id_ptr = free_slot;
@@ -278,9 +282,10 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
-	s->internal_flags &= ~(SERVICE_F_REGISTERED);
+	/* make sure the counter update before the state change. */
+	__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_REGISTERED),
+			   __ATOMIC_RELEASE);
 
 	/* clear the run-bit in all cores */
 	for (i = 0; i < RTE_MAX_LCORE; i++)
@@ -298,11 +303,12 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -313,11 +319,12 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -439,7 +446,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (__atomic_load_n(&cs->runstate,
+		    __ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -450,8 +458,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -660,9 +666,8 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+			__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
-- 
2.7.4



* [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (9 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 10/10] service: relax barriers with C11 atomic operations Phil Yang
@ 2020-03-17  1:17   ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 01/12] doc: add generic atomic deprecation section Phil Yang
                       ` (14 more replies)
  10 siblings, 15 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

DPDK provides generic rte_atomic APIs to do several atomic operations.
These APIs are using the deprecated __sync built-ins and enforce full
memory barriers on aarch64. However, full barriers are not necessary
in many use cases. In order to address such use cases, C language offers
C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
by making use of the memory ordering parameter provided by the user.
Various patches submitted in the past [2] and the patches in this series
indicate significant performance gains on multiple aarch64 CPUs and no
performance loss on x86.
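
To make the difference concrete, here is a small sketch contrasting the
two styles. The helper names are invented for this example; only the
__sync and __atomic built-ins themselves are real.

```c
#include <stdint.h>

/* Old style: a __sync built-in, which always acts as a full barrier. */
static inline uint32_t
demo_inc_full(uint32_t *cnt)
{
	return __sync_add_and_fetch(cnt, 1);
}

/* New style: the caller chooses the ordering. RELAXED keeps the
 * increment atomic but imposes no ordering on surrounding accesses.
 */
static inline uint32_t
demo_inc_relaxed(uint32_t *cnt)
{
	return __atomic_add_fetch(cnt, 1, __ATOMIC_RELAXED);
}
```

Both helpers return the incremented value. Whether RELAXED is sufficient
depends on the call site, which is why each conversion needs review.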

But the existing rte_atomic API implementations cannot be changed as the
APIs do not take the memory ordering parameter. The only choice available
is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
order to make this change, the following steps are proposed:

[1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
APIs (a script is added to flag the usages).
[2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

This patchset contains:
1) the checkpatch script changes to flag rte_atomic API usage in patches.
2) changes to programmer guide describing writing efficient code for aarch64.
3) changes to various libraries to make use of c11 atomic APIs.

We are planning to replicate this idea across all the other libraries,
drivers, examples, test applications. In the next phase, we will add
changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

v3:
add libatomic dependency for 32-bit clang

v2:
1. fix Clang '-Wincompatible-pointer-types' WARNING.
2. fix typos.

Honnappa Nagarahalli (2):
  service: avoid race condition for MT unsafe service
  service: identify service running on another core correctly

Phil Yang (10):
  doc: add generic atomic deprecation section
  devtools: prevent use of rte atomic APIs in future patches
  eal/build: add libatomic dependency for 32-bit clang
  build: remove redundant code
  vhost: optimize broadcast rarp sync with c11 atomic
  ipsec: optimize with c11 atomic for sa outbound sqn update
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 one-way barrier
  service: relax barriers with C11 atomic operations

 devtools/checkpatches.sh                         |   9 ++
 doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
 drivers/event/octeontx/meson.build               |   5 -
 drivers/event/octeontx2/meson.build              |   5 -
 drivers/event/opdl/meson.build                   |   5 -
 lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
 lib/librte_eal/meson.build                       |   6 +
 lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
 lib/librte_ipsec/sa.h                            |   2 +-
 lib/librte_rcu/meson.build                       |   5 -
 lib/librte_vhost/vhost.h                         |   2 +-
 lib/librte_vhost/vhost_user.c                    |   7 +-
 lib/librte_vhost/virtio_net.c                    |  16 ++-
 13 files changed, 181 insertions(+), 119 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH v3 01/12] doc: add generic atomic deprecation section
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 02/12] devtools: prevent use of rte atomic APIs in future patches Phil Yang
                       ` (13 subsequent siblings)
  14 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Add a guide, with examples, on deprecating the generic rte_atomic_xx
APIs in favor of C11 atomic built-ins.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 60 +++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..b278bc6 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,58 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone, the ordering of the memory operations
+does not matter. For example the packets statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+It just updates the number of transmitted packets, no subsequent logic depends
+on these counters. So the RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+        ...
+        ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
+            ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
+critical section, but the memory operations in the critical section cannot move
+above the lock. In this case, the full memory barrier in the CAS operation can
+be replaced to ACQUIRE. On the other hand, the memory operations after the
+`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
+critical section cannot move below the unlock. So the full barrier in the STORE
+operation can be replaced with RELEASE.
+
 Coding Considerations
 ---------------------
 
-- 
2.7.4



* [dpdk-dev] [PATCH v3 02/12] devtools: prevent use of rte atomic APIs in future patches
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 01/12] doc: add generic atomic deprecation section Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency for 32-bit clang Phil Yang
                       ` (12 subsequent siblings)
  14 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

In order to deprecate the rte_atomic APIs, prevent new patches
from adding uses of the rte_atomic APIs.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 devtools/checkpatches.sh | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/devtools/checkpatches.sh b/devtools/checkpatches.sh
index 1794468..493f48e 100755
--- a/devtools/checkpatches.sh
+++ b/devtools/checkpatches.sh
@@ -61,6 +61,15 @@ check_forbidden_additions() { # <patch>
 		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
 		"$1" || res=1
 
+	# refrain from new additions of 16/32/64 bits rte_atomic_xxx()
+	# multiple folders and expressions are separated by spaces
+	awk -v FOLDERS="lib drivers app examples" \
+		-v EXPRESSIONS="rte_atomic[0-9][0-9]_.*\\\(" \
+		-v RET_ON_FAIL=1 \
+		-v MESSAGE='Using c11 atomic built-ins instead of rte_atomic' \
+		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
+		"$1" || res=1
+
 	# svg figures must be included with wildcard extension
 	# because of png conversion for pdf docs
 	awk -v FOLDERS='doc' \
-- 
2.7.4



* [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency for 32-bit clang
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 01/12] doc: add generic atomic deprecation section Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 02/12] devtools: prevent use of rte atomic APIs in future patches Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-04-24  6:08       ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 04/12] build: remove redundant code Phil Yang
                       ` (11 subsequent siblings)
  14 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

When compiling with clang on 32-bit platforms, we are missing copies
of 64-bit atomic functions. We can solve this by linking against
libatomic for the drivers and libs which need those atomic ops.
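
As a hedged illustration of when the dependency matters: the sketch below
uses a 64-bit atomic and asks the compiler whether such objects are always
lock-free on the target. The helper names are hypothetical. The assumption
is that where the answer is false, the compiler may emit out-of-line
library calls for the atomic, and those are what libatomic provides.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when 64-bit atomics are always lock-free on this target.
 * When false, the compiler may fall back to library calls that
 * have to be resolved at link time (assumption: via libatomic).
 */
static bool
demo_atomic64_is_inline(void)
{
	return __atomic_always_lock_free(sizeof(uint64_t), 0);
}

/* A 64-bit atomic add of the kind the converted code relies on. */
static uint64_t
demo_counter_add(uint64_t *cnt, uint64_t n)
{
	return __atomic_add_fetch(cnt, n, __ATOMIC_RELAXED);
}
```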

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/meson.build | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 4be5118..3b10eae 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -20,6 +20,12 @@ endif
 if cc.has_function('getentropy', prefix : '#include <unistd.h>')
 	cflags += '-DRTE_LIBEAL_USE_GETENTROPY'
 endif
+
+# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
+if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
+    ext_deps += cc.find_library('atomic')
+endif
+
 sources = common_sources + env_sources
 objs = common_objs + env_objs
 headers = common_headers + env_headers
-- 
2.7.4



* [dpdk-dev] [PATCH v3 04/12] build: remove redundant code
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (2 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency for 32-bit clang Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-04-24  6:14       ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 05/12] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
                       ` (10 subsequent siblings)
  14 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

All these libs and drivers are built on top of the eal lib, so when
compiling with clang on 32-bit platforms, linking against libatomic
for the eal lib is sufficient. Remove the redundant code.

Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/event/octeontx/meson.build  | 5 -----
 drivers/event/octeontx2/meson.build | 5 -----
 drivers/event/opdl/meson.build      | 5 -----
 lib/librte_rcu/meson.build          | 5 -----
 4 files changed, 20 deletions(-)

diff --git a/drivers/event/octeontx/meson.build b/drivers/event/octeontx/meson.build
index 73118a4..2b74bb6 100644
--- a/drivers/event/octeontx/meson.build
+++ b/drivers/event/octeontx/meson.build
@@ -11,8 +11,3 @@ sources = files('ssovf_worker.c',
 )
 
 deps += ['common_octeontx', 'mempool_octeontx', 'bus_vdev', 'pmd_octeontx']
-
-# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
-if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
-	ext_deps += cc.find_library('atomic')
-endif
diff --git a/drivers/event/octeontx2/meson.build b/drivers/event/octeontx2/meson.build
index 56febb8..dfe8fc4 100644
--- a/drivers/event/octeontx2/meson.build
+++ b/drivers/event/octeontx2/meson.build
@@ -20,11 +20,6 @@ if not dpdk_conf.get('RTE_ARCH_64')
 	extra_flags += ['-Wno-int-to-pointer-cast', '-Wno-pointer-to-int-cast']
 endif
 
-# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
-if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
-	ext_deps += cc.find_library('atomic')
-endif
-
 foreach flag: extra_flags
 	if cc.has_argument(flag)
 		cflags += flag
diff --git a/drivers/event/opdl/meson.build b/drivers/event/opdl/meson.build
index e67b164..566462f 100644
--- a/drivers/event/opdl/meson.build
+++ b/drivers/event/opdl/meson.build
@@ -10,8 +10,3 @@ sources = files(
 	'opdl_test.c',
 )
 deps += ['bus_vdev']
-
-# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
-if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
-	ext_deps += cc.find_library('atomic')
-endif
diff --git a/lib/librte_rcu/meson.build b/lib/librte_rcu/meson.build
index 62920ba..0c2d5a2 100644
--- a/lib/librte_rcu/meson.build
+++ b/lib/librte_rcu/meson.build
@@ -5,8 +5,3 @@ allow_experimental_apis = true
 
 sources = files('rte_rcu_qsbr.c')
 headers = files('rte_rcu_qsbr.h')
-
-# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
-if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
-	ext_deps += cc.find_library('atomic')
-endif
-- 
2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 05/12] vhost: optimize broadcast rarp sync with c11 atomic
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (3 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 04/12] build: remove redundant code Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-04-23 16:54       ` [dpdk-dev] [PATCH v2] " Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
                       ` (9 subsequent siblings)
  14 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The rarp packet broadcast flag is synchronized with rte_atomic_XX APIs,
which are full barriers (DMB) on aarch64. This patch optimizes it with
c11 atomic one-way barriers.
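
A minimal sketch of the pattern, assuming a cut-down stand-in for the
relevant virtio_net fields. The struct and function names here are
hypothetical; only the __atomic builtins and the memory orderings mirror
the diff below. The writer copies the MAC and then sets the flag with a
release store. The reader checks the flag with an acquire load first and
only attempts the compare-and-exchange when the flag looks set.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the two virtio_net fields involved. */
struct dev_sketch {
	uint8_t mac[6];
	int16_t broadcast_rarp;
};

/* Writer side: the release store keeps the MAC copy before the flag. */
static void
sketch_request_rarp(struct dev_sketch *dev, const uint8_t mac[6])
{
	memcpy(dev->mac, mac, sizeof(dev->mac));
	__atomic_store_n(&dev->broadcast_rarp, 1, __ATOMIC_RELEASE);
}

/*
 * Reader side: returns 1 if this caller claimed the pending request.
 * The cheap load comes first so the cacheline is only written when
 * a request is likely to be pending.
 */
static int
sketch_claim_rarp(struct dev_sketch *dev)
{
	int16_t expected = 1;

	if (!__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_ACQUIRE))
		return 0;

	return __atomic_compare_exchange_n(&dev->broadcast_rarp, &expected,
			0, 0, __ATOMIC_RELEASE, __ATOMIC_RELAXED);
}
```

The acquire load pairs with the release store, so a reader that observes
the flag as set also observes the copied MAC.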

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
---
 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/vhost_user.c |  7 +++----
 lib/librte_vhost/virtio_net.c | 16 +++++++++-------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2087d14..0e22125 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,7 +350,7 @@ struct virtio_net {
 	uint32_t		flags;
 	uint16_t		vhost_hlen;
 	/* to tell if we need broadcast rarp packet */
-	rte_atomic16_t		broadcast_rarp;
+	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
 	int			extbuf;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index bd1be01..857187d 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -2145,11 +2145,10 @@ vhost_user_send_rarp(struct virtio_net **pdev, struct VhostUserMsg *msg,
 	 * Set the flag to inject a RARP broadcast packet at
 	 * rte_vhost_dequeue_burst().
 	 *
-	 * rte_smp_wmb() is for making sure the mac is copied
-	 * before the flag is set.
+	 * __ATOMIC_RELEASE ordering is for making sure the mac is
+	 * copied before the flag is set.
 	 */
-	rte_smp_wmb();
-	rte_atomic16_set(&dev->broadcast_rarp, 1);
+	__atomic_store_n(&dev->broadcast_rarp, 1, __ATOMIC_RELEASE);
 	did = dev->vdpa_dev_id;
 	vdpa_dev = rte_vdpa_get_device(did);
 	if (vdpa_dev && vdpa_dev->ops->migration_done)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 37c47c7..fa10deb 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2203,6 +2203,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct virtio_net *dev;
 	struct rte_mbuf *rarp_mbuf = NULL;
 	struct vhost_virtqueue *vq;
+	int16_t success = 1;
 
 	dev = get_device(vid);
 	if (!dev)
@@ -2249,16 +2250,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	 *
 	 * broadcast_rarp shares a cacheline in the virtio_net structure
 	 * with some fields that are accessed during enqueue and
-	 * rte_atomic16_cmpset() causes a write if using cmpxchg. This could
-	 * result in false sharing between enqueue and dequeue.
+	 * __atomic_compare_exchange_n causes a write if it performs the
+	 * compare and exchange. This could result in false sharing between
+	 * enqueue and dequeue.
 	 *
 	 * Prevent unnecessary false sharing by reading broadcast_rarp first
-	 * and only performing cmpset if the read indicates it is likely to
-	 * be set.
+	 * and only performing compare and exchange if the read indicates it
+	 * is likely to be set.
 	 */
-	if (unlikely(rte_atomic16_read(&dev->broadcast_rarp) &&
-			rte_atomic16_cmpset((volatile uint16_t *)
-				&dev->broadcast_rarp.cnt, 1, 0))) {
+	if (unlikely(__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_ACQUIRE) &&
+			__atomic_compare_exchange_n(&dev->broadcast_rarp,
+			&success, 0, 0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))) {
 
 		rarp_mbuf = rte_net_make_rarp_packet(mbuf_pool, &dev->mac);
 		if (rarp_mbuf == NULL) {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (4 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 05/12] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-23 18:48       ` Ananyev, Konstantin
  2020-04-23 17:16       ` [dpdk-dev] [PATCH v2] " Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions Phil Yang
                       ` (8 subsequent siblings)
  14 siblings, 2 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

For SA outbound packets, rte_atomic64_add_return is used to generate
the SQN atomically. On aarch64 this introduces an unnecessary full
barrier, because the rte_atomic_XX APIs are implemented with the
'__sync' builtins. This patch replaces it with a c11 atomic operation,
which eliminates the expensive barrier on aarch64.
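
As a rough sketch of the idea (the helper name is hypothetical and is not
part of librte_ipsec): reserving a block of sequence numbers only needs the
add itself to be atomic. It does not need to order any other memory
accesses, so relaxed ordering is assumed to be enough here.

```c
#include <stdint.h>

/*
 * Reserve n consecutive sequence numbers and return the last one.
 * Relaxed ordering: the update is atomic, but no barrier is implied.
 */
static uint64_t
sketch_sqn_reserve(uint64_t *sqn, uint32_t n)
{
	return __atomic_add_fetch(sqn, n, __ATOMIC_RELAXED);
}
```

Two threads calling this concurrently still get disjoint ranges, since the
read-modify-write is a single atomic operation.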

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_ipsec/ipsec_sqn.h | 3 ++-
 lib/librte_ipsec/sa.h        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
index 0c2f76a..e884af7 100644
--- a/lib/librte_ipsec/ipsec_sqn.h
+++ b/lib/librte_ipsec/ipsec_sqn.h
@@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
 
 	n = *num;
 	if (SQN_ATOMIC(sa))
-		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
+		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
+			__ATOMIC_RELAXED);
 	else {
 		sqn = sa->sqn.outb.raw + n;
 		sa->sqn.outb.raw = sqn;
diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
index d22451b..cab9a2e 100644
--- a/lib/librte_ipsec/sa.h
+++ b/lib/librte_ipsec/sa.h
@@ -120,7 +120,7 @@ struct rte_ipsec_sa {
 	 */
 	union {
 		union {
-			rte_atomic64_t atom;
+			uint64_t atom;
 			uint64_t raw;
 		} outb;
 		struct {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (5 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-04-03 11:57       ` Van Haaren, Harry
                         ` (2 more replies)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 08/12] service: remove redundant code Phil Yang
                       ` (7 subsequent siblings)
  14 siblings, 3 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, stable

Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b0b78ba..2117726 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -336,7 +336,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -379,10 +379,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -436,7 +436,7 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
@@ -706,7 +706,7 @@ rte_service_lcore_start(uint32_t lcore)
 	 */
 	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -785,7 +785,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
+service_dump_one(FILE *f, struct rte_service_spec_impl *s,
 		     uint64_t all_cycles, uint32_t reset)
 {
 	/* avoid divide by zero */
@@ -818,7 +818,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, 0, reset);
 	return 0;
 }
 
@@ -876,7 +876,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, total_cycles, reset);
 		return 0;
 	}
 
@@ -886,7 +886,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], total_cycles, reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 08/12] service: remove redundant code
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (6 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-04-03 11:58       ` Van Haaren, Harry
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service Phil Yang
                       ` (6 subsequent siblings)
  14 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Stable

The service id is already validated in the calling functions, so remove
the redundant validation code inside the service_update function.

Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: Stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 2117726..557b5a9 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -552,21 +552,10 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
+	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
 	if (!lcore_states[lcore].is_service_core)
@@ -598,19 +587,23 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (7 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 08/12] service: remove redundant code Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-04-03 11:58       ` Van Haaren, Harry
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly Phil Yang
                       ` (5 subsequent siblings)
  14 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

It is possible that an MT unsafe service might get configured to
run on another core while the service is currently running. This
might result in the MT unsafe service running on multiple cores
simultaneously. Always use 'execute_lock' when the service is
MT unsafe.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 557b5a9..32a2f8a 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (8 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-04-03 11:58       ` Van Haaren, Harry
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier Phil Yang
                       ` (4 subsequent siblings)
  14 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters and reduce the number of
instructions for all cases.

Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 32a2f8a..0843c3c 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -360,7 +360,7 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -374,7 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to indicate that the service
+	 * is running on a core.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -449,7 +439,7 @@ service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (9 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-04-03 11:58       ` Van Haaren, Harry
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations Phil Yang
                       ` (3 subsequent siblings)
  14 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The num_mapped_cores and execute_lock are synchronized with rte_atomic_XX
APIs, which are full barriers (DMB) on aarch64. This patch optimizes them
with c11 atomic one-way barriers.
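
A minimal sketch of the lock semantics used for execute_lock. The two
helper names are hypothetical; the builtins and orderings follow the diff
below. The acquire on a successful compare-and-exchange keeps the critical
section from moving above the lock, and the release store keeps it from
moving below the unlock.

```c
#include <stdint.h>

/* Try to take the lock; returns 1 on success, 0 if it is already held. */
static int
sketch_execute_trylock(uint32_t *lock)
{
	uint32_t expected = 0;

	return __atomic_compare_exchange_n(lock, &expected, 1, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Release the lock; pairs with the ACQUIRE in the trylock above. */
static void
sketch_execute_unlock(uint32_t *lock)
{
	__atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```

A caller that fails the trylock would return -EBUSY, as service_run does,
rather than spin.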

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 0843c3c..c033224 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -42,7 +42,7 @@ struct rte_service_spec_impl {
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	uint32_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +54,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	int32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -332,7 +332,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -375,11 +376,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		uint32_t expected = 0;
+		/* ACQUIRE ordering here is to prevent the callback
+		 * function from hoisting up before the execute_lock
+		 * setting.
+		 */
+		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
+			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		/* RELEASE ordering here is used to pair with ACQUIRE
+		 * above to achieve lock semantic.
+		 */
+		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -415,11 +425,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to indicate that the service
 	 * is running on a core.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
 
 	return ret;
 }
@@ -552,24 +562,32 @@ service_update(uint32_t sid, uint32_t lcore,
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		/* When multiple threads try to update the same lcore
+		 * service concurrently, e.g. set lcore map followed
+		 * by clear lcore map, the unsynchronized service_mask
+		 * values have issues on the num_mapped_cores value
+		 * consistency. So we use ACQUIRE ordering to pair with
+		 * the RELEASE ordering to synchronize the service_mask.
+		 */
+		uint64_t lcore_mapped = __atomic_load_n(
+					&lcore_states[lcore].service_mask,
+					__ATOMIC_ACQUIRE) & sid_mask;
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -625,7 +643,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -708,7 +727,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
-- 
2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (10 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-04-03 11:58       ` Van Haaren, Harry
  2020-03-18 14:01     ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Van Haaren, Harry
                       ` (2 subsequent siblings)
  14 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

To guarantee inter-thread visibility within the shareable domain, the
service library uses a lot of rte_smp_r/wmb barriers. This patch relaxes
these barriers by using c11 atomic one-way barrier operations.
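
A simplified sketch of the replacement pattern, assuming a cut-down core
state. The struct, the enum values and the function names are hypothetical,
and the loop bound exists only so the sketch terminates. Instead of a plain
store followed by rte_smp_wmb() on the writer and a plain load followed by
rte_smp_rmb() on the reader, the runstate is written with a release store
and read with an acquire load.

```c
#include <stdint.h>

/* Hypothetical run states; the values are assumed for this sketch. */
enum { SKETCH_STOPPED = 0, SKETCH_RUNNING = 1 };

/* Hypothetical cut-down version of the per-lcore state. */
struct core_sketch {
	uint64_t service_mask;
	uint8_t runstate;
	uint64_t loops;
};

/* Writer: set up the mask, then publish RUNNING with release. */
static void
sketch_core_start(struct core_sketch *cs, uint64_t mask)
{
	cs->service_mask = mask;
	__atomic_store_n(&cs->runstate, SKETCH_RUNNING, __ATOMIC_RELEASE);
}

/* Writer: request the runner loop to exit. */
static void
sketch_core_stop(struct core_sketch *cs)
{
	__atomic_store_n(&cs->runstate, SKETCH_STOPPED, __ATOMIC_RELEASE);
}

/* Reader: poll while RUNNING; returns the iterations performed. */
static uint64_t
sketch_core_poll(struct core_sketch *cs, uint64_t max_loops)
{
	uint64_t n = 0;

	while (n < max_loops &&
			__atomic_load_n(&cs->runstate,
				__ATOMIC_ACQUIRE) == SKETCH_RUNNING) {
		cs->loops++;
		n++;
	}
	return n;
}
```

A reader that sees SKETCH_RUNNING through the acquire load also sees the
service_mask written before the release store, without a separate barrier.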

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 45 ++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index c033224..d31663e 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -179,9 +179,11 @@ rte_service_set_stats_enable(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_STATS_ENABLED;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_STATS_ENABLED,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_STATS_ENABLED);
+		__atomic_and_fetch(&s->internal_flags,
+			~(SERVICE_F_STATS_ENABLED), __ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -193,9 +195,11 @@ rte_service_set_runstate_mapped_check(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_START_CHECK;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_START_CHECK,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_START_CHECK);
+		__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_START_CHECK),
+			__ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -264,8 +268,8 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
-	rte_service_count++;
+	/* make sure the counter is updated after the state change. */
+	__atomic_add_fetch(&rte_service_count, 1, __ATOMIC_RELEASE);
 
 	if (id_ptr)
 		*id_ptr = free_slot;
@@ -281,9 +285,10 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
-	s->internal_flags &= ~(SERVICE_F_REGISTERED);
+	/* make sure the counter is updated before the state change. */
+	__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_REGISTERED),
+			   __ATOMIC_RELEASE);
 
 	/* clear the run-bit in all cores */
 	for (i = 0; i < RTE_MAX_LCORE; i++)
@@ -301,11 +306,12 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -316,11 +322,12 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -442,7 +449,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (__atomic_load_n(&cs->runstate,
+		    __ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -453,8 +461,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -663,9 +669,8 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+			__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
-- 
2.7.4
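
For readers less familiar with one-way barriers, here is a minimal sketch of
the publish pattern that the rte_service_component_register() hunk relies on.
It is not the rte_service.c code: every demo_* name is hypothetical, a single
writer is assumed, and the reader-side acquire load is an illustration of how
such a release increment is normally paired, not a claim about what the
library's readers do.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_SLOTS 8u
#define DEMO_F_REGISTERED 1u

/* Hypothetical stand-ins for the service array and rte_service_count. */
static uint32_t demo_flags[DEMO_MAX_SLOTS];
static uint32_t demo_count;

/*
 * Writer side (single writer assumed): initialise the next slot with a
 * plain store, then publish it with a release increment. The release
 * ordering keeps the slot write from being reordered after the counter
 * update, which is the job rte_smp_wmb() did before.
 */
static int demo_register(void)
{
	uint32_t slot = __atomic_load_n(&demo_count, __ATOMIC_RELAXED);

	if (slot >= DEMO_MAX_SLOTS)
		return -1;
	demo_flags[slot] |= DEMO_F_REGISTERED;
	__atomic_add_fetch(&demo_count, 1, __ATOMIC_RELEASE);
	return (int)slot;
}

/*
 * Reader side: the acquire load pairs with the release increment, so
 * every slot below the observed count is guaranteed to be initialised.
 */
static uint32_t demo_count_registered(void)
{
	uint32_t n = __atomic_load_n(&demo_count, __ATOMIC_ACQUIRE);
	uint32_t i;

	for (i = 0; i < n; i++)
		assert(demo_flags[i] & DEMO_F_REGISTERED);
	return n;
}
```

The writer and reader would normally run on different lcores. The ordering
only holds when the release on one side is matched by an acquire on the
other.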



* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (11 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations Phil Yang
@ 2020-03-18 14:01     ` Van Haaren, Harry
  2020-03-18 15:13       ` Thomas Monjalon
  2020-03-20  4:51       ` Honnappa Nagarahalli
  2020-04-03  7:23     ` Mattias Rönnblom
  2020-05-12  8:03     ` [dpdk-dev] [PATCH v4 0/4] " Phil Yang
  14 siblings, 2 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-03-18 14:01 UTC (permalink / raw)
  To: Phil Yang, thomas, Ananyev, Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Hi Phil & Honnappa,

> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com
> Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> 
> DPDK provides generic rte_atomic APIs to do several atomic operations.
> These APIs are using the deprecated __sync built-ins and enforce full
> memory barriers on aarch64. However, full barriers are not necessary
> in many use cases. In order to address such use cases, C language offers
> C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> by making use of the memory ordering parameter provided by the user.
> Various patches submitted in the past [2] and the patches in this series
> indicate significant performance gains on multiple aarch64 CPUs and no
> performance loss on x86.
> 
> But the existing rte_atomic API implementations cannot be changed as the
> APIs do not take the memory ordering parameter. The only choice available
> is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> order to make this change, the following steps are proposed:
> 
> [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> APIs (a script is added to flag the usages).
> [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

On [1] above, I feel that deprecating DPDK's atomic functions and failing checkpatch is
a bit sudden. Perhaps noting that DPDK will move to a C11-based atomics model in a
future release (20.11?) is a more gradual step towards the goal, and at that
point we could add a checkpatch warning for additions of rte_atomic*?

More on [2] in context below.

The above is my point of view; of course I'd like more people from the DPDK community
to provide their input too.


> This patchset contains:
> 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> 2) changes to programmer guide describing writing efficient code for aarch64.
> 3) changes to various libraries to make use of c11 atomic APIs.
> 
> We are planning to replicate this idea across all the other libraries,
> drivers, examples, test applications. In the next phase, we will add
> changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

About 6 of the 12 patches in this C11 set target the Service Cores area of DPDK. I have some concerns
over the increased complexity of the C11 implementation vs the (already complex) rte_atomic implementation today.
I see other patchsets enabling C11 across other DPDK components, so maybe we should also discuss C11
enabling in a wider context than just service cores?

I don't think it is fair to expect all developers to be well versed in C11 atomic semantics, such as
the complex interactions between the various C11 RELEASE and ACQUIRE barriers.

As maintainer of Service Cores I'm hesitant to accept a large-scale refactor of the atomics implementation,
as it could lead to racy bugs that are likely to be extremely difficult to track down. (The recent race-on-exit
bug has shown how hard these are to reproduce, and that's with an atomics model I'm quite familiar with.)

Let me be very clear: I don't wish to block a C11 atomic implementation, but I'd like to discuss how we
(DPDK community) can best port-to and maintain a complex multi-threaded service library with best-in-class
performance for the workload.

To put some discussions/solutions on the table:
- Shared Maintainership of a component?
     Split in functionality and C11 atomics implementation
     Obviously there would be collaboration required in such a case.
- Maybe shared maintainership is too much?
     A gentleman's/woman's agreement of "helping out" with C11 atomics debug is enough?


Hope my concerns are understandable, and of course input/feedback welcomed! -Harry


PS: Apologies for the delay in reply - was OOO on Irish national holiday.


> v3:
> add libatomic dependency for 32-bit clang
> 
> v2:
> 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> 2. fix typos.
> 
> Honnappa Nagarahalli (2):
>   service: avoid race condition for MT unsafe service
>   service: identify service running on another core correctly
> 
> Phil Yang (10):
>   doc: add generic atomic deprecation section
>   devtools: prevent use of rte atomic APIs in future patches
>   eal/build: add libatomic dependency for 32-bit clang
>   build: remove redundant code
>   vhost: optimize broadcast rarp sync with c11 atomic
>   ipsec: optimize with c11 atomic for sa outbound sqn update
>   service: remove rte prefix from static functions
>   service: remove redundant code
>   service: optimize with c11 one-way barrier
>   service: relax barriers with C11 atomic operations
> 
>  devtools/checkpatches.sh                         |   9 ++
>  doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
>  drivers/event/octeontx/meson.build               |   5 -
>  drivers/event/octeontx2/meson.build              |   5 -
>  drivers/event/opdl/meson.build                   |   5 -
>  lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
>  lib/librte_eal/meson.build                       |   6 +
>  lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
>  lib/librte_ipsec/sa.h                            |   2 +-
>  lib/librte_rcu/meson.build                       |   5 -
>  lib/librte_vhost/vhost.h                         |   2 +-
>  lib/librte_vhost/vhost_user.c                    |   7 +-
>  lib/librte_vhost/virtio_net.c                    |  16 ++-
>  13 files changed, 181 insertions(+), 119 deletions(-)
> 
> --
> 2.7.4



* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-18 14:01     ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Van Haaren, Harry
@ 2020-03-18 15:13       ` Thomas Monjalon
  2020-03-20  5:01         ` Honnappa Nagarahalli
  2020-03-20  4:51       ` Honnappa Nagarahalli
  1 sibling, 1 reply; 219+ messages in thread
From: Thomas Monjalon @ 2020-03-18 15:13 UTC (permalink / raw)
  To: Phil Yang, Van Haaren, Harry, Honnappa.Nagarahalli
  Cc: Ananyev, Konstantin, stephen, maxime.coquelin, dev,
	david.marchand, jerinj, hemant.agrawal, gavin.hu, ruifeng.wang,
	joyce.kong, nd

18/03/2020 15:01, Van Haaren, Harry:
> Hi Phil & Honnappa,
> 
> From: Phil Yang <phil.yang@arm.com>
> > 
> > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > These APIs are using the deprecated __sync built-ins and enforce full
> > memory barriers on aarch64. However, full barriers are not necessary
> > in many use cases. In order to address such use cases, C language offers
> > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> > by making use of the memory ordering parameter provided by the user.
> > Various patches submitted in the past [2] and the patches in this series
> > indicate significant performance gains on multiple aarch64 CPUs and no
> > performance loss on x86.
> > 
> > But the existing rte_atomic API implementations cannot be changed as the
> > APIs do not take the memory ordering parameter. The only choice available
> > is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> > order to make this change, the following steps are proposed:
> > 
> > [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> > APIs (a script is added to flag the usages).
> > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> 
> On [1] above, I feel deprecating DPDK's atomic functions and failing checkpatch is
> a bit sudden. Perhaps noting that in a future release (20.11?) DPDK will move to a
> C11 based atomics model is a more gradual step to achieving the goal, and at that
> point add a checkpatch warning for additions of rte_atomic*?
> 
> More on [2] in context below.
> 
> The above is my point-of-view, of course I'd like more people from the DPDK community
> to provide their input too.
> 
> 
> > This patchset contains:
> > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > 2) changes to programmer guide describing writing efficient code for aarch64.
> > 3) changes to various libraries to make use of c11 atomic APIs.
> > 
> > We are planning to replicate this idea across all the other libraries,
> > drivers, examples, test applications. In the next phase, we will add
> > changes to the mbuf, the EAL interrupts and the event timer adapter libraries.
> 
> About ~6/12 patches of this C11 set are targeting the Service Cores area of DPDK. I have some concerns
> over increased complexity of C11 implementation vs the (already complex) rte_atomic implementation today.
> I see other patchsets enabling C11 across other DPDK components, so maybe we should also discuss C11
> enabling in a wider context than just service cores?
> 
> I don't think it fair to expect all developers to be well versed in C11 atomic semantics like
> understanding the complex interactions between the various C11 RELEASE, ACQUIRE barriers requires.
> 
> As maintainer of Service Cores I'm hesitant to accept the large-scale refactor of atomic-implementation,
> as it could lead to racy bugs that are likely extremely difficult to track down. (The recent race-on-exit
> has proven the difficulty in reproducing, and that's with an atomics model I'm quite familiar with).
> 
> Let me be very clear: I don't wish to block a C11 atomic implementation, but I'd like to discuss how we
> (DPDK community) can best port-to and maintain a complex multi-threaded service library with best-in-class
> performance for the workload.
> 
> To put some discussions/solutions on the table:
> - Shared Maintainership of a component?
>      Split in functionality and C11 atomics implementation
>      Obviously there would be collaboration required in such a case.
> - Maybe shared maintainership is too much?
>      A gentleman's/woman's agreement of "helping out" with C11 atomics debug is enough?
> 
> 
> Hope my concerns are understandable, and of course input/feedback welcomed! -Harry

Thanks for raising the issue, Harry.

I think we should have at least two official maintainers for C11 atomics in general.
The C11 conversion is an ongoing, progressive effort and should be merged step by step.
If the C11 maintainers fail to fix some issues on time, then we can put the effort on hold.
Does it make sense?




* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-18 14:01     ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Van Haaren, Harry
  2020-03-18 15:13       ` Thomas Monjalon
@ 2020-03-20  4:51       ` Honnappa Nagarahalli
  2020-03-20 18:32         ` Honnappa Nagarahalli
  1 sibling, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-20  4:51 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>

> > Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> >
> > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > These APIs are using the deprecated __sync built-ins and enforce full
> > memory barriers on aarch64. However, full barriers are not necessary
> > in many use cases. In order to address such use cases, C language
> > offers
> > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > control by making use of the memory ordering parameter provided by the
> user.
> > Various patches submitted in the past [2] and the patches in this
> > series indicate significant performance gains on multiple aarch64 CPUs
> > and no performance loss on x86.
> >
> > But the existing rte_atomic API implementations cannot be changed as
> > the APIs do not take the memory ordering parameter. The only choice
> > available is replacing the usage of the rte_atomic APIs with C11
> > atomic APIs. In order to make this change, the following steps are proposed:
> >
> > [1] deprecate rte_atomic APIs so that future patches do not use
> > rte_atomic APIs (a script is added to flag the usages).
> > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> 
> On [1] above, I feel deprecating DPDK's atomic functions and failing checkpatch
> is a bit sudden. Perhaps noting that in a future release (20.11?) DPDK will
> move to a
> C11 based atomics model is a more gradual step to achieving the goal, and at
> that point add a checkpatch warning for additions of rte_atomic*?
We have been working on changing existing usages of rte_atomic APIs in DPDK to use C11 atomics. Usually, the x.11 releases have a significant number of changes (not sure how many would use rte_atomic APIs). I would prefer that no additional code using rte_atomic APIs is added in 20.11. However, I am open to suggestions on the exact time frame.
Once we decide on the release, I think it makes sense to add a 'warning' in checkpatch to indicate the deprecation timeline and turn it into an 'error' after the release.

> 
> More on [2] in context below.
> 
> The above is my point-of-view, of course I'd like more people from the DPDK
> community to provide their input too.
> 
> 
> > This patchset contains:
> > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > 2) changes to programmer guide describing writing efficient code for
> aarch64.
> > 3) changes to various libraries to make use of c11 atomic APIs.
> >
> > We are planning to replicate this idea across all the other libraries,
> > drivers, examples, test applications. In the next phase, we will add
> > changes to the mbuf, the EAL interrupts and the event timer adapter
> libraries.
> 
> About ~6/12 patches of this C11 set are targeting the Service Cores area of
> DPDK. I have some concerns over increased complexity of C11 implementation
> vs the (already complex) rte_atomic implementation today.
I agree that the C11 changes are complex, especially if one is just starting to understand what these APIs provide. In my experience, once a few underlying concepts are understood, reviewing or making changes does not take too much time.

> I see other patchsets enabling C11 across other DPDK components, so maybe
> we should also discuss C11 enabling in a wider context than just service cores?
Yes, agree. We are in the process of making changes to other areas as well.

> 
> I don't think it fair to expect all developers to be well versed in C11 atomic
> semantics like understanding the complex interactions between the various
> C11 RELEASE, ACQUIRE barriers requires.
C11 has been around for some time now. To my surprise, OVS already uses C11 APIs extensively. VPP has been accepting C11-related changes for the past couple of years. Having said that, I agree in general that not everyone is well versed.

> 
> As maintainer of Service Cores I'm hesitant to accept the large-scale refactor
Right now, the patches are split into multiple commits. If required, I can host a call to go over simple C11 API usages (sufficient to cover the usage in service cores) and the changes in this patch. If you find that particular areas need more explanation, I can work on providing additional information such as memory order ladder diagrams. Please let me know what you think.
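
As a starting point for such a walkthrough, here is a minimal sketch of the
kind of "simple usage" involved: a control thread publishes a run state with
a store-release, and the runner observes it with a load-acquire, much like the
service_runner_func() loop in patch 12/12. This is only an illustration; the
demo_* names and the struct layout are hypothetical and not taken from
rte_service.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_STOPPED = 0, DEMO_RUNNING = 1 };

/* Hypothetical, simplified stand-in for a per-lcore state structure. */
struct demo_core_state {
	uint32_t runstate;     /* written by the control thread */
	uint64_t service_mask; /* configuration published via runstate */
	uint64_t loops;        /* private to the runner */
};

/* Control thread: write the configuration, then publish it. */
static void demo_core_start(struct demo_core_state *cs, uint64_t mask)
{
	assert(cs != NULL);
	cs->service_mask = mask; /* plain store, ordered by the release below */
	__atomic_store_n(&cs->runstate, DEMO_RUNNING, __ATOMIC_RELEASE);
}

/* Control thread: ask the runner to stop. */
static void demo_core_stop(struct demo_core_state *cs)
{
	assert(cs != NULL);
	__atomic_store_n(&cs->runstate, DEMO_STOPPED, __ATOMIC_RELEASE);
}

/*
 * Runner: one loop iteration. The acquire load pairs with the release
 * stores above, so once RUNNING is observed the mask written before it
 * is visible too. Returns 0 when the runner should exit its loop.
 */
static int demo_core_poll(struct demo_core_state *cs, uint64_t *mask_seen)
{
	assert(cs != NULL && mask_seen != NULL);
	if (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) != DEMO_RUNNING)
		return 0;
	*mask_seen = cs->service_mask;
	cs->loops++;
	return 1;
}
```

A runner would call demo_core_poll() in a while loop. No separate
rte_smp_rmb() is needed at the end of each iteration because the acquire load
at the top of the next one already provides the ordering.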

> of atomic-implementation, as it could lead to racy bugs that are likely
> extremely difficult to track down. (The recent race-on-exit has proven the
> difficulty in reproducing, and that's with an atomics model I'm quite familiar
> with).
> 
> Let me be very clear: I don't wish to block a C11 atomic implementation, but
> I'd like to discuss how we (DPDK community) can best port-to and maintain a
> complex multi-threaded service library with best-in-class performance for the
> workload.
> 
> To put some discussions/solutions on the table:
> - Shared Maintainership of a component?
>      Split in functionality and C11 atomics implementation
>      Obviously there would be collaboration required in such a case.
> - Maybe shared maintainership is too much?
>      A gentleman's/woman's agreement of "helping out" with C11 atomics debug
> is enough?
I think shared maintainership could be too much as there are many changes. But I and other engineers from Arm (I would include the Arm ecosystem as well) can definitely help out with debug and reviews involving C11 APIs. (I see Thomas's reply on this topic.)

> 
> 
> Hope my concerns are understandable, and of course input/feedback
> welcomed! -Harry
No worries at all. We are here to help and make this as easy as possible.

> 
> 
> PS: Apologies for the delay in reply - was OOO on Irish national holiday.
Same here, was on vacation for 3 days.

> 
> 
> > v3:
> > add libatomic dependency for 32-bit clang
> >
> > v2:
> > 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> > 2. fix typos.
> >
> > Honnappa Nagarahalli (2):
> >   service: avoid race condition for MT unsafe service
> >   service: identify service running on another core correctly
> >
> > Phil Yang (10):
> >   doc: add generic atomic deprecation section
> >   devtools: prevent use of rte atomic APIs in future patches
> >   eal/build: add libatomic dependency for 32-bit clang
> >   build: remove redundant code
> >   vhost: optimize broadcast rarp sync with c11 atomic
> >   ipsec: optimize with c11 atomic for sa outbound sqn update
> >   service: remove rte prefix from static functions
> >   service: remove redundant code
> >   service: optimize with c11 one-way barrier
> >   service: relax barriers with C11 atomic operations
> >
> >  devtools/checkpatches.sh                         |   9 ++
> >  doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
> >  drivers/event/octeontx/meson.build               |   5 -
> >  drivers/event/octeontx2/meson.build              |   5 -
> >  drivers/event/opdl/meson.build                   |   5 -
> >  lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
> >  lib/librte_eal/meson.build                       |   6 +
> >  lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
> >  lib/librte_ipsec/sa.h                            |   2 +-
> >  lib/librte_rcu/meson.build                       |   5 -
> >  lib/librte_vhost/vhost.h                         |   2 +-
> >  lib/librte_vhost/vhost_user.c                    |   7 +-
> >  lib/librte_vhost/virtio_net.c                    |  16 ++-
> >  13 files changed, 181 insertions(+), 119 deletions(-)
> >
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-18 15:13       ` Thomas Monjalon
@ 2020-03-20  5:01         ` Honnappa Nagarahalli
  2020-03-20 12:20           ` Jerin Jacob
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-20  5:01 UTC (permalink / raw)
  To: thomas, Phil Yang, Van Haaren, Harry
  Cc: Ananyev, Konstantin, stephen, maxime.coquelin, dev,
	david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>

> Subject: Re: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> 
> 18/03/2020 15:01, Van Haaren, Harry:
> > Hi Phil & Honnappa,
> >
> > From: Phil Yang <phil.yang@arm.com>
> > >
> > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > These APIs are using the deprecated __sync built-ins and enforce
> > > full memory barriers on aarch64. However, full barriers are not
> > > necessary in many use cases. In order to address such use cases, C
> > > language offers
> > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > control by making use of the memory ordering parameter provided by the
> user.
> > > Various patches submitted in the past [2] and the patches in this
> > > series indicate significant performance gains on multiple aarch64
> > > CPUs and no performance loss on x86.
> > >
> > > But the existing rte_atomic API implementations cannot be changed as
> > > the APIs do not take the memory ordering parameter. The only choice
> > > available is replacing the usage of the rte_atomic APIs with C11
> > > atomic APIs. In order to make this change, the following steps are
> proposed:
> > >
> > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > rte_atomic APIs (a script is added to flag the usages).
> > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> >
> > On [1] above, I feel deprecating DPDK's atomic functions and failing
> > checkpatch is a bit sudden. Perhaps noting that in a future release
> > (20.11?) DPDK will move to a
> > C11 based atomics model is a more gradual step to achieving the goal,
> > and at that point add a checkpatch warning for additions of rte_atomic*?
> >
> > More on [2] in context below.
> >
> > The above is my point-of-view, of course I'd like more people from the
> > DPDK community to provide their input too.
> >
> >
> > > This patchset contains:
> > > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > > 2) changes to programmer guide describing writing efficient code for
> aarch64.
> > > 3) changes to various libraries to make use of c11 atomic APIs.
> > >
> > > We are planning to replicate this idea across all the other
> > > libraries, drivers, examples, test applications. In the next phase,
> > > we will add changes to the mbuf, the EAL interrupts and the event timer
> adapter libraries.
> >
> > About ~6/12 patches of this C11 set are targeting the Service Cores
> > area of DPDK. I have some concerns over increased complexity of C11
> implementation vs the (already complex) rte_atomic implementation today.
> > I see other patchsets enabling C11 across other DPDK components, so
> > maybe we should also discuss C11 enabling in a wider context than just
> service cores?
> >
> > I don't think it fair to expect all developers to be well versed in
> > C11 atomic semantics like understanding the complex interactions between
> the various C11 RELEASE, ACQUIRE barriers requires.
> >
> > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > refactor of atomic-implementation, as it could lead to racy bugs that
> > are likely extremely difficult to track down. (The recent race-on-exit has
> proven the difficulty in reproducing, and that's with an atomics model I'm
> quite familiar with).
> >
> > Let me be very clear: I don't wish to block a C11 atomic
> > implementation, but I'd like to discuss how we (DPDK community) can
> > best port-to and maintain a complex multi-threaded service library with
> best-in-class performance for the workload.
> >
> > To put some discussions/solutions on the table:
> > - Shared Maintainership of a component?
> >      Split in functionality and C11 atomics implementation
> >      Obviously there would be collaboration required in such a case.
> > - Maybe shared maintainership is too much?
> >      A gentleman's/woman's agreement of "helping out" with C11 atomics
> debug is enough?
> >
> >
> > Hope my concerns are understandable, and of course input/feedback
> > welcomed! -Harry
> 
> Thanks for raising the issue Harry.
> 
> I think we should have at least two official maintainers for C11 atomics in
> general.
Sure, I can volunteer.

> C11 conversion is a progressive effort being done, and should be merged step
> by step.
Agreed. The changes need to be understood; this is not a search-and-replace effort. The changes will come in stages unless others join the effort.
The concern I have is about new patches that get added. I think we need to stop new patches from using rte_atomic APIs, otherwise we might be making these changes forever.

> If C11 maintainers fail to fix some issues on time, then we can hold the effort.
> Does it make sense?
I am fine with this approach. But, I think we need to have a deadline in mind to complete the work.

> 



* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-20  5:01         ` Honnappa Nagarahalli
@ 2020-03-20 12:20           ` Jerin Jacob
  0 siblings, 0 replies; 219+ messages in thread
From: Jerin Jacob @ 2020-03-20 12:20 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: thomas, Phil Yang, Van Haaren, Harry, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev, david.marchand, jerinj,
	hemant.agrawal, Gavin Hu, Ruifeng Wang, Joyce Kong, nd

On Fri, Mar 20, 2020 at 10:31 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> > Subject: Re: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> >
> > 18/03/2020 15:01, Van Haaren, Harry:
> > > Hi Phil & Honnappa,
> > >
> > > From: Phil Yang <phil.yang@arm.com>
> > > >
> > > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > > These APIs are using the deprecated __sync built-ins and enforce
> > > > full memory barriers on aarch64. However, full barriers are not
> > > > necessary in many use cases. In order to address such use cases, C
> > > > language offers
> > > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > > control by making use of the memory ordering parameter provided by the
> > user.
> > > > Various patches submitted in the past [2] and the patches in this
> > > > series indicate significant performance gains on multiple aarch64
> > > > CPUs and no performance loss on x86.
> > > >
> > > > But the existing rte_atomic API implementations cannot be changed as
> > > > the APIs do not take the memory ordering parameter. The only choice
> > > > available is replacing the usage of the rte_atomic APIs with C11
> > > > atomic APIs. In order to make this change, the following steps are
> > proposed:
> > > >
> > > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > > rte_atomic APIs (a script is added to flag the usages).
> > > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> > >
> > > On [1] above, I feel deprecating DPDK's atomic functions and failing
> > > checkpatch is a bit sudden. Perhaps noting that in a future release
> > > (20.11?) DPDK will move to a
> > > C11 based atomics model is a more gradual step to achieving the goal,
> > > and at that point add a checkpatch warning for additions of rte_atomic*?
> > >
> > > More on [2] in context below.
> > >
> > > The above is my point-of-view, of course I'd like more people from the
> > > DPDK community to provide their input too.
> > >
> > >
> > > > This patchset contains:
> > > > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > > > 2) changes to programmer guide describing writing efficient code for
> > aarch64.
> > > > 3) changes to various libraries to make use of c11 atomic APIs.
> > > >
> > > > We are planning to replicate this idea across all the other
> > > > libraries, drivers, examples, test applications. In the next phase,
> > > > we will add changes to the mbuf, the EAL interrupts and the event timer
> > adapter libraries.
> > >
> > > About ~6/12 patches of this C11 set are targeting the Service Cores
> > > area of DPDK. I have some concerns over increased complexity of C11
> > implementation vs the (already complex) rte_atomic implementation today.
> > > I see other patchsets enabling C11 across other DPDK components, so
> > > maybe we should also discuss C11 enabling in a wider context than just
> > service cores?
> > >
> > > I don't think it fair to expect all developers to be well versed in
> > > C11 atomic semantics like understanding the complex interactions between
> > the various C11 RELEASE, ACQUIRE barriers requires.
> > >
> > > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > > refactor of atomic-implementation, as it could lead to racy bugs that
> > > are likely extremely difficult to track down. (The recent race-on-exit has
> > proven the difficulty in reproducing, and that's with an atomics model I'm
> > quite familiar with).
> > >
> > > Let me be very clear: I don't wish to block a C11 atomic
> > > implementation, but I'd like to discuss how we (DPDK community) can
> > > best port-to and maintain a complex multi-threaded service library with
> > best-in-class performance for the workload.
> > >
> > > To put some discussions/solutions on the table:
> > > - Shared Maintainership of a component?
> > >      Split in functionality and C11 atomics implementation
> > >      Obviously there would be collaboration required in such a case.
> > > - Maybe shared maintainership is too much?
> > >      A gentlemans/womans agreement of "helping out" with C11 atomics
> > debug is enough?
> > >
> > >
> > > Hope my concerns are understandable, and of course input/feedback
> > > welcomed! -Harry
> >
> > Thanks for raising the issue Harry.
> >
> > I think we should have at least two official maintainers for C11 atomics in
> > general.
> Sure, I can volunteer.
>
> > C11 conversion is a progressive effort being done, and should be merged step
> > by step.
> Agree, the changes need to be understood, it is not a search and replace effort. The changes will come-in in stages unless others join the effort.
> The concern I have is about the new patches that get added. I think we need to stop the new patches from using rte_atomic APIs, otherwise we might be making these changes forever.

+1. We must define a time frame otherwise we might be making these
changes forever.

>
> > If C11 maintainers fail to fix some issues on time, then we can hold the effort.
> > Does it make sense?
> I am fine with this approach. But, I think we need to have a deadline in mind to complete the work.
>
> >
>

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-20  4:51       ` Honnappa Nagarahalli
@ 2020-03-20 18:32         ` Honnappa Nagarahalli
  2020-03-27 14:47           ` Van Haaren, Harry
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-20 18:32 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, Carrillo, Erik G, nd, Honnappa Nagarahalli, nd

+ Erik, as there are similar changes to the timer library

<snip>

> 
> > > Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> > >
> > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > These APIs are using the deprecated __sync built-ins and enforce
> > > full memory barriers on aarch64. However, full barriers are not
> > > necessary in many use cases. In order to address such use cases, C
> > > language offers
> > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > control by making use of the memory ordering parameter provided by
> > > the
> > user.
> > > Various patches submitted in the past [2] and the patches in this
> > > series indicate significant performance gains on multiple aarch64
> > > CPUs and no performance loss on x86.
> > >
> > > But the existing rte_atomic API implementations cannot be changed as
> > > the APIs do not take the memory ordering parameter. The only choice
> > > available is replacing the usage of the rte_atomic APIs with C11
> > > atomic APIs. In order to make this change, the following steps are
> proposed:
> > >
> > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > rte_atomic APIs (a script is added to flag the usages).
> > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> >
> > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > checkpatch is a bit sudden. Perhaps noting that in a future release
> > (20.11?) DPDK will move to a
> > C11 based atomics model is a more gradual step to achieving the goal,
> > and at that point add a checkpatch warning for additions of rte_atomic*?
> We have been working on changing existing usages of rte_atomic APIs in DPDK
> to use C11 atomics. Usually, the x.11 releases have significant amount of
> changes (not sure how many would use rte_atomic APIs). I would prefer that
> in 20.11 no additional code is added using rte_atomics APIs. However, I am
> open to suggestions on the exact time frame.
> Once we decide on the release, I think it makes sense to add a 'warning' in the
> checkpatch to indicate the deprecation timeline and add an 'error' after the
> release.
> 
> >
> > More on [2] in context below.
> >
> > The above is my point-of-view, of course I'd like more people from the
> > DPDK community to provide their input too.
> >
> >
> > > This patchset contains:
> > > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > > 2) changes to programmer guide describing writing efficient code for
> > aarch64.
> > > 3) changes to various libraries to make use of c11 atomic APIs.
> > >
> > > We are planning to replicate this idea across all the other
> > > libraries, drivers, examples, test applications. In the next phase,
> > > we will add changes to the mbuf, the EAL interrupts and the event
> > > timer adapter
> > libraries.
> >
> > About ~6/12 patches of this C11 set are targeting the Service Cores
> > area of DPDK. I have some concerns over increased complexity of C11
> > implementation vs the (already complex) rte_atomic implementation today.
> I agree that it C11 changes are complex, especially if one is starting out to
> understand what these APIs provide. From my experience, once few
> underlying concepts are understood, reviewing or making changes do not take
> too much time.
> 
> > I see other patchsets enabling C11 across other DPDK components, so
> > maybe we should also discuss C11 enabling in a wider context that just
> service cores?
> Yes, agree. We are in the process of making changes to other areas as well.
> 
> >
> > I don't think it fair to expect all developers to be well versed in
> > C11 atomic semantics like understanding the complex interactions
> > between the various
> > C11 RELEASE, AQUIRE barriers requires.
> C11 has been around from sometime now. To my surprise, OVS already uses
> C11 APIs extensively. VPP has been accepting C11 related changes from past
> couple of years. Having said that, I agree in general that not everyone is well
> versed.
> 
> >
> > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > refactor
> Right now, the patches are split into multiple commits. If required I can host a
> call to go over simple C11 API usages (sufficient to cover the usage in service
> core) and the changes in this patch. If you find that particular areas need more
> understanding I can work on providing additional information such as memory
> order ladder diagrams. Please let me know what you think.
When I started working with C11 APIs, I referred to the following blog posts:
https://preshing.com/20120913/acquire-and-release-semantics/
https://preshing.com/20130702/the-happens-before-relation/
https://preshing.com/20130823/the-synchronizes-with-relation/

These will be helpful to understand the changes.
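
As a rough illustration of the acquire/release pairing those posts describe,
a minimal sketch with the GCC/Clang __atomic builtins could look like the
following. The struct and function names are made up for illustration and
are not taken from the patches in this series.

```c
#include <stdint.h>

/* Hypothetical shared state, not taken from the patches in this series. */
struct mailbox {
	uint64_t payload;	/* plain data, written before the flag */
	uint32_t ready;		/* guard variable: 0 = empty, 1 = payload valid */
};

/* Writer: store the payload, then set the flag with RELEASE so that the
 * payload store cannot be reordered after the flag store. */
static inline void
mailbox_publish(struct mailbox *mb, uint64_t value)
{
	mb->payload = value;
	__atomic_store_n(&mb->ready, 1, __ATOMIC_RELEASE);
}

/* Reader: load the flag with ACQUIRE. If the flag is set, the payload
 * written before the matching RELEASE store is visible to this thread.
 * Returns 1 and fills *value on success, 0 if nothing was published. */
static inline int
mailbox_try_consume(struct mailbox *mb, uint64_t *value)
{
	if (__atomic_load_n(&mb->ready, __ATOMIC_ACQUIRE) == 0)
		return 0;
	*value = mb->payload;
	return 1;
}
```

The RELEASE store and the ACQUIRE load on the same variable form the
"synchronizes-with" edge; the payload itself needs no atomic access.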

> 
> > of atomic-implementation, as it could lead to racey bugs that are
> > likely extremely difficult to track down. (The recent race-on-exit has
> > proven the difficulty in reproducing, and that's with an atomics model
> > I'm quite familiar with).
> >
> > Let me be very clear: I don't wish to block a C11 atomic
> > implementation, but I'd like to discuss how we (DPDK community) can
> > best port-to and maintain a complex multi-threaded service library
> > with best-in-class performance for the workload.
> >
> > To put some discussions/solutions on the table:
> > - Shared Maintainership of a component?
> >      Split in functionality and C11 atomics implementation
> >      Obviously there would be collaboration required in such a case.
> > - Maybe shared maintainership is too much?
> >      A gentlemans/womans agreement of "helping out" with C11 atomics
> > debug is enough?
> I think shared maintainer ship could be too much as there are many changes.
> But, I and other engineers from Arm (I would include Arm ecosystem as well)
> can definitely help out on debug and reviews involving C11 APIs. (I see
> Thomas's reply on this topic).
> 
> >
> >
> > Hope my concerns are understandable, and of course input/feedback
> > welcomed! -Harry
> No worries at all. We are here to help and make this as easy as possible.
> 
> >
> >
> > PS: Apologies for the delay in reply - was OOO on Irish national holiday.
> Same here, was on vacation for 3 days.
> 
> >
> >
> > > v3:
> > > add libatomic dependency for 32-bit clang
> > >
> > > v2:
> > > 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> > > 2. fix typos.
> > >
> > > Honnappa Nagarahalli (2):
> > >   service: avoid race condition for MT unsafe service
> > >   service: identify service running on another core correctly
> > >
> > > Phil Yang (10):
> > >   doc: add generic atomic deprecation section
> > >   devtools: prevent use of rte atomic APIs in future patches
> > >   eal/build: add libatomic dependency for 32-bit clang
> > >   build: remove redundant code
> > >   vhost: optimize broadcast rarp sync with c11 atomic
> > >   ipsec: optimize with c11 atomic for sa outbound sqn update
> > >   service: remove rte prefix from static functions
> > >   service: remove redundant code
> > >   service: optimize with c11 one-way barrier
> > >   service: relax barriers with C11 atomic operations
> > >
> > >  devtools/checkpatches.sh                         |   9 ++
> > >  doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
> > >  drivers/event/octeontx/meson.build               |   5 -
> > >  drivers/event/octeontx2/meson.build              |   5 -
> > >  drivers/event/opdl/meson.build                   |   5 -
> > >  lib/librte_eal/common/rte_service.c              | 175 ++++++++++++----------
> > > -
> > >  lib/librte_eal/meson.build                       |   6 +
> > >  lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
> > >  lib/librte_ipsec/sa.h                            |   2 +-
> > >  lib/librte_rcu/meson.build                       |   5 -
> > >  lib/librte_vhost/vhost.h                         |   2 +-
> > >  lib/librte_vhost/vhost_user.c                    |   7 +-
> > >  lib/librte_vhost/virtio_net.c                    |  16 ++-
> > >  13 files changed, 181 insertions(+), 119 deletions(-)
> > >
> > > --
> > > 2.7.4
> 


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
@ 2020-03-23 18:48       ` Ananyev, Konstantin
  2020-03-23 19:07         ` Honnappa Nagarahalli
  2020-04-23 17:16       ` [dpdk-dev] [PATCH v2] " Phil Yang
  1 sibling, 1 reply; 219+ messages in thread
From: Ananyev, Konstantin @ 2020-03-23 18:48 UTC (permalink / raw)
  To: Phil Yang, thomas, Van Haaren, Harry, stephen, maxime.coquelin,
	dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Hi Phil,

> 
> For SA outbound packets, rte_atomic64_add_return is used to generate
> SQN atomically. This introduced an unnecessary full barrier by calling
> the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> patch optimized it with c11 atomic and eliminated the expensive barrier
> for aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---
>  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
>  lib/librte_ipsec/sa.h        | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
> index 0c2f76a..e884af7 100644
> --- a/lib/librte_ipsec/ipsec_sqn.h
> +++ b/lib/librte_ipsec/ipsec_sqn.h
> @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
> 
>  	n = *num;
>  	if (SQN_ATOMIC(sa))
> -		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
> +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> +			__ATOMIC_RELAXED);

One generic thing to note:
clang for i686 will in some cases generate a proper function call for
64-bit __atomic builtins (gcc seems to always generate cmpxchg8b for such cases).
Does anyone consider it a potential problem?
It is probably not a big deal, but I would like to know the broader opinion.

>  	else {
>  		sqn = sa->sqn.outb.raw + n;
>  		sa->sqn.outb.raw = sqn;
> diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
> index d22451b..cab9a2e 100644
> --- a/lib/librte_ipsec/sa.h
> +++ b/lib/librte_ipsec/sa.h
> @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
>  	 */
>  	union {
>  		union {
> -			rte_atomic64_t atom;
> +			uint64_t atom;
>  			uint64_t raw;
>  		} outb;

If we don't need rte_atomic64 here anymore,
then I think we can collapse the union to just:
uint64_t outb; 
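
To make the suggestion concrete, here is a minimal sketch of how the update
path could look with a single field. struct sa_sketch, its sqn_atomic flag
and the function name are hypothetical stand-ins, not the real rte_ipsec_sa
layout or the real esn_outb_update_sqn().

```c
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the outbound part of
 * struct rte_ipsec_sa; only the SQN field is kept. */
struct sa_sketch {
	uint64_t outb;	/* replaces the former { atom, raw } union */
	int sqn_atomic;	/* stand-in for the SQN_ATOMIC(sa) check */
};

/* Reserve n sequence numbers and return the last one reserved. */
static inline uint64_t
outb_update_sqn_sketch(struct sa_sketch *sa, uint32_t n)
{
	uint64_t sqn;

	if (sa->sqn_atomic) {
		/* multi-writer path: atomic add, no ordering required */
		sqn = __atomic_add_fetch(&sa->outb, n, __ATOMIC_RELAXED);
	} else {
		/* single-writer path: plain read-modify-write */
		sqn = sa->outb + n;
		sa->outb = sqn;
	}
	return sqn;
}
```

Both paths then operate on the same uint64_t, so the inner union is no
longer needed.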

>  		struct {
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 18:48       ` Ananyev, Konstantin
@ 2020-03-23 19:07         ` Honnappa Nagarahalli
  2020-03-23 19:18           ` Ananyev, Konstantin
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-23 19:07 UTC (permalink / raw)
  To: Ananyev, Konstantin, Phil Yang, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>

> Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound
> sqn update
> 
> Hi Phil,
> 
> >
> > For SA outbound packets, rte_atomic64_add_return is used to generate
> > SQN atomically. This introduced an unnecessary full barrier by calling
> > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > patch optimized it with c11 atomic and eliminated the expensive
> > barrier for aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > ---
> >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> >  lib/librte_ipsec/sa.h        | 2 +-
> >  2 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > --- a/lib/librte_ipsec/ipsec_sqn.h
> > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > uint32_t *num)
> >
> >  	n = *num;
> >  	if (SQN_ATOMIC(sa))
> > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> >sqn.outb.atom, n);
> > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > +			__ATOMIC_RELAXED);
> 
> One generic thing to note:
> clang for i686 in some cases will generate a proper function call for 64-bit
> __atomic builtins (gcc seems to always generate cmpxchng8b for such cases).
> Does anyone consider it as a potential problem?
> It probably not a big deal, but would like to know broader opinion.
I looked at this some time back for GCC. The function call is generated only if the underlying platform does not support the atomic instructions for the operand size. Otherwise, gcc generates the instructions directly.
I would expect the behavior to be the same for clang.

> 
> >  	else {
> >  		sqn = sa->sqn.outb.raw + n;
> >  		sa->sqn.outb.raw = sqn;
> > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > d22451b..cab9a2e 100644
> > --- a/lib/librte_ipsec/sa.h
> > +++ b/lib/librte_ipsec/sa.h
> > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> >  	 */
> >  	union {
> >  		union {
> > -			rte_atomic64_t atom;
> > +			uint64_t atom;
> >  			uint64_t raw;
> >  		} outb;
> 
> If we don't need rte_atomic64 here anymore, then I think we can collapse the
> union to just:
> uint64_t outb;
> 
> >  		struct {
> > --
> > 2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 19:07         ` Honnappa Nagarahalli
@ 2020-03-23 19:18           ` Ananyev, Konstantin
  2020-03-23 20:20             ` Honnappa Nagarahalli
  2020-03-24 10:37             ` Phil Yang
  0 siblings, 2 replies; 219+ messages in thread
From: Ananyev, Konstantin @ 2020-03-23 19:18 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd



> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Monday, March 23, 2020 7:08 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Van Haaren, Harry
> <harry.van.haaren@intel.com>; stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
> 
> <snip>
> 
> > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound
> > sqn update
> >
> > Hi Phil,
> >
> > >
> > > For SA outbound packets, rte_atomic64_add_return is used to generate
> > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > > patch optimized it with c11 atomic and eliminated the expensive
> > > barrier for aarch64.
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > ---
> > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > >  lib/librte_ipsec/sa.h        | 2 +-
> > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > uint32_t *num)
> > >
> > >  	n = *num;
> > >  	if (SQN_ATOMIC(sa))
> > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > >sqn.outb.atom, n);
> > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > +			__ATOMIC_RELAXED);
> >
> > One generic thing to note:
> > clang for i686 in some cases will generate a proper function call for 64-bit
> > __atomic builtins (gcc seems to always generate cmpxchng8b for such cases).
> > Does anyone consider it as a potential problem?
> > It probably not a big deal, but would like to know broader opinion.
> I had looked at this some time back for GCC. The function call is generated only if the underlying platform does not support the atomic
> instructions for the operand size. Otherwise, gcc generates the instructions directly.
> I would think the behavior would be the same for clang.

From what I see, not really.
As an example:

$ cat tatm11.c
#include <stdint.h>

struct x {
        uint64_t v __attribute__((aligned(8)));
};

uint64_t
ffxadd1(struct x *x, uint32_t n, uint32_t m)
{
        return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
}

uint64_t
ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
{
        return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
}

gcc for i686 will generate code with cmpxchg8b for both cases.
clang will generate cmpxchg8b for ffxadd1(), where the data is explicitly 8B aligned,
but will emit a function call for ffxadd11().
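
If the explicit alignment is what makes the difference for clang, one way to
carry it through a plain pointer might be an aligned typedef. This is only a
sketch: the typedef and function names are made up, and whether clang then
inlines the operation on i686 would have to be confirmed by looking at the
generated assembly.

```c
#include <stdint.h>

/* Hypothetical typedef: a 64-bit integer whose type carries an explicit
 * 8-byte alignment, so a pointer to it also tells the compiler about it. */
typedef uint64_t u64_aligned __attribute__((aligned(8)));

/* Same operation as ffxadd11(), but through a pointer to the aligned type. */
uint64_t
ffxadd_aligned(u64_aligned *v, uint32_t n)
{
	return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
}
```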

> 
> >
> > >  	else {
> > >  		sqn = sa->sqn.outb.raw + n;
> > >  		sa->sqn.outb.raw = sqn;
> > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > d22451b..cab9a2e 100644
> > > --- a/lib/librte_ipsec/sa.h
> > > +++ b/lib/librte_ipsec/sa.h
> > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > >  	 */
> > >  	union {
> > >  		union {
> > > -			rte_atomic64_t atom;
> > > +			uint64_t atom;
> > >  			uint64_t raw;
> > >  		} outb;
> >
> > If we don't need rte_atomic64 here anymore, then I think we can collapse the
> > union to just:
> > uint64_t outb;
> >
> > >  		struct {
> > > --
> > > 2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 19:18           ` Ananyev, Konstantin
@ 2020-03-23 20:20             ` Honnappa Nagarahalli
  2020-03-24 13:10               ` Ananyev, Konstantin
  2020-03-24 10:37             ` Phil Yang
  1 sibling, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-23 20:20 UTC (permalink / raw)
  To: Ananyev, Konstantin, Phil Yang, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>

> > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> > > outbound sqn update
> > >
> > > Hi Phil,
> > >
> > > >
> > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > > generate SQN atomically. This introduced an unnecessary full
> > > > barrier by calling the '__sync' builtin implemented rte_atomic_XX
> > > > API on aarch64. This patch optimized it with c11 atomic and
> > > > eliminated the expensive barrier for aarch64.
> > > >
> > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > ---
> > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > uint32_t *num)
> > > >
> > > >  	n = *num;
> > > >  	if (SQN_ATOMIC(sa))
> > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > >sqn.outb.atom, n);
> > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > +			__ATOMIC_RELAXED);
> > >
> > > One generic thing to note:
> > > clang for i686 in some cases will generate a proper function call
> > > for 64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for
> such cases).
> > > Does anyone consider it as a potential problem?
> > > It probably not a big deal, but would like to know broader opinion.
> > I had looked at this some time back for GCC. The function call is
> > generated only if the underlying platform does not support the atomic
> instructions for the operand size. Otherwise, gcc generates the instructions
> directly.
> > I would think the behavior would be the same for clang.
> 
> From what I see not really.
> As an example:
> 
> $ cat tatm11.c
> #include <stdint.h>
> 
> struct x {
>         uint64_t v __attribute__((aligned(8))); };
> 
> uint64_t
> ffxadd1(struct x *x, uint32_t n, uint32_t m) {
>         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED); }
> 
> uint64_t
> ffxadd11(uint64_t *v, uint32_t n, uint32_t m) {
>         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED); }
> 
> gcc for i686 will generate code with cmpxchng8b for both cases.
> clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> aligned, but will emit a function call for ffxadd11().
Does it require libatomic to be linked in this case? The Clang documentation calls out the unaligned case as one where it generates the function call [1].
On aarch64, the atomic instructions need the address to be aligned.

[1] https://clang.llvm.org/docs/Toolchain.html#atomics-library

> 
> >
> > >
> > > >  	else {
> > > >  		sqn = sa->sqn.outb.raw + n;
> > > >  		sa->sqn.outb.raw = sqn;
> > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > d22451b..cab9a2e 100644
> > > > --- a/lib/librte_ipsec/sa.h
> > > > +++ b/lib/librte_ipsec/sa.h
> > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > >  	 */
> > > >  	union {
> > > >  		union {
> > > > -			rte_atomic64_t atom;
> > > > +			uint64_t atom;
> > > >  			uint64_t raw;
> > > >  		} outb;
> > >
> > > If we don't need rte_atomic64 here anymore, then I think we can
> > > collapse the union to just:
> > > uint64_t outb;
> > >
> > > >  		struct {
> > > > --
> > > > 2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 19:18           ` Ananyev, Konstantin
  2020-03-23 20:20             ` Honnappa Nagarahalli
@ 2020-03-24 10:37             ` Phil Yang
  2020-03-24 11:03               ` Ananyev, Konstantin
  1 sibling, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-03-24 10:37 UTC (permalink / raw)
  To: Ananyev, Konstantin, Honnappa Nagarahalli, thomas, Van Haaren,
	Harry, stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd, nd

Hi Konstantin,

<snip>
> >
> > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> outbound
> > > sqn update
> > >
> > > Hi Phil,
> > >
> > > >
> > > > For SA outbound packets, rte_atomic64_add_return is used to
> generate
> > > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > > > patch optimized it with c11 atomic and eliminated the expensive
> > > > barrier for aarch64.
> > > >
> > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > ---
> > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > uint32_t *num)
> > > >
> > > >  	n = *num;
> > > >  	if (SQN_ATOMIC(sa))
> > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > >sqn.outb.atom, n);
> > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > +			__ATOMIC_RELAXED);
> > >
> > > One generic thing to note:
> > > clang for i686 in some cases will generate a proper function call for 64-bit
> > > __atomic builtins (gcc seems to always generate cmpxchng8b for such
> cases).
> > > Does anyone consider it as a potential problem?
> > > It probably not a big deal, but would like to know broader opinion.
> > I had looked at this some time back for GCC. The function call is generated
> only if the underlying platform does not support the atomic
> > instructions for the operand size. Otherwise, gcc generates the instructions
> directly.
> > I would think the behavior would be the same for clang.
> 
> From what I see not really.
> As an example:
> 
> $ cat tatm11.c
> #include <stdint.h>
> 
> struct x {
>         uint64_t v __attribute__((aligned(8)));
> };
> 
> uint64_t
> ffxadd1(struct x *x, uint32_t n, uint32_t m)
> {
>         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> }
> 
> uint64_t
> ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> {
>         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> }
> 
> gcc for i686 will generate code with cmpxchng8b for both cases.
> clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> aligned,
> but will emit a function call for ffxadd11().

I guess your testbed is an i386 platform.  However, what I see here is different.

Testbed i686:  Ubuntu 18.04.4 LTS/GCC 8.3/ Clang 9.0.0-2
Both Clang and GCC for i686 generate code with xadd for these two cases.

Testbed i386:  Ubuntu 16.04 LTS (Installed libatomic)/GCC 5.4.0/ Clang 4.0.0
GCC will generate code with cmpxchg8b for both cases.
Clang emits a function call for both cases.

Thanks,
Phil
> 
> >
> > >
> > > >  	else {
> > > >  		sqn = sa->sqn.outb.raw + n;
> > > >  		sa->sqn.outb.raw = sqn;
> > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > d22451b..cab9a2e 100644
> > > > --- a/lib/librte_ipsec/sa.h
> > > > +++ b/lib/librte_ipsec/sa.h
> > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > >  	 */
> > > >  	union {
> > > >  		union {
> > > > -			rte_atomic64_t atom;
> > > > +			uint64_t atom;
> > > >  			uint64_t raw;
> > > >  		} outb;
> > >
> > > If we don't need rte_atomic64 here anymore, then I think we can
> collapse the
> > > union to just:
> > > uint64_t outb;
> > >
> > > >  		struct {
> > > > --
> > > > 2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-24 10:37             ` Phil Yang
@ 2020-03-24 11:03               ` Ananyev, Konstantin
  2020-03-25  9:38                 ` Phil Yang
  0 siblings, 1 reply; 219+ messages in thread
From: Ananyev, Konstantin @ 2020-03-24 11:03 UTC (permalink / raw)
  To: Phil Yang, Honnappa Nagarahalli, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd, nd


Hi Phil,

> <snip>
> > >
> > > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> > outbound
> > > > sqn update
> > > >
> > > > Hi Phil,
> > > >
> > > > >
> > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > generate
> > > > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > > > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > > > > patch optimized it with c11 atomic and eliminated the expensive
> > > > > barrier for aarch64.
> > > > >
> > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > ---
> > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > > uint32_t *num)
> > > > >
> > > > >  	n = *num;
> > > > >  	if (SQN_ATOMIC(sa))
> > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > > >sqn.outb.atom, n);
> > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > +			__ATOMIC_RELAXED);
> > > >
> > > > One generic thing to note:
> > > > clang for i686 in some cases will generate a proper function call for 64-bit
> > > > __atomic builtins (gcc seems to always generate cmpxchng8b for such
> > cases).
> > > > Does anyone consider it as a potential problem?
> > > > It probably not a big deal, but would like to know broader opinion.
> > > I had looked at this some time back for GCC. The function call is generated
> > only if the underlying platform does not support the atomic
> > > instructions for the operand size. Otherwise, gcc generates the instructions
> > directly.
> > > I would think the behavior would be the same for clang.
> >
> > From what I see not really.
> > As an example:
> >
> > $ cat tatm11.c
> > #include <stdint.h>
> >
> > struct x {
> >         uint64_t v __attribute__((aligned(8)));
> > };
> >
> > uint64_t
> > ffxadd1(struct x *x, uint32_t n, uint32_t m)
> > {
> >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> > }
> >
> > uint64_t
> > ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> > {
> >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> > }
> >
> > gcc for i686 will generate code with cmpxchng8b for both cases.
> > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > aligned,
> > but will emit a function call for ffxadd11().
> 
> I guess your testbed is an i386 platform.  However, what I see here is different.
> 
> Testbed i686:  Ubuntu 18.04.4 LTS/GCC 8.3/ Clang 9.0.0-2
> Both Clang and GCC for i686 generate code with xadd for these two cases.

I suppose you meant x86_64 here (-m64), right?


> 
> Testbed i386:  Ubuntu 16.04 LTS (Installed libatomic)/GCC 5.4.0/ Clang 4.0.0
> GCC will generate code with cmpxchng8b for both cases.
> Clang generated code emits a function call for both cases.

That's exactly what I am talking about above.
x86_64 (64-bit binary) - no function calls with either gcc or clang
i686 (32-bit binary) - no function calls with gcc, function calls with clang
when explicit alignment is not specified.
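
One way to state the alignment explicitly, without wrapping the counter in a struct, is a typedef that carries the attribute. This is only a minimal sketch: `aligned_u64` and `sqn_add_relaxed` are hypothetical names, not DPDK types or functions, and whether clang then inlines the operation on i686 should be checked in the generated code for the compiler version in use.

```c
#include <stdint.h>

/* Hypothetical type name (not a DPDK type): a 64-bit counter whose
 * 8-byte alignment is stated explicitly, which matters on 32-bit
 * targets where a plain uint64_t may only be 4-byte aligned. */
typedef uint64_t aligned_u64 __attribute__((aligned(8)));

/* Relaxed atomic add that returns the new value, same builtin as in
 * the ffxadd1()/ffxadd11() example from this thread. */
static inline uint64_t
sqn_add_relaxed(aligned_u64 *v, uint32_t n)
{
	return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
}
```

A caller would declare the counter as `aligned_u64` and pass its address, so the alignment information travels with the pointer type.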

As I said in my initial email, that's probably not a big deal -
from what I was told so far we don't officially support clang for IA-32
and I don't know whether anyone uses it at all right now.
Though if someone thinks it is a potential problem here -
it is better to flag it at an early stage.
So once again my questions to the community:
1/ Does anyone build/use DPDK with i686-clang?
2/ If anyone does, could they try to evaluate
how big a perf drop it would cause for them?
3/ Is there an option to switch to i686-gcc (the supported one)?
Konstantin

> >
> > >
> > > >
> > > > >  	else {
> > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > >  		sa->sqn.outb.raw = sqn;
> > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > d22451b..cab9a2e 100644
> > > > > --- a/lib/librte_ipsec/sa.h
> > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > >  	 */
> > > > >  	union {
> > > > >  		union {
> > > > > -			rte_atomic64_t atom;
> > > > > +			uint64_t atom;
> > > > >  			uint64_t raw;
> > > > >  		} outb;
> > > >
> > > > If we don't need rte_atomic64 here anymore, then I think we can
> > collapse the
> > > > union to just:
> > > > uint64_t outb;
> > > >
> > > > >  		struct {
> > > > > --
> > > > > 2.7.4



* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 20:20             ` Honnappa Nagarahalli
@ 2020-03-24 13:10               ` Ananyev, Konstantin
  2020-03-24 13:21                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 219+ messages in thread
From: Ananyev, Konstantin @ 2020-03-24 13:10 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd


> > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > > > generate SQN atomically. This introduced an unnecessary full
> > > > > barrier by calling the '__sync' builtin implemented rte_atomic_XX
> > > > > API on aarch64. This patch optimized it with c11 atomic and
> > > > > eliminated the expensive barrier for aarch64.
> > > > >
> > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > ---
> > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > > uint32_t *num)
> > > > >
> > > > >  	n = *num;
> > > > >  	if (SQN_ATOMIC(sa))
> > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
> > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > +			__ATOMIC_RELAXED);
> > > >
> > > > One generic thing to note:
> > > > clang for i686 in some cases will generate a proper function call
> > > > for 64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for
> > such cases).
> > > > Does anyone consider it as a potential problem?
> > > > It probably not a big deal, but would like to know broader opinion.
> > > I had looked at this some time back for GCC. The function call is
> > > generated only if the underlying platform does not support the atomic
> > instructions for the operand size. Otherwise, gcc generates the instructions
> > directly.
> > > I would think the behavior would be the same for clang.
> >
> > From what I see not really.
> > As an example:
> >
> > $ cat tatm11.c
> > #include <stdint.h>
> >
> > struct x {
> >         uint64_t v __attribute__((aligned(8)));
> > };
> >
> > uint64_t
> > ffxadd1(struct x *x, uint32_t n, uint32_t m)
> > {
> >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> > }
> >
> > uint64_t
> > ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> > {
> >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> > }
> >
> > gcc for i686 will generate code with cmpxchng8b for both cases.
> > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > aligned, but will emit a function call for ffxadd11().
> Does it require libatomic to be linked in this case? 

Yes, it does.
In fact, it is the same story even with current dpdk.org master.
To build i686-native-linuxapp-clang successfully, I have to
explicitly add EXTRA_LDFLAGS="-latomic".
To be more specific:
$ for i in i686-native-linuxapp-clang/lib/*.a; do x=`nm $i | grep __atomic_`; if [[ -n "${x}" ]]; then echo $i; echo $x; fi; done
i686-native-linuxapp-clang/lib/librte_distributor.a
U __atomic_load_8 U __atomic_store_8
i686-native-linuxapp-clang/lib/librte_pmd_opdl_event.a
U __atomic_load_8 U __atomic_store_8
i686-native-linuxapp-clang/lib/librte_rcu.a
U __atomic_compare_exchange_8 U __atomic_load_8

As there have been no complaints so far, it makes me think that
probably no one is using clang for IA-32 builds.

> Clang documentation calls out unaligned case where it would generate the function call
> [1].

Seems so, and it treats uint64_t as 4B aligned for IA.
 
> On aarch64, the atomic instructions need the address to be aligned.

For that particular case (cmpxchg8b) there are no such restrictions for IA-32.
Again, as I said before, gcc manages to emit code without function calls
for exactly the same source.
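
To see what a given compiler assumes on a given target, a small probe along these lines can help. It is a sketch only: the helper names are hypothetical, and the values returned depend on the compiler, its version and the target flags (for example -m32), so none are listed here.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Minimum alignment the compiler assumes for a plain uint64_t. */
static inline unsigned int
u64_min_align(void)
{
	return (unsigned int)_Alignof(uint64_t);
}

/* 1 if the compiler guarantees lock-free 8-byte atomics for typically
 * aligned objects on this target, 0 otherwise. The second argument 0
 * means "no specific object, assume typical alignment". */
static inline int
u64_atomics_always_lock_free(void)
{
	return __atomic_always_lock_free(sizeof(uint64_t), 0);
}
```

Printing both values from a 32-bit and a 64-bit build of the same file gives a quick comparison between gcc and clang without reading the disassembly.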

> 
> [1] https://clang.llvm.org/docs/Toolchain.html#atomics-library
> 
> >
> > >
> > > >
> > > > >  	else {
> > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > >  		sa->sqn.outb.raw = sqn;
> > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > d22451b..cab9a2e 100644
> > > > > --- a/lib/librte_ipsec/sa.h
> > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > >  	 */
> > > > >  	union {
> > > > >  		union {
> > > > > -			rte_atomic64_t atom;
> > > > > +			uint64_t atom;
> > > > >  			uint64_t raw;
> > > > >  		} outb;
> > > >
> > > > If we don't need rte_atomic64 here anymore, then I think we can
> > > > collapse the union to just:
> > > > uint64_t outb;
> > > >
> > > > >  		struct {
> > > > > --
> > > > > 2.7.4



* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-24 13:10               ` Ananyev, Konstantin
@ 2020-03-24 13:21                 ` Ananyev, Konstantin
  0 siblings, 0 replies; 219+ messages in thread
From: Ananyev, Konstantin @ 2020-03-24 13:21 UTC (permalink / raw)
  To: Ananyev, Konstantin, Honnappa Nagarahalli, Phil Yang, thomas,
	Van Haaren, Harry, stephen, maxime.coquelin, dev, Richardson,
	Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Ananyev, Konstantin
> Sent: Tuesday, March 24, 2020 1:10 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Van Haaren,
> Harry <harry.van.haaren@intel.com>; stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
> 
> 
> > > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > > > > generate SQN atomically. This introduced an unnecessary full
> > > > > > barrier by calling the '__sync' builtin implemented rte_atomic_XX
> > > > > > API on aarch64. This patch optimized it with c11 atomic and
> > > > > > eliminated the expensive barrier for aarch64.
> > > > > >
> > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > ---
> > > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > > > uint32_t *num)
> > > > > >
> > > > > >  	n = *num;
> > > > > >  	if (SQN_ATOMIC(sa))
> > > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
> > > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > > +			__ATOMIC_RELAXED);
> > > > >
> > > > > One generic thing to note:
> > > > > clang for i686 in some cases will generate a proper function call
> > > > > for 64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for
> > > such cases).
> > > > > Does anyone consider it as a potential problem?
> > > > > It probably not a big deal, but would like to know broader opinion.
> > > > I had looked at this some time back for GCC. The function call is
> > > > generated only if the underlying platform does not support the atomic
> > > instructions for the operand size. Otherwise, gcc generates the instructions
> > > directly.
> > > > I would think the behavior would be the same for clang.
> > >
> > > From what I see not really.
> > > As an example:
> > >
> > > $ cat tatm11.c
> > > #include <stdint.h>
> > >
> > > struct x {
> > >         uint64_t v __attribute__((aligned(8)));
> > > };
> > >
> > > uint64_t
> > > ffxadd1(struct x *x, uint32_t n, uint32_t m)
> > > {
> > >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> > > }
> > >
> > > uint64_t
> > > ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> > > {
> > >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> > > }
> > >
> > > gcc for i686 will generate code with cmpxchng8b for both cases.
> > > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > > aligned, but will emit a function call for ffxadd11().
> > Does it require libatomic to be linked in this case?
> 
> Yes, it does.
> In fact same story even with current dpdk.org master.
> To make i686-native-linuxapp-clang successfully, I have to
> explicitly add EXTRA_LDFLAGS="-latomic".
> To be more specific:
> $ for i in i686-native-linuxapp-clang/lib/*.a; do x=`nm $i | grep __atomic_`; if [[ -n "${x}" ]]; then echo $i; echo $x; fi; done
> i686-native-linuxapp-clang/lib/librte_distributor.a
> U __atomic_load_8 U __atomic_store_8
> i686-native-linuxapp-clang/lib/librte_pmd_opdl_event.a
> U __atomic_load_8 U __atomic_store_8
> i686-native-linuxapp-clang/lib/librte_rcu.a
> U __atomic_compare_exchange_8 U __atomic_load_8
> 
> As there were no complains so far, it makes me think that
> probably no-one using clang for IA-32 builds.
> 
> > Clang documentation calls out unaligned case where it would generate the function call
> > [1].
> 
> Seems so, and it treats uint64_t as 4B aligned for IA.
correction: for IA-32

> 
> > On aarch64, the atomic instructions need the address to be aligned.
> 
> For that particular case (cmpxchg8b) there are no such restrictions for IA-32.
> Again, as I said before, gcc manages to emit code without function calls
> for exactly the same source.
> 
> >
> > [1] https://clang.llvm.org/docs/Toolchain.html#atomics-library
> >
> > >
> > > >
> > > > >
> > > > > >  	else {
> > > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > > >  		sa->sqn.outb.raw = sqn;
> > > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > > d22451b..cab9a2e 100644
> > > > > > --- a/lib/librte_ipsec/sa.h
> > > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > > >  	 */
> > > > > >  	union {
> > > > > >  		union {
> > > > > > -			rte_atomic64_t atom;
> > > > > > +			uint64_t atom;
> > > > > >  			uint64_t raw;
> > > > > >  		} outb;
> > > > >
> > > > > If we don't need rte_atomic64 here anymore, then I think we can
> > > > > collapse the union to just:
> > > > > uint64_t outb;
> > > > >
> > > > > >  		struct {
> > > > > > --
> > > > > > 2.7.4



* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-24 11:03               ` Ananyev, Konstantin
@ 2020-03-25  9:38                 ` Phil Yang
  0 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-03-25  9:38 UTC (permalink / raw)
  To: Ananyev, Konstantin, Honnappa Nagarahalli, thomas, Van Haaren,
	Harry, stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd, nd, nd

> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Tuesday, March 24, 2020 7:04 PM
> To: Phil Yang <Phil.Yang@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; thomas@monjalon.net; Van Haaren,
> Harry <harry.van.haaren@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd
> <nd@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> outbound sqn update
> 
> 
> Hi Phil,
> 
> > <snip>
> > > >
> > > > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> > > outbound
> > > > > sqn update
> > > > >
> > > > > Hi Phil,
> > > > >
> > > > > >
> > > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > generate
> > > > > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > > > > the '__sync' builtin implemented rte_atomic_XX API on aarch64.
> This
> > > > > > patch optimized it with c11 atomic and eliminated the expensive
> > > > > > barrier for aarch64.
> > > > > >
> > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > ---
> > > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa
> *sa,
> > > > > > uint32_t *num)
> > > > > >
> > > > > >  	n = *num;
> > > > > >  	if (SQN_ATOMIC(sa))
> > > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
> > > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > > +			__ATOMIC_RELAXED);
> > > > >
> > > > > One generic thing to note:
> > > > > clang for i686 in some cases will generate a proper function call for 64-
> bit
> > > > > __atomic builtins (gcc seems to always generate cmpxchng8b for such
> > > cases).
> > > > > Does anyone consider it as a potential problem?
> > > > > It probably not a big deal, but would like to know broader opinion.
> > > > I had looked at this some time back for GCC. The function call is
> generated
> > > only if the underlying platform does not support the atomic
> > > > instructions for the operand size. Otherwise, gcc generates the
> instructions
> > > directly.
> > > > I would think the behavior would be the same for clang.
> > >
> > > From what I see not really.
> > > As an example:
> > >
> > > $ cat tatm11.c
> > > #include <stdint.h>
> > >
> > > struct x {
> > >         uint64_t v __attribute__((aligned(8)));
> > > };
> > >
> > > uint64_t
> > > ffxadd1(struct x *x, uint32_t n, uint32_t m)
> > > {
> > >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> > > }
> > >
> > > uint64_t
> > > ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> > > {
> > >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> > > }
> > >
> > > gcc for i686 will generate code with cmpxchng8b for both cases.
> > > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > > aligned,
> > > but will emit a function call for ffxadd11().
> >
> > I guess your testbed is an i386 platform.  However, what I see here is
> different.
> >
> > Testbed i686:  Ubuntu 18.04.4 LTS/GCC 8.3/ Clang 9.0.0-2
> > Both Clang and GCC for i686 generate code with xadd for these two cases.
> 
> I suppose you meant x86_64 here (-m64), right?

Yes. It is x86_64 here.

> 
> 
> >
> > Testbed i386:  Ubuntu 16.04 LTS (Installed libatomic)/GCC 5.4.0/ Clang 4.0.0
> > GCC will generate code with cmpxchng8b for both cases.
> > Clang generated code emits a function call for both cases.
> 
> That's exactly what I am talking about above.
> X86_64 (64 bit binary) - no function calls for both gcc and clang
> i686 (32 bit binary) - no function calls with gcc, functions calls with clang
> when explicit alignment is not specified.
> 
> As I said in my initial email, that's probably not a big deal -
> from what I was told so far we don't officially support clang for IA-32
> and I don't know does anyone uses it at all right now.
> Though if someone thinks it is a potential problem here -
> it is better to flag it at early stage.
> So once again my questions to the community:
> 1/ Does anyone builds/uses DPDK with i686-clang?
> 2/ If there are anyone, can these persons try to evaluate
> how big perf drop it would cause for them?
> 3/ Is there an option to switch to i686-gcc (supported one)?
> Konstantin
> 
> > >
> > > >
> > > > >
> > > > > >  	else {
> > > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > > >  		sa->sqn.outb.raw = sqn;
> > > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > > d22451b..cab9a2e 100644
> > > > > > --- a/lib/librte_ipsec/sa.h
> > > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > > >  	 */
> > > > > >  	union {
> > > > > >  		union {
> > > > > > -			rte_atomic64_t atom;
> > > > > > +			uint64_t atom;
> > > > > >  			uint64_t raw;
> > > > > >  		} outb;
> > > > >
> > > > > If we don't need rte_atomic64 here anymore, then I think we can
> > > collapse the
> > > > > union to just:
> > > > > uint64_t outb;
> > > > >
> > > > > >  		struct {
> > > > > > --
> > > > > > 2.7.4



* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-20 18:32         ` Honnappa Nagarahalli
@ 2020-03-27 14:47           ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-03-27 14:47 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, Carrillo, Erik G, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Friday, March 20, 2020 6:32 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Carrillo, Erik G <erik.g.carrillo@intel.com>; nd
> <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd
> <nd@arm.com>
> Subject: RE: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> 
> + Erik as there are similar changes to timer library
> 
> <snip>
> 
> >
> > > > Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> > > >
> > > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > > These APIs are using the deprecated __sync built-ins and enforce
> > > > full memory barriers on aarch64. However, full barriers are not
> > > > necessary in many use cases. In order to address such use cases, C
> > > > language offers
> > > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > > control by making use of the memory ordering parameter provided by
> > > > the
> > > user.
> > > > Various patches submitted in the past [2] and the patches in this
> > > > series indicate significant performance gains on multiple aarch64
> > > > CPUs and no performance loss on x86.
> > > >
> > > > But the existing rte_atomic API implementations cannot be changed as
> > > > the APIs do not take the memory ordering parameter. The only choice
> > > > available is replacing the usage of the rte_atomic APIs with C11
> > > > atomic APIs. In order to make this change, the following steps are
> > proposed:
> > > >
> > > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > > rte_atomic APIs (a script is added to flag the usages).
> > > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic
> APIs.
> > >
> > > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > > checkpatch is a bit sudden. Perhaps noting that in a future release
> > > (20.11?) DPDK will move to a
> > > C11 based atomics model is a more gradual step to achieving the goal,
> > > and at that point add a checkpatch warning for additions of rte_atomic*?
> > We have been working on changing existing usages of rte_atomic APIs in
> DPDK
> > to use C11 atomics. Usually, the x.11 releases have significant amount of
> > changes (not sure how many would use rte_atomic APIs). I would prefer that
> > in 20.11 no additional code is added using rte_atomics APIs. However, I am
> > open to suggestions on the exact time frame.
> > Once we decide on the release, I think it makes sense to add a 'warning'
> in the
> > checkpatch to indicate the deprecation timeline and add an 'error' after
> the
> > release.

The above sounds reasonable - mainly let's not block any code that exists
or is being developed today using rte_atomic_* APIs from making a release.


> > > More on [2] in context below.
> > >
> > > The above is my point-of-view, of course I'd like more people from the
> > > DPDK community to provide their input too.
> > >
> > >
> > > > This patchset contains:
> > > > 1) the checkpatch script changes to flag rte_atomic API usage in
> patches.
> > > > 2) changes to programmer guide describing writing efficient code for
> > > aarch64.
> > > > 3) changes to various libraries to make use of c11 atomic APIs.
> > > >
> > > > We are planning to replicate this idea across all the other
> > > > libraries, drivers, examples, test applications. In the next phase,
> > > > we will add changes to the mbuf, the EAL interrupts and the event
> > > > timer adapter
> > > libraries.
> > >
> > > About ~6/12 patches of this C11 set are targeting the Service Cores
> > > area of DPDK. I have some concerns over increased complexity of C11
> > > implementation vs the (already complex) rte_atomic implementation today.
> > I agree that the C11 changes are complex, especially if one is starting out
> to
> > understand what these APIs provide. From my experience, once few
> > underlying concepts are understood, reviewing or making changes do not
> take
> > too much time.
> >
> > > I see other patchsets enabling C11 across other DPDK components, so
> > > maybe we should also discuss C11 enabling in a wider context that just
> > service cores?
> > Yes, agree. We are in the process of making changes to other areas as
> well.
> >
> > >
> > > I don't think it fair to expect all developers to be well versed in
> > > C11 atomic semantics like understanding the complex interactions
> > > between the various
> > > C11 RELEASE, ACQUIRE barriers requires.
> > C11 has been around from sometime now. To my surprise, OVS already uses
> > C11 APIs extensively. VPP has been accepting C11 related changes from past
> > couple of years. Having said that, I agree in general that not everyone is
> > well versed.

Fair point - like so many things, once familiar with it, it becomes easy :)


> > > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > > refactor
> > Right now, the patches are split into multiple commits. If required I can
> host a
> > call to go over simple C11 API usages (sufficient to cover the usage in
> service
> > core) and the changes in this patch. If you find that particular areas
> need more
> > understanding I can work on providing additional information such as
> memory
> > order ladder diagrams. Please let me know what you think.

Thanks for the offer - I will need to do my due diligence on review before
taking up any of your or other C11 folks' time.

> When I started working with C11 APIs, I had referred to the following blogs.
> https://preshing.com/20120913/acquire-and-release-semantics/
> https://preshing.com/20130702/the-happens-before-relation/
> https://preshing.com/20130823/the-synchronizes-with-relation/
> 
> These will be helpful to understand the changes.

Thanks, indeed good articles. I found the following slide deck particularly
informative due to the fantastic diagrams (eg, slide 23):
https://mariadb.org/wp-content/uploads/2017/11/2017-11-Memory-barriers.pdf

That said, finding a nice diagram and understanding the implications of
actually using it is different! I hope to properly review the
service-cores patches next week.
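
For reference, the acquire/release pairing that the articles above describe can be reduced to a very small case. This is a minimal sketch with hypothetical names (`struct mailbox`, `mailbox_publish`, `mailbox_try_read`); it is not taken from the service-cores patches.

```c
#include <stdint.h>

/* Hypothetical one-slot mailbox: 'ready' guards 'payload'. */
struct mailbox {
	uint32_t payload;
	uint32_t ready;
};

/* Writer: the RELEASE store to 'ready' keeps the earlier payload
 * write from being reordered after it. */
static inline void
mailbox_publish(struct mailbox *m, uint32_t value)
{
	m->payload = value;
	__atomic_store_n(&m->ready, 1, __ATOMIC_RELEASE);
}

/* Reader: the ACQUIRE load of 'ready' keeps the later payload read
 * from being reordered before it. Returns 1 and fills *out if a
 * value has been published, 0 otherwise. */
static inline int
mailbox_try_read(struct mailbox *m, uint32_t *out)
{
	if (__atomic_load_n(&m->ready, __ATOMIC_ACQUIRE) == 0)
		return 0;
	*out = m->payload;
	return 1;
}
```

A reader that observes `ready == 1` through the acquire load is guaranteed to see the payload written before the matching release store; with relaxed ordering on both sides that guarantee would not hold on weakly ordered CPUs such as aarch64.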

> > > of atomic-implementation, as it could lead to racey bugs that are
> > > likely extremely difficult to track down. (The recent race-on-exit has
> > > proven the difficulty in reproducing, and that's with an atomics model
> > > I'm quite familiar with).
> > >
> > > Let me be very clear: I don't wish to block a C11 atomic
> > > implementation, but I'd like to discuss how we (DPDK community) can
> > > best port-to and maintain a complex multi-threaded service library
> > > with best-in-class performance for the workload.
> > >
> > > To put some discussions/solutions on the table:
> > > - Shared Maintainership of a component?
> > >      Split in functionality and C11 atomics implementation
> > >      Obviously there would be collaboration required in such a case.
> > > - Maybe shared maintainership is too much?
> > >      A gentlemans/womans agreement of "helping out" with C11 atomics
> > > debug is enough?
> > I think shared maintainer ship could be too much as there are many
> changes.
> > But, I and other engineers from Arm (I would include Arm ecosystem as
> well)
> > can definitely help out on debug and reviews involving C11 APIs. (I see
> > Thomas's reply on this topic).

Thanks for the offer - as above, ball on my side, I'll go review.

<snip>


* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (12 preceding siblings ...)
  2020-03-18 14:01     ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Van Haaren, Harry
@ 2020-04-03  7:23     ` Mattias Rönnblom
  2020-05-12  8:03     ` [dpdk-dev] [PATCH v4 0/4] " Phil Yang
  14 siblings, 0 replies; 219+ messages in thread
From: Mattias Rönnblom @ 2020-04-03  7:23 UTC (permalink / raw)
  To: Phil Yang, thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

On 2020-03-17 02:17, Phil Yang wrote:
> DPDK provides generic rte_atomic APIs to do several atomic operations.
> These APIs are using the deprecated __sync built-ins and enforce full
> memory barriers on aarch64. However, full barriers are not necessary
> in many use cases. In order to address such use cases, C language offers
> C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> by making use of the memory ordering parameter provided by the user.
> Various patches submitted in the past [2] and the patches in this series
> indicate significant performance gains on multiple aarch64 CPUs and no
> performance loss on x86.
>
> But the existing rte_atomic API implementations cannot be changed as the
> APIs do not take the memory ordering parameter. The only choice available
> is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> order to make this change, the following steps are proposed:

First of all, I must say I very much support the effort of introducing C11
atomics or something equivalent into DPDK, across the board.


What's being proposed however, is not to use C11 atomics, but rather the 
GCC built-ins designed to allow an efficient C11 atomics implementation. 
The C11 atomic API is found in <stdatomic.h>. Also, the <rte_atomic.h> 
API is not using __sync. It doesn't dictate any particular 
implementation at all.
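
To make the distinction concrete, here is the same relaxed add written both ways. It is a sketch for comparison only; the two function names are hypothetical and neither is proposed as a DPDK API.

```c
#include <stdatomic.h>
#include <stdint.h>

/* C11 <stdatomic.h>: the object must have an _Atomic-qualified type,
 * and atomic_fetch_add_explicit() returns the value held BEFORE the
 * add, so n is added again to get the new value. */
static inline uint64_t
c11_add_return(_Atomic uint64_t *v, uint64_t n)
{
	return atomic_fetch_add_explicit(v, n, memory_order_relaxed) + n;
}

/* GCC/clang built-in: works on a plain uint64_t, and
 * __atomic_add_fetch() returns the value AFTER the add. */
static inline uint64_t
builtin_add_return(uint64_t *v, uint64_t n)
{
	return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
}
```

The built-in form needs no change to the type of the variable, which is presumably why the patches in this series can replace `rte_atomic64_t atom` with a plain `uint64_t atom`.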


I don't think directly accessing GCC built-ins across the whole DPDK 
code base sounds like a good idea at all.


Beyond just being plain ugly, and setting a bad precedent, using 
built-ins directly also effectively prevents API extensions. Although 
C11 is new and shiny, I'm sure there will come a day when we want to 
extend this API, to make it easier for consumers and avoid code 
duplication. Some parts of the DPDK code base already today define their 
own __atomic_* functions. Bad idea to use the "__*" namespace, 
especially in a way that has a real risk of future collisions. It's also 
confusing for anyone reading the code, since they are led to believe 
it's a GCC built-in.


Direct calls to GCC built-ins also prevent the use of any other 
implementation than the GCC built-ins, if some ISA or ISA implementation 
would benefit from this. This should be avoided of course, so it's just 
a minor objection.


I think the right way to go about this is not to deprecate 
<rte_atomic.h>. Rather, <rte_atomic.h> should be reshaped into something 
that closely maps to the GCC built-ins for C11 (which seems more 
convenient than real C11 atomics). The parts of <rte_atomic.h> that 
don't fit the new model should be deprecated.
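
As a rough illustration of what such a reshaped header could look like, here is a thin wrapper layer over the built-ins. All names are hypothetical (a `demo_` prefix is used on purpose, to avoid suggesting these exist in <rte_atomic.h>); it is a sketch of the idea, not a proposed API.

```c
#include <stdint.h>

/* Hypothetical names; not part of <rte_atomic.h>. The memory-order
 * constants simply alias the compiler's own. */
#define DEMO_ATOMIC_RELAXED __ATOMIC_RELAXED
#define DEMO_ATOMIC_ACQUIRE __ATOMIC_ACQUIRE
#define DEMO_ATOMIC_RELEASE __ATOMIC_RELEASE

/* Atomic add returning the new value; the caller picks the ordering. */
static inline uint64_t
demo_atomic_add_fetch_u64(uint64_t *ptr, uint64_t val, int memorder)
{
	return __atomic_add_fetch(ptr, val, memorder);
}

/* Atomic load with caller-selected ordering. */
static inline uint64_t
demo_atomic_load_u64(const uint64_t *ptr, int memorder)
{
	return __atomic_load_n(ptr, memorder);
}

/* Atomic store with caller-selected ordering. */
static inline void
demo_atomic_store_u64(uint64_t *ptr, uint64_t val, int memorder)
{
	__atomic_store_n(ptr, val, memorder);
}
```

Call sites would then read `demo_atomic_add_fetch_u64(&sa->sqn.outb.atom, n, DEMO_ATOMIC_RELAXED)`, and the wrapper bodies remain the single place where another implementation could be substituted for a particular ISA.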


To summarize, I'm not in favor of deprecating <rte_atomic.h>. If we 
should deprecate anything, it's directly accessing compiler built-ins.

> [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> APIs (a script is added to flag the usages).
> [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
>
> This patchset contains:
> 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> 2) changes to programmer guide describing writing efficient code for aarch64.
> 3) changes to various libraries to make use of c11 atomic APIs.
>
> We are planning to replicate this idea across all the other libraries,
> drivers, examples, test applications. In the next phase, we will add
> changes to the mbuf, the EAL interrupts and the event timer adapter libraries.
>
> v3:
> add libatomic dependency for 32-bit clang
>
> v2:
> 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> 2. fix typos.
>
> Honnappa Nagarahalli (2):
>    service: avoid race condition for MT unsafe service
>    service: identify service running on another core correctly
>
> Phil Yang (10):
>    doc: add generic atomic deprecation section
>    devtools: prevent use of rte atomic APIs in future patches
>    eal/build: add libatomic dependency for 32-bit clang
>    build: remove redundant code
>    vhost: optimize broadcast rarp sync with c11 atomic
>    ipsec: optimize with c11 atomic for sa outbound sqn update
>    service: remove rte prefix from static functions
>    service: remove redundant code
>    service: optimize with c11 one-way barrier
>    service: relax barriers with C11 atomic operations
>
>   devtools/checkpatches.sh                         |   9 ++
>   doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
>   drivers/event/octeontx/meson.build               |   5 -
>   drivers/event/octeontx2/meson.build              |   5 -
>   drivers/event/opdl/meson.build                   |   5 -
>   lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
>   lib/librte_eal/meson.build                       |   6 +
>   lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
>   lib/librte_ipsec/sa.h                            |   2 +-
>   lib/librte_rcu/meson.build                       |   5 -
>   lib/librte_vhost/vhost.h                         |   2 +-
>   lib/librte_vhost/vhost_user.c                    |   7 +-
>   lib/librte_vhost/virtio_net.c                    |  16 ++-
>   13 files changed, 181 insertions(+), 119 deletions(-)
>



* Re: [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions Phil Yang
@ 2020-04-03 11:57       ` Van Haaren, Harry
  2020-04-08 10:14         ` Phil Yang
  2020-04-05 21:35       ` Honnappa Nagarahalli
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
  2 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-03 11:57 UTC (permalink / raw)
  To: Phil Yang, thomas, Ananyev, Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, stable

> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com; stable@dpdk.org
> Subject: [PATCH v3 07/12] service: remove rte prefix from static functions
> 
> Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
> Fixes: 21698354c832 ("service: introduce service cores concept")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>


This patchset needs a rebase since the EAL file movement got merged,
however I'll review here so we can include some Acks etc and make
progress.

Is this really a "Fix"? The internal function names were not exported
in the .map file, so are not part of public ABI. This is an internal
naming improvement (thanks for doing cleanup), but I don't think the
Fixes: tags make sense?

Also I'm not sure if we want to port this patch back to stable? Changing
(internal) function names seems like unnecessary churn, and hence risk to
a stable release, without any benefit?

---

<snip patch diff>


* Re: [dpdk-dev] [PATCH v3 08/12] service: remove redundant code
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 08/12] service: remove redundant code Phil Yang
@ 2020-04-03 11:58       ` Van Haaren, Harry
  2020-04-05 18:35         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-03 11:58 UTC (permalink / raw)
  To: Phil Yang, thomas, Ananyev, Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Stable

> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com; Stable@dpdk.org
> Subject: [PATCH v3 08/12] service: remove redundant code
> 
> The service id validation is verified in the calling function, remove
> the redundant code inside the service_update function.
> 
> Fixes: 21698354c832 ("service: introduce service cores concept")
> Cc: Stable@dpdk.org
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>


Same comment as patch 7/12, is this really a "Fix"? This functionality
is not "broken" in the current code? And is there value in porting
to stable? I'd see this as unnecessary churn.

As before, it is a valid cleanup (thanks), and I'd like to take it for
new DPDK releases.

Happy to Ack without Fixes or Cc Stable, if that's acceptable to you?



> ---
>  lib/librte_eal/common/rte_service.c | 31 ++++++++++++-------------------
>  1 file changed, 12 insertions(+), 19 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 2117726..557b5a9 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -552,21 +552,10 @@ rte_service_start_with_defaults(void)
>  }
> 
>  static int32_t
> -service_update(struct rte_service_spec *service, uint32_t lcore,
> +service_update(uint32_t sid, uint32_t lcore,
>  		uint32_t *set, uint32_t *enabled)
>  {
> -	uint32_t i;
> -	int32_t sid = -1;
> -
> -	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> -		if ((struct rte_service_spec *)&rte_services[i] == service &&
> -				service_valid(i)) {
> -			sid = i;
> -			break;
> -		}
> -	}
> -
> -	if (sid == -1 || lcore >= RTE_MAX_LCORE)
> +	if (lcore >= RTE_MAX_LCORE)
>  		return -EINVAL;
> 
>  	if (!lcore_states[lcore].is_service_core)
> @@ -598,19 +587,23 @@ service_update(struct rte_service_spec *service,
> uint32_t lcore,
>  int32_t
>  rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
>  {
> -	struct rte_service_spec_impl *s;
> -	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> +	/* validate ID, or return error value */
> +	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
> +		return -EINVAL;
> +
>  	uint32_t on = enabled > 0;
> -	return service_update(&s->spec, lcore, &on, 0);
> +	return service_update(id, lcore, &on, 0);
>  }
> 
>  int32_t
>  rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
>  {
> -	struct rte_service_spec_impl *s;
> -	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> +	/* validate ID, or return error value */
> +	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
> +		return -EINVAL;
> +
>  	uint32_t enabled;
> -	int ret = service_update(&s->spec, lcore, 0, &enabled);
> +	int ret = service_update(id, lcore, 0, &enabled);
>  	if (ret == 0)
>  		return enabled;
>  	return ret;
> --
> 2.7.4



* Re: [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service Phil Yang
@ 2020-04-03 11:58       ` Van Haaren, Harry
  2020-04-04 18:03         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-03 11:58 UTC (permalink / raw)
  To: Phil Yang, thomas, Ananyev, Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com; Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>; stable@dpdk.org
> Subject: [PATCH v3 09/12] service: avoid race condition for MT unsafe service
> 
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> It is possible that a MT unsafe service might get configured to
> run on another core while the service is running currently. This
> might result in the MT unsafe service running on multiple cores
> simultaneously. Use 'execute_lock' always when the service is
> MT unsafe.
> 
> Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>

We should put "fix" in the title, once converged on an implementation.

Regarding Fixes and stable backport, we should consider whether
fixing this in stable with a performance degradation, fixing it with a more
complex solution, or documenting it as a known issue is the better solution.


This fix (always taking the atomic lock) will have a negative performance
impact on existing code using services. We should investigate a way
to fix it without causing datapath performance degradation.

I think there is a way to achieve this by moving more checks/time
to the control path (lcore updating the map), and not forcing the
datapath lcore to always take an atomic.

In this particular case, we have a counter for number of iterations
that a service has done. If this increments we know that the lcore
running the service has re-entered the critical section, so would
see an updated "needs atomic" flag.

This approach may introduce a predictable branch on the datapath,
however the cost difference between a predictable branch and always taking
an atomic is order(s?) of magnitude, so a branch is much preferred.

It must be possible to avoid the datapath overhead using a scheme
like this. It will likely be more complex than your proposed change
below, however if it avoids datapath performance drops I feel that
a more complex solution is worth investigating at least.

A unit test is required to validate a fix like this - although perhaps
found by inspection/review, a real-world test to validate would give
confidence.


Thoughts on such an approach?



> ---
>  lib/librte_eal/common/rte_service.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 557b5a9..32a2f8a 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -50,6 +50,10 @@ struct rte_service_spec_impl {
>  	uint8_t internal_flags;
> 
>  	/* per service statistics */
> +	/* Indicates how many cores the service is mapped to run on.
> +	 * It does not indicate the number of cores the service is running
> +	 * on currently.
> +	 */
>  	rte_atomic32_t num_mapped_cores;
>  	uint64_t calls;
>  	uint64_t cycles_spent;
> @@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t
> service_mask,
> 
>  	cs->service_active_on_lcore[i] = 1;
> 
> -	/* check do we need cmpset, if MT safe or <= 1 core
> -	 * mapped, atomic ops are not required.
> -	 */
> -	const int use_atomics = (service_mt_safe(s) == 0) &&
> -				(rte_atomic32_read(&s->num_mapped_cores) > 1);
> -	if (use_atomics) {
> +	if (service_mt_safe(s) == 0) {
>  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
>  			return -EBUSY;
> 
> --
> 2.7.4



* Re: [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly Phil Yang
@ 2020-04-03 11:58       ` Van Haaren, Harry
  2020-04-05  2:43         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-03 11:58 UTC (permalink / raw)
  To: Phil Yang, thomas, Ananyev, Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com; Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>; stable@dpdk.org
> Subject: [PATCH v3 10/12] service: identify service running on another core
> correctly
> 
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> The logic to identify if the MT unsafe service is running on another
> core can return -EBUSY spuriously. In such cases, running the service
> becomes costlier than using atomic operations. Assume that the
> application passes the right parameters and reduces the number of
> instructions for all cases.
> 
> Cc: stable@dpdk.org
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>

Is this fixing broken functionality, or does it aim to only "optimize"?
Lack of "fixes" tag suggests optimization.

I'm cautious about the commit phrase "Assume that the application ...".
If the code was previously checking things, we must not stop checking
them now; this may introduce race conditions in existing applications?

It seems like the "serialize_mt_unsafe" branch is being pushed
further down the callgraph, and instead of branching over atomics
this patch forces always executing 2 atomics?

This feels like too specific an optimization/tradeoff, without data to
back up that there are no regressions on any DPDK supported platforms.

DPDK today doesn't have a micro-benchmark to gather such perf data, 
but I would welcome one and we can have a data-driven decision.

Hope this point-of-view makes sense, -Harry

> ---
>  lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
>  1 file changed, 8 insertions(+), 18 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 32a2f8a..0843c3c 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -360,7 +360,7 @@ service_runner_do_callback(struct rte_service_spec_impl
> *s,
>  /* Expects the service 's' is valid. */
>  static int32_t
>  service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
> -	    struct rte_service_spec_impl *s)
> +	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
>  {
>  	if (!s)
>  		return -EINVAL;
> @@ -374,7 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t
> service_mask,
> 
>  	cs->service_active_on_lcore[i] = 1;
> 
> -	if (service_mt_safe(s) == 0) {
> +	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
>  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
>  			return -EBUSY;
> 
> @@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t
> serialize_mt_unsafe)
> 
>  	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> 
> -	/* Atomically add this core to the mapped cores first, then examine if
> -	 * we can run the service. This avoids a race condition between
> -	 * checking the value, and atomically adding to the mapped count.
> +	/* Increment num_mapped_cores to indicate that the service
> +	 * is running on a core.
>  	 */
> -	if (serialize_mt_unsafe)
> -		rte_atomic32_inc(&s->num_mapped_cores);
> +	rte_atomic32_inc(&s->num_mapped_cores);
> 
> -	if (service_mt_safe(s) == 0 &&
> -			rte_atomic32_read(&s->num_mapped_cores) > 1) {
> -		if (serialize_mt_unsafe)
> -			rte_atomic32_dec(&s->num_mapped_cores);
> -		return -EBUSY;
> -	}
> -
> -	int ret = service_run(id, cs, UINT64_MAX, s);
> +	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
> 
> -	if (serialize_mt_unsafe)
> -		rte_atomic32_dec(&s->num_mapped_cores);
> +	rte_atomic32_dec(&s->num_mapped_cores);
> 
>  	return ret;
>  }
> @@ -449,7 +439,7 @@ service_runner_func(void *arg)
>  			if (!service_valid(i))
>  				continue;
>  			/* return value ignored as no change to code flow */
> -			service_run(i, cs, service_mask, service_get(i));
> +			service_run(i, cs, service_mask, service_get(i), 1);
>  		}
> 
>  		cs->loops++;
> --
> 2.7.4



* Re: [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier Phil Yang
@ 2020-04-03 11:58       ` Van Haaren, Harry
  2020-04-06  4:22         ` Honnappa Nagarahalli
  2020-04-08 10:15         ` Phil Yang
  0 siblings, 2 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-03 11:58 UTC (permalink / raw)
  To: Phil Yang, thomas, Ananyev, Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com
> Subject: [PATCH v3 11/12] service: optimize with c11 one-way barrier
> 
> The num_mapped_cores and execute_lock are synchronized with rte_atomic_XX
> APIs which is a full barrier, DMB, on aarch64. This patch optimized it with
> c11 atomic one-way barrier.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Based on discussion on-list, it seems the consensus is to not use
GCC builtins, but instead use C11 APIs "proper"? If my conclusion is
correct, the v+1 of this patchset would require updates to that style API.

Inline comments for context below, -Harry 


> ---
>  lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++----------
> -
>  1 file changed, 35 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 0843c3c..c033224 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -42,7 +42,7 @@ struct rte_service_spec_impl {
>  	 * running this service callback. When not set, a core may take the
>  	 * lock and then run the service callback.
>  	 */
> -	rte_atomic32_t execute_lock;
> +	uint32_t execute_lock;
> 
>  	/* API set/get-able variables */
>  	int8_t app_runstate;
> @@ -54,7 +54,7 @@ struct rte_service_spec_impl {
>  	 * It does not indicate the number of cores the service is running
>  	 * on currently.
>  	 */
> -	rte_atomic32_t num_mapped_cores;
> +	int32_t num_mapped_cores;

Any reason why "int32_t" or "uint32_t" is used over another?
execute_lock is a uint32_t above, num_mapped_cores is an int32_t?


>  	uint64_t calls;
>  	uint64_t cycles_spent;
>  } __rte_cache_aligned;
> @@ -332,7 +332,8 @@ rte_service_runstate_get(uint32_t id)
>  	rte_smp_rmb();
> 
>  	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
> -	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
> +	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
> +					    __ATOMIC_RELAXED) > 0);
> 
>  	return (s->app_runstate == RUNSTATE_RUNNING) &&
>  		(s->comp_runstate == RUNSTATE_RUNNING) &&
> @@ -375,11 +376,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t
> service_mask,
>  	cs->service_active_on_lcore[i] = 1;
> 
>  	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
> -		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> +		uint32_t expected = 0;
> +		/* ACQUIRE ordering here is to prevent the callback
> +		 * function from hoisting up before the execute_lock
> +		 * setting.
> +		 */
> +		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
> +			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
>  			return -EBUSY;

Let's try to improve the magic "1" and "0" constants. I believe the "1" here
is the desired "new value on success", and the 0 is "bool weak", where our
0/false constant selects the strong variant of compare exchange?

"Weak is true for weak compare_exchange, which may fail spuriously, and false
for the strong variation, which never fails spuriously.", from
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

const uint32_t on_success_value = 1;
const bool weak = 0;
__atomic_compare_exchange_n(&s->execute_lock, &expected, on_success_value, weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);


Although a bit more verbose, I feel this documents usage a lot better,
particularly for those who aren't as familiar with the C11 function
arguments order.

Admittedly with the API change to not use __builtins, perhaps this
comment is moot.
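
For reference, a self-contained sketch of the naming suggestion above, wrapped into a try-lock/unlock pair. The helper names ex_trylock() and ex_unlock() are hypothetical and only serve the illustration; the orderings are the ones used in the patch hunk:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: try to take the lock once, without spinning.
 * Returns true if the lock was free and is now held by the caller.
 */
static inline bool
ex_trylock(uint32_t *lock)
{
	uint32_t expected = 0;               /* lock must currently be free */
	const uint32_t on_success_value = 1; /* value stored if we win */
	const bool weak = false;             /* strong: no spurious failure */

	assert(lock != NULL);
	return __atomic_compare_exchange_n(lock, &expected, on_success_value,
			weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Hypothetical helper: release a lock taken with ex_trylock(). */
static inline void
ex_unlock(uint32_t *lock)
{
	assert(lock != NULL);
	/* RELEASE pairs with the ACQUIRE in ex_trylock(). */
	__atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```

A second ex_trylock() on a held lock fails and leaves the lock word untouched, which corresponds to the -EBUSY return in service_run().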


> 
>  		service_runner_do_callback(s, cs, i);
> -		rte_atomic32_clear(&s->execute_lock);
> +		/* RELEASE ordering here is used to pair with ACQUIRE
> +		 * above to achieve lock semantic.
> +		 */
> +		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
>  	} else
>  		service_runner_do_callback(s, cs, i);
> 
> @@ -415,11 +425,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t
> serialize_mt_unsafe)
>  	/* Increment num_mapped_cores to indicate that the service
>  	 * is running on a core.
>  	 */
> -	rte_atomic32_inc(&s->num_mapped_cores);
> +	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
> 
>  	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
> 
> -	rte_atomic32_dec(&s->num_mapped_cores);
> +	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
> 
>  	return ret;
>  }
> @@ -552,24 +562,32 @@ service_update(uint32_t sid, uint32_t lcore,
> 
>  	uint64_t sid_mask = UINT64_C(1) << sid;
>  	if (set) {
> -		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
> -			sid_mask;
> +		/* When multiple threads try to update the same lcore
> +		 * service concurrently, e.g. set lcore map followed
> +		 * by clear lcore map, the unsynchronized service_mask
> +		 * values have issues on the num_mapped_cores value
> +		 * consistency. So we use ACQUIRE ordering to pair with
> +		 * the RELEASE ordering to synchronize the service_mask.
> +		 */
> +		uint64_t lcore_mapped = __atomic_load_n(
> +					&lcore_states[lcore].service_mask,
> +					__ATOMIC_ACQUIRE) & sid_mask;

Thanks for the comment - it helps me understand things a bit better.
Some questions/theories to validate;
1) The service_mask ACQUIRE avoids other loads being hoisted above it, correct?

2) There are non-atomic stores to service_mask. Is it correct that the stores
themselves aren't the issue, but relative visibility of service_mask stores
vs num_mapped_cores? (Detail in (3) below)


>  		if (*set && !lcore_mapped) {
>  			lcore_states[lcore].service_mask |= sid_mask;
> -			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
> +			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
> +					    1, __ATOMIC_RELEASE);
>  		}
>  		if (!*set && lcore_mapped) {
>  			lcore_states[lcore].service_mask &= ~(sid_mask);
> -			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
> +			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
> +					    1, __ATOMIC_RELEASE);
>  		}

3) Here we update the core-local service_mask, and then update the
num_mapped_cores with an ATOMIC_RELEASE. The RELEASE here ensures
that the previous store to service_mask is guaranteed to be visible
on all cores if this store is visible. Why do we care about this property?
The service_mask is core local anyway.

4) Even with the load ACQ service_mask, and REL num_mapped_cores store, is
there not still a race-condition possible where 2 lcores simultaneously
load-ACQ the service_mask, and then both do atomic add/sub_fetch with REL?

5) Assuming the race in (4) above is real, it raises the real question - the
service-cores control APIs are not designed to be multi-thread-safe.
Orchestration of service/lcore mappings is not meant to be done by multiple
threads at the same time. Documenting this loudly may help, I'm happy to send
a patch to do so if we're agreed on the above?
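
To illustrate the interleaving I have in mind in (4), here is a single-threaded replay of it. This is a simplified sketch, not the DPDK code: struct ex_state and the ex_* helpers are hypothetical stand-ins for the lcore state and the "set" branch of service_update(), and the two "threads" are just two sequential calls in the problematic order:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the per-lcore mask and per-service count. */
struct ex_state {
	uint64_t service_mask;
	int32_t num_mapped_cores;
};

/* Step 1 of the update: sample whether the service is already mapped. */
static inline uint64_t
ex_sample(const struct ex_state *st, uint64_t sid_mask)
{
	assert(st != NULL);
	return __atomic_load_n(&st->service_mask, __ATOMIC_ACQUIRE) & sid_mask;
}

/* Step 2: act on the (possibly stale) sample, as the "set" branch does. */
static inline void
ex_map_if_unmapped(struct ex_state *st, uint64_t sid_mask, uint64_t sampled)
{
	assert(st != NULL);
	if (!sampled) {
		st->service_mask |= sid_mask;
		__atomic_add_fetch(&st->num_mapped_cores, 1, __ATOMIC_RELEASE);
	}
}

/* Replays the case where two callers sample before either one acts.
 * Returns the resulting mapped-core count for a single mapped lcore.
 */
static inline int32_t
ex_replay_race(void)
{
	struct ex_state st = { 0, 0 };
	const uint64_t sid_mask = UINT64_C(1) << 3;

	uint64_t seen_by_a = ex_sample(&st, sid_mask); /* sees "unmapped" */
	uint64_t seen_by_b = ex_sample(&st, sid_mask); /* also "unmapped" */
	ex_map_if_unmapped(&st, sid_mask, seen_by_a);
	ex_map_if_unmapped(&st, sid_mask, seen_by_b);

	return st.num_mapped_cores;
}
```

In this replay the count ends up at 2 although only one bit is set in the mask, because the check and the update are separate steps. The ACQUIRE/RELEASE orderings do not change that; only serializing the callers (or a single read-modify-write on the mask) would.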




>  	}
> 
>  	if (enabled)
>  		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
> 
> -	rte_smp_wmb();
> -
>  	return 0;
>  }
> 
> @@ -625,7 +643,8 @@ rte_service_lcore_reset_all(void)
>  		}
>  	}
>  	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
> -		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
> +		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
> +				    __ATOMIC_RELAXED);
> 
>  	rte_smp_wmb();
> 
> @@ -708,7 +727,8 @@ rte_service_lcore_stop(uint32_t lcore)
>  		int32_t enabled = service_mask & (UINT64_C(1) << i);
>  		int32_t service_running = rte_service_runstate_get(i);
>  		int32_t only_core = (1 ==
> -			rte_atomic32_read(&rte_services[i].num_mapped_cores));
> +			__atomic_load_n(&rte_services[i].num_mapped_cores,
> +					__ATOMIC_RELAXED));
> 
>  		/* if the core is mapped, and the service is running, and this
>  		 * is the only core that is mapped, the service would cease to
> --
> 2.7.4



* Re: [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations Phil Yang
@ 2020-04-03 11:58       ` Van Haaren, Harry
  2020-04-06 17:06         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-03 11:58 UTC (permalink / raw)
  To: Phil Yang, thomas, Ananyev, Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com
> Subject: [PATCH v3 12/12] service: relax barriers with C11 atomic operations
> 
> To guarantee the inter-threads visibility of the shareable domain, it
> uses a lot of rte_smp_r/wmb in the service library. This patch relaxed
> these barriers for service by using c11 atomic one-way barrier operations.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---
>  lib/librte_eal/common/rte_service.c | 45 ++++++++++++++++++++----------------
> -
>  1 file changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index c033224..d31663e 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -179,9 +179,11 @@ rte_service_set_stats_enable(uint32_t id, int32_t
> enabled)
>  	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
> 
>  	if (enabled)
> -		s->internal_flags |= SERVICE_F_STATS_ENABLED;
> +		__atomic_or_fetch(&s->internal_flags, SERVICE_F_STATS_ENABLED,
> +			__ATOMIC_RELEASE);
>  	else
> -		s->internal_flags &= ~(SERVICE_F_STATS_ENABLED);
> +		__atomic_and_fetch(&s->internal_flags,
> +			~(SERVICE_F_STATS_ENABLED), __ATOMIC_RELEASE);

Not sure why these have to become stores with RELEASE memory ordering?
(More occurrences of same Q below, just answer here?)

>  	return 0;
>  }
> @@ -193,9 +195,11 @@ rte_service_set_runstate_mapped_check(uint32_t id,
> int32_t enabled)
>  	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
> 
>  	if (enabled)
> -		s->internal_flags |= SERVICE_F_START_CHECK;
> +		__atomic_or_fetch(&s->internal_flags, SERVICE_F_START_CHECK,
> +			__ATOMIC_RELEASE);
>  	else
> -		s->internal_flags &= ~(SERVICE_F_START_CHECK);
> +		__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_START_CHECK),
> +			__ATOMIC_RELEASE);

Same as above, why do these require RELEASE?


Remainder of patch below seems to make sense - there's a wmb() involved,
hence RELEASE memory ordering.

>  	return 0;
>  }
> @@ -264,8 +268,8 @@ rte_service_component_register(const struct
> rte_service_spec *spec,
>  	s->spec = *spec;
>  	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
> 
> -	rte_smp_wmb();
> -	rte_service_count++;
> +	/* make sure the counter update after the state change. */
> +	__atomic_add_fetch(&rte_service_count, 1, __ATOMIC_RELEASE);

This makes sense to me - the RELEASE ensures that previous stores to the
s->internal_flags are visible to other cores before rte_service_count
increments atomically.
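
Spelling out my understanding of the pattern as a sketch: the writer fills in the slot with plain stores and then bumps the counter with RELEASE; a reader that loads the counter with ACQUIRE and sees the new value is then guaranteed to see the slot contents too. The struct and function names below are hypothetical, a single writer is assumed, and slot allocation is simplified to "next index" rather than the free-slot search the library does:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_MAX_SLOTS 4

/* Hypothetical, simplified registry; not the DPDK service array. */
struct ex_registry {
	uint32_t flags[EX_MAX_SLOTS];
	uint32_t count;
};

/* Writer (single writer assumed): fill the slot, then publish it.
 * Returns the slot index, or -1 if the registry is full.
 */
static inline int
ex_register(struct ex_registry *r, uint32_t flags)
{
	assert(r != NULL);
	uint32_t slot = __atomic_load_n(&r->count, __ATOMIC_RELAXED);

	if (slot >= EX_MAX_SLOTS)
		return -1;
	r->flags[slot] = flags; /* plain store, ordered by the RELEASE below */
	__atomic_add_fetch(&r->count, 1, __ATOMIC_RELEASE);
	return (int)slot;
}

/* Reader: only touch a slot once the counter says it is published.
 * Returns 0 and fills *out on success, -1 if the slot is not visible yet.
 */
static inline int
ex_read_flags(const struct ex_registry *r, uint32_t slot, uint32_t *out)
{
	assert(r != NULL && out != NULL);
	if (slot >= __atomic_load_n(&r->count, __ATOMIC_ACQUIRE))
		return -1;
	*out = r->flags[slot];
	return 0;
}
```

The ACQUIRE load in the reader is the other half of the pairing; a RELEASE on the writer side alone gives no guarantee to a reader that uses a plain or relaxed load.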


>  	if (id_ptr)
>  		*id_ptr = free_slot;
> @@ -281,9 +285,10 @@ rte_service_component_unregister(uint32_t id)
>  	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> 
>  	rte_service_count--;
> -	rte_smp_wmb();
> 
> -	s->internal_flags &= ~(SERVICE_F_REGISTERED);
> +	/* make sure the counter update before the state change. */
> +	__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_REGISTERED),
> +			   __ATOMIC_RELEASE);
> 
>  	/* clear the run-bit in all cores */
>  	for (i = 0; i < RTE_MAX_LCORE; i++)
> @@ -301,11 +306,12 @@ rte_service_component_runstate_set(uint32_t id, uint32_t
> runstate)
>  	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> 
>  	if (runstate)
> -		s->comp_runstate = RUNSTATE_RUNNING;
> +		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
> +				__ATOMIC_RELEASE);
>  	else
> -		s->comp_runstate = RUNSTATE_STOPPED;
> +		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
> +				__ATOMIC_RELEASE);
> 
> -	rte_smp_wmb();
>  	return 0;
>  }
>
> 
> @@ -316,11 +322,12 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
>  	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> 
>  	if (runstate)
> -		s->app_runstate = RUNSTATE_RUNNING;
> +		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
> +				__ATOMIC_RELEASE);
>  	else
> -		s->app_runstate = RUNSTATE_STOPPED;
> +		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
> +				__ATOMIC_RELEASE);
> 
> -	rte_smp_wmb();
>  	return 0;
>  }
> 
> @@ -442,7 +449,8 @@ service_runner_func(void *arg)
>  	const int lcore = rte_lcore_id();
>  	struct core_state *cs = &lcore_states[lcore];
> 
> -	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
> +	while (__atomic_load_n(&cs->runstate,
> +		    __ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
>  		const uint64_t service_mask = cs->service_mask;
> 
>  		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> @@ -453,8 +461,6 @@ service_runner_func(void *arg)
>  		}
> 
>  		cs->loops++;
> -
> -		rte_smp_rmb();
>  	}
> 
>  	lcore_config[lcore].state = WAIT;
> @@ -663,9 +669,8 @@ rte_service_lcore_add(uint32_t lcore)
> 
>  	/* ensure that after adding a core the mask and state are defaults */
>  	lcore_states[lcore].service_mask = 0;
> -	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
> -
> -	rte_smp_wmb();
> +	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
> +			__ATOMIC_RELEASE);
> 
>  	return rte_eal_wait_lcore(lcore);
>  }
> --
> 2.7.4



* Re: [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service
  2020-04-03 11:58       ` Van Haaren, Harry
@ 2020-04-04 18:03         ` Honnappa Nagarahalli
  2020-04-08 18:05           ` Van Haaren, Harry
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-04-04 18:03 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, stable, Honnappa Nagarahalli, nd

<snip>

> > Subject: [PATCH v3 09/12] service: avoid race condition for MT unsafe
> > service
> >
> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> > It is possible that an MT unsafe service might get configured to
> > run on another core while the service is currently running. This might
> > result in the MT unsafe service running on multiple cores
> > simultaneously. Use 'execute_lock' always when the service is MT
> > unsafe.
> >
> > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> 
> We should put "fix" in the title, once converged on an implementation.
Ok, will replace 'avoid' with 'fix' (once we agree on the solution)

> 
> Regarding Fixes and stable backport, we should consider whether fixing this in
> stable with a performance degradation, fixing it with a more complex solution,
> or documenting a known issue is the better solution.
> 
> 
> This fix (always taking the atomic lock) will have a negative performance
> impact on existing code using services. We should investigate a way to fix it
> without causing datapath performance degradation.
Trying to gauge the impact on the existing applications...
The documentation does not explicitly disallow run-time mapping of cores to services.
1) If applications are mapping cores to services at run time, they are running with a bug. IMO, a bug fix that results in a performance drop should be acceptable.
2) If the service is configured to run on a single core (num_mapped_cores == 1), but the service is set to MT unsafe, this will have a (possible) performance impact.
	a) This can be solved by setting the service to MT safe, and this can be documented. It might be a reasonable solution for applications that are compiled with future DPDK releases.
	b) We can also solve this using symbol versioning: the old version of this function will use the old code, and the new version will use the code in this patch. So, if the application is run with future DPDK releases without recompiling, it will continue to use the old version. If the application is compiled with future releases, it can use the solution in 2a. We should also consider whether this is an appropriate solution, as it would force the applications in 1) to recompile to get the fix.
3) If the service is configured to run on multiple cores (num_mapped_cores > 1), the lock is already being taken for those applications. They might see some improvement, as this patch removes a few instructions.

> 
> I think there is a way to achieve this by moving more checks/time to the
> control path (lcore updating the map), and not forcing the datapath lcore to
> always take an atomic.
I think 2a above is the solution.

> 
> In this particular case, we have a counter for number of iterations that a
Which counter are you thinking about?
None of the counters I checked are currently updated atomically. If we are going to use counters, they have to be atomic, which means additional cycles in the data path.

> service has done. If this increments we know that the lcore running the
> service has re-entered the critical section, so would see an updated "needs
> atomic" flag.
> 
> This approach may introduce a predictable branch on the datapath, however
> the cost of a predictable branch vs always taking an atomic is order(s?) of
> magnitude, so a branch is much preferred.
> 
> It must be possible to avoid the datapath overhead using a scheme like this. It
> will likely be more complex than your proposed change below, however if it
> avoids datapath performance drops I feel that a more complex solution is
> worth investigating at least.
I do not completely understand the approach you are proposing; maybe you could elaborate. It seems to be based on a counter, so the following is my assessment of what happens if we use one. Let us say we keep track of how many cores are currently running the service. We need an atomic counter other than 'num_mapped_cores'; let us call it 'num_current_cores'. The code to call the service would look like this:

1) rte_atomic32_inc(&num_current_cores); /* this results in a full memory barrier */
2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1) { /* rte_atomic32_read is not enough here as it does not provide the required memory barrier for any architecture */
3) 	run_service(); /* Call the service */
4) }
5) rte_atomic32_dec(&num_current_cores); /* Calling rte_atomic32_clear is not enough as it is not an atomic operation and does not provide the required memory barrier */

But the above code has race conditions in lines 1 and 2. It is possible that none of the cores will ever get to run the service, as they could all increment the counter simultaneously. Hence lines 1 and 2 together need to be atomic, which is nothing but a 'compare-exchange' operation.

BTW, the current code has a bug where it calls 'rte_atomic32_clear(&s->execute_lock)': it is missing memory barriers, which can result in the execute_lock being cleared before the service has finished running. I suggest changing 'execute_lock' to rte_spinlock_t and using the rte_spinlock_trylock and rte_spinlock_unlock APIs.
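
The two points above can be sketched roughly as follows. This is only an illustration using the GCC built-ins directly, not the DPDK spinlock implementation; struct svc, svc_try_enter, svc_leave and svc_run_once are hypothetical names, and the increment of 'calls' stands in for the service callback.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct svc {
	uint32_t execute_lock; /* 0 = free, 1 = held */
	uint64_t calls;        /* stands in for the work done by the callback */
};

/* Check and set in ONE atomic step; returns 0 on success, -EBUSY otherwise. */
int svc_try_enter(struct svc *s)
{
	uint32_t expected = 0;

	/* ACQUIRE on success: callback accesses cannot move above the lock. */
	if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
			0 /* strong */, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return -EBUSY;
	return 0;
}

/* RELEASE: callback accesses cannot move below the unlock. */
void svc_leave(struct svc *s)
{
	assert(__atomic_load_n(&s->execute_lock, __ATOMIC_RELAXED) == 1);
	__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
}

int svc_run_once(struct svc *s)
{
	int ret = svc_try_enter(s);

	if (ret != 0)
		return ret;
	s->calls++; /* the service callback would run here */
	svc_leave(s);
	return 0;
}
```

With this shape exactly one contender wins the compare-exchange, and the unlock cannot become visible before the callback's stores.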

> 
> A unit test is required to validate a fix like this - although perhaps found by
> inspection/review, a real-world test to validate would give confidence.
Agree, need to have a test case.

> 
> 
> Thoughts on such an approach?
> 
> 
> 
> > ---
> >  lib/librte_eal/common/rte_service.c | 11 +++++------
> >  1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index 557b5a9..32a2f8a 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -50,6 +50,10 @@ struct rte_service_spec_impl {
> >  	uint8_t internal_flags;
> >
> >  	/* per service statistics */
> > +	/* Indicates how many cores the service is mapped to run on.
> > +	 * It does not indicate the number of cores the service is running
> > +	 * on currently.
> > +	 */
> >  	rte_atomic32_t num_mapped_cores;
> >  	uint64_t calls;
> >  	uint64_t cycles_spent;
> > @@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs,
> > uint64_t service_mask,
> >
> >  	cs->service_active_on_lcore[i] = 1;
> >
> > -	/* check do we need cmpset, if MT safe or <= 1 core
> > -	 * mapped, atomic ops are not required.
> > -	 */
> > -	const int use_atomics = (service_mt_safe(s) == 0) &&
> > -				(rte_atomic32_read(&s-
> >num_mapped_cores) > 1);
> > -	if (use_atomics) {
> > +	if (service_mt_safe(s) == 0) {
> >  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> >  			return -EBUSY;
> >
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly
  2020-04-03 11:58       ` Van Haaren, Harry
@ 2020-04-05  2:43         ` Honnappa Nagarahalli
  0 siblings, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-04-05  2:43 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, stable, Honnappa Nagarahalli, nd

<snip>

> > Subject: [PATCH v3 10/12] service: identify service running on another
> > core correctly
> >
> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> > The logic to identify if the MT unsafe service is running on another
> > core can return -EBUSY spuriously. In such cases, running the service
> > becomes costlier than using atomic operations. Assume that the
> > application passes the right parameters and reduces the number of
> > instructions for all cases.
> >
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> 
> Is this fixing broken functionality, or does it aim to only "optimize"?
> Lack of "fixes" tag suggests optimization.
Good point, it is missing the 'Fixes' tag. Looking at the code again, the following are a few problems that I observe; please correct me if I am wrong:
1) Looking at the API rte_service_map_lcore_set, 'num_mapped_cores' keeps track of the number of cores the service is mapped to. However, in the function 'rte_service_run_iter_on_app_lcore', 'num_mapped_cores' is incremented only if 'serialize_mt_unsafe' is set. This will return an incorrect result from the 'rte_service_runstate_get' API when 'serialize_mt_unsafe' is not set.

2) Since incrementing 'num_mapped_cores' and checking its value are not a single atomic operation, race conditions are introduced. Consider the current code snippet from the function 'rte_service_run_iter_on_app_lcore'.

1) if (serialize_mt_unsafe)
2) 	rte_atomic32_inc(&s->num_mapped_cores);
3) if (service_mt_safe(s) == 0 && rte_atomic32_read(&s->num_mapped_cores) > 1) {
4) 	if (serialize_mt_unsafe)
5) 		rte_atomic32_dec(&s->num_mapped_cores);
6) 	return -EBUSY;
7) }

It is possible that more than 1 thread is executing line 3) concurrently, in which case all of them will hit line 6). Due to this, the service might never run, or it might take multiple attempts to run the service, wasting cycles.
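
To illustrate, the sketch below replays one such interleaving deterministically on a single thread. It is only a model of the snippet above, assuming an MT unsafe service with 'serialize_mt_unsafe' set; the function name and the local counter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Replays one unlucky interleaving of two threads on a single thread.
 * Returns how many of the two "threads" were allowed to run the service.
 */
int replay_unlucky_interleaving(void)
{
	int32_t num_mapped_cores = 0; /* local stand-in for s->num_mapped_cores */
	int ran = 0;

	/* Line 2: both threads increment before either one checks. */
	__atomic_add_fetch(&num_mapped_cores, 1, __ATOMIC_SEQ_CST); /* thread 1 */
	__atomic_add_fetch(&num_mapped_cores, 1, __ATOMIC_SEQ_CST); /* thread 2 */

	/* Line 3: both threads read the counter; both observe 2. */
	int32_t seen1 = __atomic_load_n(&num_mapped_cores, __ATOMIC_SEQ_CST);
	int32_t seen2 = __atomic_load_n(&num_mapped_cores, __ATOMIC_SEQ_CST);

	/* Lines 4-6: each thread that saw more than 1 backs out with -EBUSY. */
	if (seen1 > 1)
		__atomic_sub_fetch(&num_mapped_cores, 1, __ATOMIC_SEQ_CST);
	else
		ran++;
	if (seen2 > 1)
		__atomic_sub_fetch(&num_mapped_cores, 1, __ATOMIC_SEQ_CST);
	else
		ran++;

	/* The counter is balanced again, yet nobody ran the service. */
	assert(num_mapped_cores == 0);
	return ran;
}
```

Every individual operation here is atomic; the problem is only that the increment and the check are two separate steps.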

If you agree these are bugs, I can add the 'Fixes' tag.

> 
> I'm cautious about the commit phrase "Assume that the application ...", if the
> code was previously checking things, we must not stop checking them now,
> this may introduce race-conditions in existing applications?
What I meant here is: let us believe what the application says, i.e. if the application sets 'serialize_mt_unsafe' to 1, let us assume that the service is in fact 'unsafe'.

> 
> It seems like the "serialize_mt_unsafe" branch is being pushed further down
> the callgraph, and instead of branching over atomics this patch forces always
> executing 2 atomics?
Yes, that is correct. Please see explanation in 1) above.

> 
> This feels like too specific an optimization/tradeoff, without data to back up
> that there are no regressions on any DPDK supported platforms.
Apologies for missing the 'Fixes' tag, this patch is a bug fix.

> 
> DPDK today doesn't have a micro-benchmark to gather such perf data, but I
> would welcome one and we can have a data-driven decision.
When this was developed, was the logic that avoids taking the lock measured to show any improvement?

> 
> Hope this point-of-view makes sense, -Harry
> 
> > ---
> >  lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
> >  1 file changed, 8 insertions(+), 18 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index 32a2f8a..0843c3c 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -360,7 +360,7 @@ service_runner_do_callback(struct
> > rte_service_spec_impl *s,
> >  /* Expects the service 's' is valid. */  static int32_t
> > service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
> > -	    struct rte_service_spec_impl *s)
> > +	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
> >  {
> >  	if (!s)
> >  		return -EINVAL;
> > @@ -374,7 +374,7 @@ service_run(uint32_t i, struct core_state *cs,
> > uint64_t service_mask,
> >
> >  	cs->service_active_on_lcore[i] = 1;
> >
> > -	if (service_mt_safe(s) == 0) {
> > +	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
> >  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> >  			return -EBUSY;
> >
> > @@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id,
> > uint32_t
> > serialize_mt_unsafe)
> >
> >  	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> >
> > -	/* Atomically add this core to the mapped cores first, then examine if
> > -	 * we can run the service. This avoids a race condition between
> > -	 * checking the value, and atomically adding to the mapped count.
> > +	/* Increment num_mapped_cores to indicate that the service
> > +	 * is running on a core.
> >  	 */
> > -	if (serialize_mt_unsafe)
> > -		rte_atomic32_inc(&s->num_mapped_cores);
> > +	rte_atomic32_inc(&s->num_mapped_cores);
> >
> > -	if (service_mt_safe(s) == 0 &&
> > -			rte_atomic32_read(&s->num_mapped_cores) > 1) {
> > -		if (serialize_mt_unsafe)
> > -			rte_atomic32_dec(&s->num_mapped_cores);
> > -		return -EBUSY;
> > -	}
> > -
> > -	int ret = service_run(id, cs, UINT64_MAX, s);
> > +	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
> >
> > -	if (serialize_mt_unsafe)
> > -		rte_atomic32_dec(&s->num_mapped_cores);
> > +	rte_atomic32_dec(&s->num_mapped_cores);
> >
> >  	return ret;
> >  }
> > @@ -449,7 +439,7 @@ service_runner_func(void *arg)
> >  			if (!service_valid(i))
> >  				continue;
> >  			/* return value ignored as no change to code flow */
> > -			service_run(i, cs, service_mask, service_get(i));
> > +			service_run(i, cs, service_mask, service_get(i), 1);
> >  		}
> >
> >  		cs->loops++;
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v3 08/12] service: remove redundant code
  2020-04-03 11:58       ` Van Haaren, Harry
@ 2020-04-05 18:35         ` Honnappa Nagarahalli
  2020-04-08 10:15           ` Phil Yang
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-04-05 18:35 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Stable, Honnappa Nagarahalli, nd

<snip>

> >
> > The service id validation is verified in the calling function, remove
> > the redundant code inside the service_update function.
> >
> > Fixes: 21698354c832 ("service: introduce service cores concept")
> > Cc: Stable@dpdk.org
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> 
> Same comment as patch 7/12, is this really a "Fix"? This functionality is not
> "broken" in  the current code? And is there value in porting to stable? I'd see
> this as unnecessary churn.
> 
> As before, it is a valid cleanup (thanks), and I'd like to take it for new DPDK
> releases.
> 
> Happy to Ack without Fixes or Cc Stable, if that's acceptable to you?
Agreed.

> 
> 
> 
> > ---
> >  lib/librte_eal/common/rte_service.c | 31
> > ++++++++++++-------------------
> >  1 file changed, 12 insertions(+), 19 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index 2117726..557b5a9 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -552,21 +552,10 @@ rte_service_start_with_defaults(void)
> >  }
> >
> >  static int32_t
> > -service_update(struct rte_service_spec *service, uint32_t lcore,
> > +service_update(uint32_t sid, uint32_t lcore,
> >  		uint32_t *set, uint32_t *enabled)
The 'set' parameter does not need to be passed by reference; pass by value is enough.

> >  {
> > -	uint32_t i;
> > -	int32_t sid = -1;
> > -
> > -	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> > -		if ((struct rte_service_spec *)&rte_services[i] == service &&
> > -				service_valid(i)) {
> > -			sid = i;
> > -			break;
> > -		}
> > -	}
> > -
> > -	if (sid == -1 || lcore >= RTE_MAX_LCORE)
> > +	if (lcore >= RTE_MAX_LCORE)
> >  		return -EINVAL;
The validations look somewhat inconsistent in the service_update function: we validate some parameters and not others.
I suggest bringing the validation of the service id into this function as well, and removing it from the calling functions.

> >
> >  	if (!lcore_states[lcore].is_service_core)
> > @@ -598,19 +587,23 @@ service_update(struct rte_service_spec *service,
> > uint32_t lcore,  int32_t  rte_service_map_lcore_set(uint32_t id,
> > uint32_t lcore, uint32_t enabled)  {
> > -	struct rte_service_spec_impl *s;
> > -	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> > +	/* validate ID, or return error value */
> > +	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
> > +		return -EINVAL;
> > +
> >  	uint32_t on = enabled > 0;
We do not need the above line. 'enabled' can be passed directly to 'service_update'.

> > -	return service_update(&s->spec, lcore, &on, 0);
> > +	return service_update(id, lcore, &on, 0);
> >  }
> >
> >  int32_t
> >  rte_service_map_lcore_get(uint32_t id, uint32_t lcore)  {
> > -	struct rte_service_spec_impl *s;
> > -	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> > +	/* validate ID, or return error value */
> > +	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
> > +		return -EINVAL;
> > +
> >  	uint32_t enabled;
> > -	int ret = service_update(&s->spec, lcore, 0, &enabled);
> > +	int ret = service_update(id, lcore, 0, &enabled);
> >  	if (ret == 0)
> >  		return enabled;
> >  	return ret;
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions Phil Yang
  2020-04-03 11:57       ` Van Haaren, Harry
@ 2020-04-05 21:35       ` Honnappa Nagarahalli
  2020-04-08 10:14         ` Phil Yang
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
  2 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-04-05 21:35 UTC (permalink / raw)
  To: Phil Yang, thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, stable, Honnappa Nagarahalli, nd

<snip>

> Subject: [PATCH v3 07/12] service: remove rte prefix from static functions
> 
> Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
> Fixes: 21698354c832 ("service: introduce service cores concept")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  lib/librte_eal/common/rte_service.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index b0b78ba..2117726 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -336,7 +336,7 @@ rte_service_runstate_get(uint32_t id)  }
> 
>  static inline void
> -rte_service_runner_do_callback(struct rte_service_spec_impl *s,
> +service_runner_do_callback(struct rte_service_spec_impl *s,
>  			       struct core_state *cs, uint32_t service_idx)  {
>  	void *userdata = s->spec.callback_userdata; @@ -379,10 +379,10
> @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
>  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
>  			return -EBUSY;
> 
> -		rte_service_runner_do_callback(s, cs, i);
> +		service_runner_do_callback(s, cs, i);
>  		rte_atomic32_clear(&s->execute_lock);
>  	} else
> -		rte_service_runner_do_callback(s, cs, i);
> +		service_runner_do_callback(s, cs, i);
> 
>  	return 0;
>  }
> @@ -436,7 +436,7 @@ rte_service_run_iter_on_app_lcore(uint32_t id,
> uint32_t serialize_mt_unsafe)  }
> 
>  static int32_t
> -rte_service_runner_func(void *arg)
> +service_runner_func(void *arg)
>  {
>  	RTE_SET_USED(arg);
>  	uint32_t i;
This is a minor comment.
Since you are touching 'service_runner_func', please consider doing the below improvement:

struct core_state *cs = &lcore_states[lcore];
while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {   

The while above can be changed as follows to make it more readable

while (cs->runstate == RUNSTATE_RUNNING) {   

> @@ -706,7 +706,7 @@ rte_service_lcore_start(uint32_t lcore)
>  	 */
>  	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
> 
> -	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
> +	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
>  	/* returns -EBUSY if the core is already launched, 0 on success */
>  	return ret;
>  }
> @@ -785,7 +785,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t
> attr_id,  }
> 
<snip>



* Re: [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier
  2020-04-03 11:58       ` Van Haaren, Harry
@ 2020-04-06  4:22         ` Honnappa Nagarahalli
  2020-04-08 10:15         ` Phil Yang
  1 sibling, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-04-06  4:22 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>

> > Subject: [PATCH v3 11/12] service: optimize with c11 one-way barrier
> >
> > The num_mapped_cores and execute_lock are synchronized with
> > rte_atomic_XX APIs which is a full barrier, DMB, on aarch64. This
> > patch optimized it with
> > c11 atomic one-way barrier.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Based on discussion on-list, it seems the consensus is to not use GCC builtins,
> but instead use C11 APIs "proper"? If my conclusion is correct, the v+1 of this
> patchset would require updates to that style API.
> 
> Inline comments for context below, -Harry
> 
> 
> > ---
> >  lib/librte_eal/common/rte_service.c | 50
> > ++++++++++++++++++++++++++----------
> > -
> >  1 file changed, 35 insertions(+), 15 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index 0843c3c..c033224 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -42,7 +42,7 @@ struct rte_service_spec_impl {
> >  	 * running this service callback. When not set, a core may take the
> >  	 * lock and then run the service callback.
> >  	 */
> > -	rte_atomic32_t execute_lock;
> > +	uint32_t execute_lock;
> >
> >  	/* API set/get-able variables */
> >  	int8_t app_runstate;
> > @@ -54,7 +54,7 @@ struct rte_service_spec_impl {
> >  	 * It does not indicate the number of cores the service is running
> >  	 * on currently.
> >  	 */
> > -	rte_atomic32_t num_mapped_cores;
> > +	int32_t num_mapped_cores;
> 
> Any reason why "int32_t" or "uint32_t" is used over another?
> execute_lock is a uint32_t above, num_mapped_cores is an int32_t?
> 
> 
> >  	uint64_t calls;
> >  	uint64_t cycles_spent;
> >  } __rte_cache_aligned;
> > @@ -332,7 +332,8 @@ rte_service_runstate_get(uint32_t id)
> >  	rte_smp_rmb();
> >
> >  	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
> > -	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) >
> 0);
> > +	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
> > +					    __ATOMIC_RELAXED) > 0);
> >
> >  	return (s->app_runstate == RUNSTATE_RUNNING) &&
> >  		(s->comp_runstate == RUNSTATE_RUNNING) && @@ -375,11
> +376,20 @@
> > service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
> >  	cs->service_active_on_lcore[i] = 1;
> >
> >  	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
> > -		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> > +		uint32_t expected = 0;
> > +		/* ACQUIRE ordering here is to prevent the callback
> > +		 * function from hoisting up before the execute_lock
> > +		 * setting.
> > +		 */
> > +		if (!__atomic_compare_exchange_n(&s->execute_lock,
> &expected, 1,
> > +			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
> >  			return -EBUSY;
> 
> Let's try improve the magic "1" and "0" constants, I believe the "1" here is the
> desired "new value on success", and the 0 is "bool weak", where our 0/false
> constant implies a strongly ordered compare exchange?
> 
> "Weak is true for weak compare_exchange, which may fail spuriously, and
> false for the strong variation, which never fails spuriously.", from
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> 
> const uint32_t on_success_value = 1;
> const bool weak = 0;
> __atomic_compare_exchange_n(&s->execute_lock, &expected,
> on_success_value, weak, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
> 
> 
> Although a bit more verbose, I feel this documents usage a lot better,
> particularly for those who aren't as familiar with the C11 function arguments
> order.
> 
> Admittedly with the API change to not use __builtins, perhaps this comment is
> moot.
Suggest changing the execute_lock to rte_spinlock_t and use rte_spinlock_trylock API.

> 
> 
> >
> >  		service_runner_do_callback(s, cs, i);
> > -		rte_atomic32_clear(&s->execute_lock);
> > +		/* RELEASE ordering here is used to pair with ACQUIRE
> > +		 * above to achieve lock semantic.
> > +		 */
> > +		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
> >  	} else
> >  		service_runner_do_callback(s, cs, i);
> >
> > @@ -415,11 +425,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id,
> > uint32_t
> > serialize_mt_unsafe)
> >  	/* Increment num_mapped_cores to indicate that the service
> >  	 * is running on a core.
> >  	 */
> > -	rte_atomic32_inc(&s->num_mapped_cores);
> > +	__atomic_add_fetch(&s->num_mapped_cores, 1,
> __ATOMIC_ACQUIRE);
> >
> >  	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
> >
> > -	rte_atomic32_dec(&s->num_mapped_cores);
> > +	__atomic_sub_fetch(&s->num_mapped_cores, 1,
> __ATOMIC_RELEASE);
> >
> >  	return ret;
> >  }
> > @@ -552,24 +562,32 @@ service_update(uint32_t sid, uint32_t lcore,
> >
> >  	uint64_t sid_mask = UINT64_C(1) << sid;
> >  	if (set) {
> > -		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
> > -			sid_mask;
> > +		/* When multiple threads try to update the same lcore
> > +		 * service concurrently, e.g. set lcore map followed
> > +		 * by clear lcore map, the unsynchronized service_mask
> > +		 * values have issues on the num_mapped_cores value
> > +		 * consistency. So we use ACQUIRE ordering to pair with
> > +		 * the RELEASE ordering to synchronize the service_mask.
> > +		 */
> > +		uint64_t lcore_mapped = __atomic_load_n(
> > +					&lcore_states[lcore].service_mask,
> > +					__ATOMIC_ACQUIRE) & sid_mask;
> 
> Thanks for the comment - it helps me understand things a bit better.
> Some questions/theories to validate;
> 1) The service_mask ACQUIRE avoids other loads being hoisted above it,
> correct?
> 
> 2) There are non-atomic stores to service_mask. Is it correct that the stores
> themselves aren't the issue, but relative visibility of service_mask stores vs
> num_mapped_cores? (Detail in (3) below)
> 
> 
> >  		if (*set && !lcore_mapped) {
> >  			lcore_states[lcore].service_mask |= sid_mask;
> > -
> 	rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
> > +
> 	__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
> > +					    1, __ATOMIC_RELEASE);
> >  		}
> >  		if (!*set && lcore_mapped) {
> >  			lcore_states[lcore].service_mask &= ~(sid_mask);
> > -
> 	rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
> > +
> 	__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
> > +					    1, __ATOMIC_RELEASE);
> >  		}
> 
> 3) Here we update the core-local service_mask, and then update the
> num_mapped_cores with an ATOMIC_RELEASE. The RELEASE here ensures
> that the previous store to service_mask is guaranteed to be visible on all cores
> if this store is visible. Why do we care about this property?
> The service_mask is core local anyway.
We are working on concurrency between the reader and writer. The service_mask is local to the core, but it is accessed by a reader and writer.
I think we should wait until we conclude on the meaning of 'num_mapped_cores', as that will dictate what the order should be. For example, if it is just for statistics purposes, then we could use just the RELAXED memory order, and the order for service_mask would change as well.

> 
> 4) Even with the load ACQ service_mask, and REL num_mapped_cores store,
> is there not still a race-condition possible where 2 lcores simultaneously load-
> ACQ the service_mask, and then both do atomic add/sub_fetch with REL?
> 
> 5) Assuming 4 above race is true, it raises the real question - the service-cores
> control APIs are not designed to be multi-thread-safe. Orchestration of
> service/lcore mappings is not meant to be done by multiple threads at the
> same time. Documenting this loudly may help, I'm happy to send a patch to do
> so if we're agreed on the above?
I completely agree here. Writer-writer concurrency is another topic, and we should (for now at least) say that the control plane APIs are not thread safe.

> 
> 
> 
> 
> >  	}
> >
> >  	if (enabled)
> >  		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
> >
> > -	rte_smp_wmb();
> > -
> >  	return 0;
> >  }
> >
> > @@ -625,7 +643,8 @@ rte_service_lcore_reset_all(void)
> >  		}
> >  	}
> >  	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
> > -		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
> > +		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
> > +				    __ATOMIC_RELAXED);
> >
> >  	rte_smp_wmb();
> >
> > @@ -708,7 +727,8 @@ rte_service_lcore_stop(uint32_t lcore)
> >  		int32_t enabled = service_mask & (UINT64_C(1) << i);
> >  		int32_t service_running = rte_service_runstate_get(i);
> >  		int32_t only_core = (1 ==
> > -
> 	rte_atomic32_read(&rte_services[i].num_mapped_cores));
> > +
> 	__atomic_load_n(&rte_services[i].num_mapped_cores,
> > +					__ATOMIC_RELAXED));
> >
> >  		/* if the core is mapped, and the service is running, and this
> >  		 * is the only core that is mapped, the service would cease to
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations
  2020-04-03 11:58       ` Van Haaren, Harry
@ 2020-04-06 17:06         ` Honnappa Nagarahalli
  2020-04-08 19:42           ` Van Haaren, Harry
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-04-06 17:06 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>
Just to get us on the same page on 'techniques to communicate data from writer to reader' (apologies if this is too trivial):

Let us say that the writer has 512B (the key point is that it cannot be written atomically) that needs to be communicated to the reader.

Since the data cannot be written atomically, we need a guard variable (which can be written atomically; it can be a flag or a pointer to the data). So, the writer will store the 512B in a non-atomic way and then write to the guard variable with release memory order. So, if the guard variable is valid (set in the case of a flag, or not null in the case of a pointer), it guarantees that the 512B has been written.

The reader will read the guard variable with acquire memory order and read the 512B data only if the guard variable is valid. So, the acquire memory order on the guard variable guarantees that the load of the 512B does not happen before the guard variable is read. The validity check on the guard variable guarantees that the 512B was written before it was read.

The store(guard_variable, RELEASE) on the writer and the load(guard_variable, ACQUIRE) on the reader are said to synchronize with each other.

(the guard variable technique applies even if we are not using C11 atomics)
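
A minimal sketch of the guard variable technique described above, for illustration only (this is not code from the patch). The names payload, payload_ready, writer_publish and reader_try_consume are hypothetical; the GCC __atomic built-ins are used simply because the patch under review uses them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_SIZE 512

/* Hypothetical shared state: a payload too large to be written
 * atomically, plus a flag acting as the guard variable.
 */
static uint8_t payload[PAYLOAD_SIZE];
static uint32_t payload_ready; /* 0 = not valid, 1 = valid */

/* Writer: fill the payload with plain stores, then publish it. */
static void
writer_publish(const uint8_t *src)
{
	assert(src != NULL);
	memcpy(payload, src, PAYLOAD_SIZE);
	/* RELEASE: the payload stores above cannot be reordered
	 * after this store to the guard variable.
	 */
	__atomic_store_n(&payload_ready, 1, __ATOMIC_RELEASE);
}

/* Reader: copy the payload and return 1 if it has been published,
 * otherwise return 0 without touching the payload.
 */
static int
reader_try_consume(uint8_t *dst)
{
	assert(dst != NULL);
	/* ACQUIRE: the payload loads below cannot be reordered
	 * before this load of the guard variable.
	 */
	if (__atomic_load_n(&payload_ready, __ATOMIC_ACQUIRE) == 0)
		return 0;
	memcpy(dst, payload, PAYLOAD_SIZE);
	return 1;
}
```

In a real use the writer and reader would run on different lcores; the reader simply retries (or skips the work) while reader_try_consume() returns 0.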

Let us say that the writer has 4B (key point is - can be written atomically) that needs to be communicated to the reader. The writer is free to write this atomically with no constraints on memory ordering as long as this data is not acting as a guard variable for any other data.

In my understanding, the sequence of APIs to call to start a service (writer) is as follows:
1) rte_service_init
2) rte_service_component_register
3) <possible configuration of the service>
4) rte_service_component_runstate_set (the reader is allowed at this point to read the information about the service - written by rte_service_component_register API. This API should not be called before rte_service_component_register)
5) <possible configuration of the service>
6) rte_service_runstate_set (the reader is allowed at this point to read the information about the service - written by rte_service_component_register API and run the service. This API can be called anytime. But, the reader should not attempt to run the service before this API is called)
7) rte_service_lcore_add (probably called multiple times; can be called before this point, but not later)
8) rte_service_map_lcore_set (this can be called anytime. Can be called even if the service is not registered)
9) rte_service_lcore_start (again, this can be called anytime, even before the service is registered)

So, there are 2 guard variables - 'comp_runstate' and 'app_runstate'. Only these 2 need to have RELEASE ordering in writer and ACQUIRE ordering in reader.

We can write test cases with different orders of these API calls to prove that the memory orders we use are sufficient.

A few comments are inline based on this assessment.

> Subject: RE: [PATCH v3 12/12] service: relax barriers with C11 atomic
> operations
> 
> > From: Phil Yang <phil.yang@arm.com>
> > Sent: Tuesday, March 17, 2020 1:18 AM
> > To: thomas@monjalon.net; Van Haaren, Harry
> > <harry.van.haaren@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> > maxime.coquelin@redhat.com; dev@dpdk.org
> > Cc: david.marchand@redhat.com; jerinj@marvell.com;
> > hemant.agrawal@nxp.com; Honnappa.Nagarahalli@arm.com;
> > gavin.hu@arm.com; ruifeng.wang@arm.com; joyce.kong@arm.com;
> nd@arm.com
> > Subject: [PATCH v3 12/12] service: relax barriers with C11 atomic
> > operations
> >
> > To guarantee the inter-threads visibility of the shareable domain, it
> > uses a lot of rte_smp_r/wmb in the service library. This patch relaxed
> > these barriers for service by using c11 atomic one-way barrier operations.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > ---
> >  lib/librte_eal/common/rte_service.c | 45
> > ++++++++++++++++++++----------------
> > -
> >  1 file changed, 25 insertions(+), 20 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index c033224..d31663e 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -179,9 +179,11 @@ rte_service_set_stats_enable(uint32_t id, int32_t
> > enabled)
> >  	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
> >
> >  	if (enabled)
> > -		s->internal_flags |= SERVICE_F_STATS_ENABLED;
> > +		__atomic_or_fetch(&s->internal_flags,
> SERVICE_F_STATS_ENABLED,
> > +			__ATOMIC_RELEASE);
> >  	else
> > -		s->internal_flags &= ~(SERVICE_F_STATS_ENABLED);
> > +		__atomic_and_fetch(&s->internal_flags,
> > +			~(SERVICE_F_STATS_ENABLED), __ATOMIC_RELEASE);
> 
> Not sure why these have to become stores with RELEASE memory ordering?
> (More occurrences of same Q below, just answer here?)
Agree, 'internal_flags' is not acting as a guard variable, this should be RELAXED (similarly for the others below). Though I suggest keeping it atomic.
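
To illustrate the point (a sketch only, not the final patch): the flag update stays atomic but uses RELAXED order, since no other data is published through 'internal_flags'. The struct name, the uint32_t field width and the flag bit values below are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values, for illustration only. */
#define SERVICE_F_REGISTERED    (1u << 0)
#define SERVICE_F_STATS_ENABLED (1u << 1)

/* Hypothetical stand-in for the service structure. */
struct service_flags_sketch {
	uint32_t internal_flags;
};

/* Set or clear the stats flag atomically. RELAXED is used because
 * the flag does not guard any other data.
 */
static void
set_stats_enable_sketch(struct service_flags_sketch *s, int enabled)
{
	assert(s != NULL);
	if (enabled)
		__atomic_or_fetch(&s->internal_flags,
			SERVICE_F_STATS_ENABLED, __ATOMIC_RELAXED);
	else
		__atomic_and_fetch(&s->internal_flags,
			~SERVICE_F_STATS_ENABLED, __ATOMIC_RELAXED);
}
```

The read-modify-write remains a single atomic operation, so concurrent updates to other bits of the same word are not lost; only the ordering against surrounding memory accesses is relaxed.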

> 
> >  	return 0;
> >  }
> > @@ -193,9 +195,11 @@ rte_service_set_runstate_mapped_check(uint32_t
> > id, int32_t enabled)
> >  	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
> >
> >  	if (enabled)
> > -		s->internal_flags |= SERVICE_F_START_CHECK;
> > +		__atomic_or_fetch(&s->internal_flags,
> SERVICE_F_START_CHECK,
> > +			__ATOMIC_RELEASE);
> >  	else
> > -		s->internal_flags &= ~(SERVICE_F_START_CHECK);
> > +		__atomic_and_fetch(&s->internal_flags,
> ~(SERVICE_F_START_CHECK),
> > +			__ATOMIC_RELEASE);
> 
> Same as above, why do these require RELEASE?
Agree

> 
> 
> Remainder of patch below seems to make sense - there's a wmb() involved
> hence RELEASE m/o.
> 
> >  	return 0;
> >  }
> > @@ -264,8 +268,8 @@ rte_service_component_register(const struct
> > rte_service_spec *spec,
> >  	s->spec = *spec;
> >  	s->internal_flags |= SERVICE_F_REGISTERED |
> SERVICE_F_START_CHECK;
> >
> > -	rte_smp_wmb();
> > -	rte_service_count++;
> > +	/* make sure the counter update after the state change. */
> > +	__atomic_add_fetch(&rte_service_count, 1, __ATOMIC_RELEASE);
> 
> This makes sense to me - the RELEASE ensures that previous stores to the
> s->internal_flags are visible to other cores before rte_service_count
> increments atomically.
rte_service_count is not a guard variable, does not need RELEASE order. It is also not used by the reader. It looks like it is just a statistic being maintained.

> 
> 
> >  	if (id_ptr)
> >  		*id_ptr = free_slot;
> > @@ -281,9 +285,10 @@ rte_service_component_unregister(uint32_t id)
> >  	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> >
> >  	rte_service_count--;
> > -	rte_smp_wmb();
> >
> > -	s->internal_flags &= ~(SERVICE_F_REGISTERED);
> > +	/* make sure the counter update before the state change. */
> > +	__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_REGISTERED),
> > +			   __ATOMIC_RELEASE);
RELAXED is enough.

> >
> >  	/* clear the run-bit in all cores */
> >  	for (i = 0; i < RTE_MAX_LCORE; i++)
> > @@ -301,11 +306,12 @@ rte_service_component_runstate_set(uint32_t id,
> > uint32_t
> > runstate)
> >  	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> >
> >  	if (runstate)
> > -		s->comp_runstate = RUNSTATE_RUNNING;
> > +		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
> > +				__ATOMIC_RELEASE);
> >  	else
> > -		s->comp_runstate = RUNSTATE_STOPPED;
> > +		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
> > +				__ATOMIC_RELEASE);
Here we need a thread_fence to prevent the memory operations from a subsequent call to 'rte_service_component_unregister' from getting hoisted above this. The user should be required to call rte_service_component_runstate_set (to stop the service) before calling rte_service_component_unregister. I suggest adding a check in rte_service_component_unregister to ensure that the state is set to RUNSTATE_STOPPED. In fact, the user needs to make sure that the service has actually stopped before calling rte_service_component_unregister.
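
A rough sketch of what such a check in the unregister path could look like. This is an assumption about one possible shape, not the library's code: the struct, the field widths, the constant values, the -EBUSY return value and the choice of an ACQUIRE load (rather than a separate thread fence) are all illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SERVICE_F_REGISTERED (1u << 0)
#define RUNSTATE_STOPPED 0
#define RUNSTATE_RUNNING 1

/* Hypothetical stand-in for the service structure. */
struct service_sketch {
	uint32_t internal_flags;
	uint32_t comp_runstate;
};

/* Refuse to unregister a component that has not been stopped. */
static int
component_unregister_sketch(struct service_sketch *s)
{
	assert(s != NULL);
	/* ACQUIRE: later memory operations in this function cannot be
	 * reordered before this load of the run state.
	 */
	if (__atomic_load_n(&s->comp_runstate, __ATOMIC_ACQUIRE) !=
			RUNSTATE_STOPPED)
		return -EBUSY; /* hypothetical error code */

	__atomic_and_fetch(&s->internal_flags, ~SERVICE_F_REGISTERED,
			__ATOMIC_RELAXED);
	return 0;
}
```

The check only covers the component run state; making sure that no lcore is still inside the service callback remains the caller's responsibility.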

> >
> > -	rte_smp_wmb();
> >  	return 0;
> >  }
> >
> >
> > @@ -316,11 +322,12 @@ rte_service_runstate_set(uint32_t id, uint32_t
> runstate)
> >  	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> >
> >  	if (runstate)
> > -		s->app_runstate = RUNSTATE_RUNNING;
> > +		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
> > +				__ATOMIC_RELEASE);
> >  	else
> > -		s->app_runstate = RUNSTATE_STOPPED;
> > +		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
> > +				__ATOMIC_RELEASE);
> >
> > -	rte_smp_wmb();
> >  	return 0;
> >  }
> >
> > @@ -442,7 +449,8 @@ service_runner_func(void *arg)
> >  	const int lcore = rte_lcore_id();
> >  	struct core_state *cs = &lcore_states[lcore];
> >
> > -	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
> > +	while (__atomic_load_n(&cs->runstate,
> > +		    __ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
This can be RELAXED, lcore's runstate is not acting as a guard variable.
However, note that the writer thread wants to communicate the 'runstate' (4B) to the reader thread. This ordering needs to be handled in the 'rte_eal_remote_launch' and 'eal_thread_loop' functions. Note that in some other use case the writer may want to communicate more than 4B to the reader. Currently, the 'write' and 'read' system calls may have enough barriers to make things work fine. But, I suggest using 'lcore_config[lcore].f' as the guard variable to make it explicit and not depend on 'write' and 'read'. We can take up the EAL changes in a later patch, as this does not cause any issues right now.

> >  		const uint64_t service_mask = cs->service_mask;
> >
> >  		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> > @@ -453,8 +461,6 @@ service_runner_func(void *arg)
> >  		}
> >
> >  		cs->loops++;
> > -
> > -		rte_smp_rmb();
> >  	}
> >
> >  	lcore_config[lcore].state = WAIT;
> > @@ -663,9 +669,8 @@ rte_service_lcore_add(uint32_t lcore)
> >
> >  	/* ensure that after adding a core the mask and state are defaults */
> >  	lcore_states[lcore].service_mask = 0;
> > -	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
> > -
> > -	rte_smp_wmb();
Agree.

> > +	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
> > +			__ATOMIC_RELEASE);
This can be relaxed.

> >
> >  	return rte_eal_wait_lcore(lcore);
> >  }
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions
  2020-04-03 11:57       ` Van Haaren, Harry
@ 2020-04-08 10:14         ` Phil Yang
  2020-04-08 10:36           ` Van Haaren, Harry
  0 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-04-08 10:14 UTC (permalink / raw)
  To: Van Haaren, Harry, thomas, Ananyev, Konstantin, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa Nagarahalli,
	Gavin Hu, Ruifeng Wang, Joyce Kong, nd, stable, nd

> -----Original Message-----
> From: Van Haaren, Harry <harry.van.haaren@intel.com>
> Sent: Friday, April 3, 2020 7:58 PM
> To: Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com;
> dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu <Gavin.Hu@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>; stable@dpdk.org
> Subject: RE: [PATCH v3 07/12] service: remove rte prefix from static functions
> 
> > From: Phil Yang <phil.yang@arm.com>
> > Sent: Tuesday, March 17, 2020 1:18 AM
> > To: thomas@monjalon.net; Van Haaren, Harry
> <harry.van.haaren@intel.com>;
> > Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > stephen@networkplumber.org; maxime.coquelin@redhat.com;
> dev@dpdk.org
> > Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com;
> > Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com;
> ruifeng.wang@arm.com;
> > joyce.kong@arm.com; nd@arm.com; stable@dpdk.org
> > Subject: [PATCH v3 07/12] service: remove rte prefix from static functions
> >
> > Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
> > Fixes: 21698354c832 ("service: introduce service cores concept")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> 
> This patchset needs a rebase since the EAL file movement got merged,
> however I'll review here so we can include some Acks etc and make
> progress.
> 
> Is this really a "Fix"? The internal function names were not exported
> in the .map file, so are not part of public ABI. This is an internal
> naming improvement (thanks for doing cleanup), but I don't think the
> Fixes: tags make sense?
> 
> Also I'm not sure if we want to port this patch back to stable? Changing
> (internal) function names seems like unnecessary churn, and hence risk to a
> stable release, without any benefit?
OK.
I will remove these tags in the next version and split the service core patches out of the original series into a series of their own.

Thanks,
Phil

> 
> ---
> 
> <snip patch diff>


* Re: [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions
  2020-04-05 21:35       ` Honnappa Nagarahalli
@ 2020-04-08 10:14         ` Phil Yang
  0 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-08 10:14 UTC (permalink / raw)
  To: Honnappa Nagarahalli, thomas, harry.van.haaren,
	konstantin.ananyev, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, stable, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Monday, April 6, 2020 5:35 AM
> To: Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net;
> harry.van.haaren@intel.com; konstantin.ananyev@intel.com;
> stephen@networkplumber.org; maxime.coquelin@redhat.com;
> dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 07/12] service: remove rte prefix from static functions
> 
> <snip>
> 
> > Subject: [PATCH v3 07/12] service: remove rte prefix from static functions
> >
> > Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
> > Fixes: 21698354c832 ("service: introduce service cores concept")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >  lib/librte_eal/common/rte_service.c | 18 +++++++++---------
> >  1 file changed, 9 insertions(+), 9 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index b0b78ba..2117726 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -336,7 +336,7 @@ rte_service_runstate_get(uint32_t id)  }
> >
> >  static inline void
> > -rte_service_runner_do_callback(struct rte_service_spec_impl *s,
> > +service_runner_do_callback(struct rte_service_spec_impl *s,
> >  			       struct core_state *cs, uint32_t service_idx)  {
> >  	void *userdata = s->spec.callback_userdata; @@ -379,10 +379,10
> > @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
> >  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0,
> 1))
> >  			return -EBUSY;
> >
> > -		rte_service_runner_do_callback(s, cs, i);
> > +		service_runner_do_callback(s, cs, i);
> >  		rte_atomic32_clear(&s->execute_lock);
> >  	} else
> > -		rte_service_runner_do_callback(s, cs, i);
> > +		service_runner_do_callback(s, cs, i);
> >
> >  	return 0;
> >  }
> > @@ -436,7 +436,7 @@ rte_service_run_iter_on_app_lcore(uint32_t id,
> > uint32_t serialize_mt_unsafe)  }
> >
> >  static int32_t
> > -rte_service_runner_func(void *arg)
> > +service_runner_func(void *arg)
> >  {
> >  	RTE_SET_USED(arg);
> >  	uint32_t i;
> This is a minor comment.
> Since you are touching 'service_runner_func', please consider doing the
> below improvement:
> 
> struct core_state *cs = &lcore_states[lcore];
> while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
> 
> The while above can be changed as follows to make it more readable
> 
> while (cs->runstate == RUNSTATE_RUNNING) {
OK. I will clean it up in the next version.

Thanks,
Phil

> 
> > @@ -706,7 +706,7 @@ rte_service_lcore_start(uint32_t lcore)
> >  	 */
> >  	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
> >
> > -	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
> > +	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
> >  	/* returns -EBUSY if the core is already launched, 0 on success */
> >  	return ret;
> >  }
> > @@ -785,7 +785,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t
> > attr_id,  }
> >
> <snip>



* Re: [dpdk-dev] [PATCH v3 08/12] service: remove redundant code
  2020-04-05 18:35         ` Honnappa Nagarahalli
@ 2020-04-08 10:15           ` Phil Yang
  0 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-08 10:15 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Van Haaren, Harry, thomas, Ananyev,
	Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Stable, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Monday, April 6, 2020 2:35 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd
> <nd@arm.com>; Stable@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 08/12] service: remove redundant code
> 
> <snip>
> 
> > >
> > > The service id validation is verified in the calling function, remove
> > > the redundant code inside the service_update function.
> > >
> > > Fixes: 21698354c832 ("service: introduce service cores concept")
> > > Cc: Stable@dpdk.org
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> >
> > > Same comment as patch 7/12, is this really a "Fix"? This functionality is not
> > > "broken" in the current code? And is there value in porting to stable? I'd
> > > see this as unnecessary churn.
> >
> > As before, it is a valid cleanup (thanks), and I'd like to take it for new DPDK
> > releases.
> >
> > Happy to Ack without Fixes or Cc Stable, if that's acceptable to you?
> Agreed.

Agreed. 

> 
> >
> >
> >
> > > ---
> > >  lib/librte_eal/common/rte_service.c | 31
> > > ++++++++++++-------------------
> > >  1 file changed, 12 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/rte_service.c
> > > b/lib/librte_eal/common/rte_service.c
> > > index 2117726..557b5a9 100644
> > > --- a/lib/librte_eal/common/rte_service.c
> > > +++ b/lib/librte_eal/common/rte_service.c
> > > @@ -552,21 +552,10 @@ rte_service_start_with_defaults(void)
> > >  }
> > >
> > >  static int32_t
> > > -service_update(struct rte_service_spec *service, uint32_t lcore,
> > > +service_update(uint32_t sid, uint32_t lcore,
> > >  uint32_t *set, uint32_t *enabled)
> 'set' parameter does not need to be passed by reference, pass by value is
> enough.
Agreed.
 
> 
> > >  {
> > > -uint32_t i;
> > > -int32_t sid = -1;
> > > -
> > > -for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> > > -if ((struct rte_service_spec *)&rte_services[i] == service &&
> > > -service_valid(i)) {
> > > -sid = i;
> > > -break;
> > > -}
> > > -}
> > > -
> > > -if (sid == -1 || lcore >= RTE_MAX_LCORE)
> > > +if (lcore >= RTE_MAX_LCORE)
> > >  return -EINVAL;
> The validations look somewhat inconsistent in the service_update function: we
> are validating some parameters and not others.
> Suggest bringing the validation of the service id into this function as well
> and removing it from the calling functions.
Agreed. I will update it in the next version.

> 
> > >
> > >  if (!lcore_states[lcore].is_service_core)
> > > @@ -598,19 +587,23 @@ service_update(struct rte_service_spec
> *service,
> > > uint32_t lcore,  int32_t  rte_service_map_lcore_set(uint32_t id,
> > > uint32_t lcore, uint32_t enabled)  {
> > > -struct rte_service_spec_impl *s;
> > > -SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> > > +/* validate ID, or return error value */
> > > +if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
> > > +return -EINVAL;
> > > +
> > >  uint32_t on = enabled > 0;
> We do not need the above line. 'enabled' can be passed directly to
> 'service_update'.
Agreed.

> 
> > > -return service_update(&s->spec, lcore, &on, 0);
> > > +return service_update(id, lcore, &on, 0);
> > >  }
> > >
> > >  int32_t
> > >  rte_service_map_lcore_get(uint32_t id, uint32_t lcore)  {
> > > -struct rte_service_spec_impl *s;
> > > -SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> > > +/* validate ID, or return error value */
> > > +if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
> > > +return -EINVAL;
> > > +
> > >  uint32_t enabled;
> > > -int ret = service_update(&s->spec, lcore, 0, &enabled);
> > > +int ret = service_update(id, lcore, 0, &enabled);
> > >  if (ret == 0)
> > >  return enabled;
> > >  return ret;
> > > --
> > > 2.7.4
> 



* Re: [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier
  2020-04-03 11:58       ` Van Haaren, Harry
  2020-04-06  4:22         ` Honnappa Nagarahalli
@ 2020-04-08 10:15         ` Phil Yang
  1 sibling, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-08 10:15 UTC (permalink / raw)
  To: Van Haaren, Harry, thomas, Ananyev, Konstantin, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa Nagarahalli,
	Gavin Hu, Ruifeng Wang, Joyce Kong, nd, nd

> -----Original Message-----
> From: Van Haaren, Harry <harry.van.haaren@intel.com>
> Sent: Friday, April 3, 2020 7:58 PM
> To: Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com;
> dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu <Gavin.Hu@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 11/12] service: optimize with c11 one-way barrier
> 
> > -----Original Message-----
> > From: Phil Yang <phil.yang@arm.com>
> > Sent: Tuesday, March 17, 2020 1:18 AM
> > To: thomas@monjalon.net; Van Haaren, Harry
> <harry.van.haaren@intel.com>;
> > Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > stephen@networkplumber.org; maxime.coquelin@redhat.com;
> dev@dpdk.org
> > Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com;
> > Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com;
> ruifeng.wang@arm.com;
> > joyce.kong@arm.com; nd@arm.com
> > Subject: [PATCH v3 11/12] service: optimize with c11 one-way barrier
> >
> > The num_mapped_cores and execute_lock are synchronized with
> rte_atomic_XX
> > APIs which is a full barrier, DMB, on aarch64. This patch optimized it with
> > c11 atomic one-way barrier.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Based on discussion on-list, it seems the consensus is to not use
> GCC builtins, but instead use C11 APIs "proper"? If my conclusion is
> correct, the v+1 of this patchset would require updates to that style API.
> 
> Inline comments for context below, -Harry
> 
> 
> > ---
> >  lib/librte_eal/common/rte_service.c | 50
> ++++++++++++++++++++++++++----------
> > -
> >  1 file changed, 35 insertions(+), 15 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index 0843c3c..c033224 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -42,7 +42,7 @@ struct rte_service_spec_impl {
> >  	 * running this service callback. When not set, a core may take the
> >  	 * lock and then run the service callback.
> >  	 */
> > -	rte_atomic32_t execute_lock;
> > +	uint32_t execute_lock;
> >
> >  	/* API set/get-able variables */
> >  	int8_t app_runstate;
> > @@ -54,7 +54,7 @@ struct rte_service_spec_impl {
> >  	 * It does not indicate the number of cores the service is running
> >  	 * on currently.
> >  	 */
> > -	rte_atomic32_t num_mapped_cores;
> > +	int32_t num_mapped_cores;
> 
> Any reason why "int32_t" or "uint32_t" is used over another?
> execute_lock is a uint32_t above, num_mapped_cores is an int32_t?

It should be uint32_t for num_mapped_cores.
The value will not become negative after the __atomic_sub_fetch operation, because the sequence of writer and reader accesses is guaranteed by the memory ordering.
I will update it in the next version.
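
For reference, a small sketch of how an unsigned counter behaves with these built-ins. The names are hypothetical and RELAXED is only a placeholder here, since the required memory order for num_mapped_cores is still being discussed in this thread. With uint32_t, an unmap without a matching map does not produce -1 but wraps to UINT32_MAX, so the map/unmap sequence must stay balanced.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-service counter. */
static uint32_t num_mapped_cores_sketch;

/* Increment the counter and return the new value. */
static uint32_t
map_core_sketch(void)
{
	return __atomic_add_fetch(&num_mapped_cores_sketch, 1,
			__ATOMIC_RELAXED);
}

/* Decrement the counter and return the new value. */
static uint32_t
unmap_core_sketch(void)
{
	return __atomic_sub_fetch(&num_mapped_cores_sketch, 1,
			__ATOMIC_RELAXED);
}

/* Balanced calls return to 0; one extra unmap wraps around. */
static void
wraparound_demo(void)
{
	assert(map_core_sketch() == 1);
	assert(unmap_core_sketch() == 0);
	assert(unmap_core_sketch() == UINT32_MAX);
}
```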

Thanks,
Phil

<snip>



* Re: [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions
  2020-04-08 10:14         ` Phil Yang
@ 2020-04-08 10:36           ` Van Haaren, Harry
  2020-04-08 10:49             ` Phil Yang
  0 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-08 10:36 UTC (permalink / raw)
  To: Phil Yang, thomas, Ananyev, Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa Nagarahalli,
	Gavin Hu, Ruifeng Wang, Joyce Kong, nd, stable, nd

> -----Original Message-----
> From: Phil Yang <Phil.Yang@arm.com>
> Sent: Wednesday, April 8, 2020 11:15 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; thomas@monjalon.net;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Gavin Hu
> <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>; stable@dpdk.org; nd <nd@arm.com>
> Subject: RE: [PATCH v3 07/12] service: remove rte prefix from static functions
<snip>
> > Is this really a "Fix"? The internal function names were not exported
> > in the .map file, so are not part of public ABI. This is an internal
> > naming improvement (thanks for doing cleanup), but I don't think the
> > Fixes: tags make sense?
> >
> > Also I'm not sure if we want to port this patch back to stable? Changing
> > (internal) function names seems like unnecessary churn, and hence risk to a
> > stable release, without any benefit?
> OK.
> I will remove these tags in the next version and split the service core
> patches from the original series into a series by itself.

Cool - good idea to split.

Perhaps we should focus on getting bugfixes in for the existing code, before doing cleanup? It would make backports easier if churn is minimal.

Suggested patch order (first to last):
1. bugfixes/things to backport
2. cleanups
3. C11 atomic optimizations


> Thanks,
> Phil

Thanks, and I'll get to reading/reviewing your and Honnappa's feedback later today.

-H 


* Re: [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions
  2020-04-08 10:36           ` Van Haaren, Harry
@ 2020-04-08 10:49             ` Phil Yang
  0 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-08 10:49 UTC (permalink / raw)
  To: Van Haaren, Harry, thomas, Ananyev, Konstantin, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa Nagarahalli,
	Gavin Hu, Ruifeng Wang, Joyce Kong, nd, stable, nd, nd

> -----Original Message-----
> From: Van Haaren, Harry <harry.van.haaren@intel.com>
> Sent: Wednesday, April 8, 2020 6:37 PM
> To: Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com;
> dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu <Gavin.Hu@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>; stable@dpdk.org; nd
> <nd@arm.com>
> Subject: RE: [PATCH v3 07/12] service: remove rte prefix from static functions
> 
> > -----Original Message-----
> > From: Phil Yang <Phil.Yang@arm.com>
> > Sent: Wednesday, April 8, 2020 11:15 AM
> > To: Van Haaren, Harry <harry.van.haaren@intel.com>;
> thomas@monjalon.net;
> > Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > stephen@networkplumber.org; maxime.coquelin@redhat.com;
> dev@dpdk.org
> > Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com;
> > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Gavin Hu
> > <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce
> Kong
> > <Joyce.Kong@arm.com>; nd <nd@arm.com>; stable@dpdk.org; nd
> <nd@arm.com>
> > Subject: RE: [PATCH v3 07/12] service: remove rte prefix from static
> functions
> <snip>
> > > Is this really a "Fix"? The internal function names were not exported
> > > in the .map file, so are not part of public ABI. This is an internal
> > > naming improvement (thanks for doing cleanup), but I don't think the
> > > Fixes: tags make sense?
> > >
> > > Also I'm not sure if we want to port this patch back to stable? Changing
> > > (internal) function names seems like unnecessary churn, and hence risk
> to a
> > > stable release, without any benefit?
> > OK.
> > I will remove these tags in the next version and split the service core
> > patches from the original series into a series by itself.
> 
> Cool - good idea to split.
> 
> Perhaps we should focus on getting bugfixes in for the existing code, before
> doing cleanup? It would make backports easier if churn is minimal.
> 
> Suggesting patches order (first to last)
> 1. bugfixes/things to backport
> 2. cleanups
> 3. C11 atomic optimizations

That is a good idea. I will follow this order.

> 
> 
> > Thanks,
> > Phil
> 
> Thanks, and I'll get to reading/reviewing your and Honnappa's feedback later
> today.
> 
> -H


* Re: [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service
  2020-04-04 18:03         ` Honnappa Nagarahalli
@ 2020-04-08 18:05           ` Van Haaren, Harry
  2020-04-09  1:31             ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-08 18:05 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, stable, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Saturday, April 4, 2020 7:03 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>; stable@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 09/12] service: avoid race condition for MT unsafe
> service
> 
> <snip>
> 
> > > Subject: [PATCH v3 09/12] service: avoid race condition for MT unsafe
> > > service
> > >
> > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >
> > > It is possible that an MT unsafe service might get configured to
> > > run on another core while the service is running currently. This might
> > > result in the MT unsafe service running on multiple cores
> > > simultaneously. Use 'execute_lock' always when the service is MT
> > > unsafe.
> > >
> > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> >
> > We should put "fix" in the title, once converged on an implementation.
> Ok, will replace 'avoid' with 'fix' (once we agree on the solution)
> 
> >
> > Regarding Fixes and stable backport, we should consider if fixing this in
> stable
> > with a performance degradation, fixing with more complex solution, or
> > documenting a known issue is the better solution.
> >
> >
> > This fix (always taking the atomic lock) will have a negative performance
> > impact on existing code using services. We should investigate a way to fix
> it
> > without causing datapath performance degradation.
> Trying to gauge the impact on the existing applications...
> The documentation does not explicitly disallow run time mapping of cores to
> service.
> 1) If the applications are mapping the cores to services at run time, they are
> running with a bug. IMO, bug fix resulting in a performance drop should be
> acceptable.
> 2) If the service is configured to run on single core (num_mapped_cores == 1),
> but service is set to MT unsafe - this will have a (possible) performance
> impact.
> 	a) This can be solved by setting the service to MT safe and can be
> documented. This might be a reasonable solution for applications which are
> compiling with
>                    future DPDK releases.
> 	b) We can also solve this using symbol versioning - the old version of
> this function will use the old code, the new version of this function will use
> the code in
>                    this patch. So, if the application is run with future DPDK
> releases without recompiling, it will continue to use the old version. If the
> application is compiled
>                    with future releases, they can use solution in 2a. We also
> should think if this is an appropriate solution as this would force 1) to
> recompile to get the fix.
> 3) If the service is configured to run on multiple cores (num_mapped_cores >
> 1), then for those applications, the lock is being taken already. These
> applications might see some improvements as this patch removes few
> instructions.
>
> >
> > I think there is a way to achieve this by moving more checks/time to the
> > control path (lcore updating the map), and not forcing the datapath lcore to
> > always take an atomic.
> I think 2a above is the solution.

2a above is e.g. the Eventdev SW routines like Rx/Tx scheduler services. 
We should strive to not reduce datapath performance at all here.


> > In this particular case, we have a counter for number of iterations that a
> Which counter are you thinking about?
> All the counters I checked are not atomic operations currently. If we are
> going to use counters they have to be atomic, which means additional cycles in
> the data path.

I'll try to explain the concept better. Take this example:
 - One service core is mapped to a MT_UNSAFE service, like event/sw pmd
 - Application wants to map a 2nd lcore to the same service
 - You point out that today there is a race over the lock
    -- the lock is not taken if (num_mapped_cores == 1)
    -- this avoids an atomic lock/unlock on the datapath

To achieve our desired goal:
 - control thread doing mapping performs the following operations
    -- write service->num_mapped_cores++ (atomic not required, only single-writer allowed by APIs)
    -- MFENCE (full st-ld barrier) to flush write, and force later loads to issue after
    -- read the "calls_per_service" counter for each lcore, add them up.
    ---- Wait :)
    -- re-read the "calls_per_service", and ensure the count has changed.
    ---- The fact that the calls_per_service has changed ensures the service-lcore
         has seen the new "num_mapped_cores" value, and has now taken the lock!
    -- *now* it is safe to map the 2nd lcore to the service

There is a caveat here that the increments to the "calls_per_service" variable
must become globally-observable. To force this immediately would require a
write memory barrier, which could impact datapath performance. Given the service
is now taking a lock, the unlock() thereof would ensure the "calls_per_service"
is flushed to memory.

Note: we could use calls_per_service, or add a new variable to the service struct.
Writes to this do not need to be atomic, as it is either mapped to a single core,
or else there's a lock around it.
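
The control-path handshake above could be sketched roughly as follows. This is
a minimal single-file illustration, not the rte_service implementation:
struct svc, svc_run_iter(), svc_map_begin(), svc_map_ready() and the
locked_calls counter are hypothetical names, and the GCC/Clang __atomic
builtins are assumed. The datapath takes the lock only when more than one core
is mapped; the control thread bumps the count, issues a full fence, snapshots
the counter and then polls until it moves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified service state (not the real rte_service structs). */
struct svc {
	uint32_t num_mapped_cores; /* written by the single control thread */
	uint32_t execute_lock;     /* 0 = free, 1 = held */
	uint64_t locked_calls;     /* updated only while execute_lock is held */
};

/* Datapath: one iteration on a service lcore. Returns true if the service ran. */
static bool
svc_run_iter(struct svc *s)
{
	/* Predictable branch: pay for the atomic lock only when > 1 core is mapped. */
	if (__atomic_load_n(&s->num_mapped_cores, __ATOMIC_RELAXED) > 1) {
		uint32_t expected = 0;

		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
				false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return false; /* another core is running the service */

		/* ... the service callback would run here ... */

		/* Single writer (lock holder), so a plain read plus a store is enough. */
		__atomic_store_n(&s->locked_calls, s->locked_calls + 1,
				__ATOMIC_RELAXED);
		/* The release unlock publishes the counter update with it. */
		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
		return true;
	}

	/* ... the service callback would run here, without the lock ... */
	return true;
}

/* Control path, step 1: publish the new count and snapshot the counter. */
static uint64_t
svc_map_begin(struct svc *s)
{
	__atomic_store_n(&s->num_mapped_cores, s->num_mapped_cores + 1,
			__ATOMIC_RELAXED);
	/* Full barrier: order the store above before the loads that follow. */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	return __atomic_load_n(&s->locked_calls, __ATOMIC_ACQUIRE);
}

/*
 * Control path, step 2: poll this until it returns true. A changed counter
 * means the running lcore has completed an iteration under the lock, so it
 * has seen the new num_mapped_cores value.
 */
static bool
svc_map_ready(struct svc *s, uint64_t snapshot)
{
	return __atomic_load_n(&s->locked_calls, __ATOMIC_ACQUIRE) != snapshot;
}
```

The control thread would call svc_map_begin(), poll svc_map_ready() with the
returned snapshot, and only then start the 2nd lcore on the service.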


> > service has done. If this increments we know that the lcore running the
> > service has re-entered the critical section, so would see an updated "needs
> > atomic" flag.
> >
> > This approach may introduce a predictable branch on the datapath, however
> > the cost of a predictable branch vs always taking an atomic is order(s?) of
> > magnitude, so a branch is much preferred.
> >
> > It must be possible to avoid the datapath overhead using a scheme like this.
> It
> > will likely be more complex than your proposed change below, however if it
> > avoids datapath performance drops I feel that a more complex solution is
> > worth investigating at least.

> I do not completely understand the approach you are proposing, may be you can
> elaborate more.

Expanded above, showing a possible solution that does not require additional
atomics on the datapath.


> But, it seems to be based on a counter approach. Following is
> my assessment on what happens if we use a counter. Let us say we kept track of
> how many cores are running the service currently. We need an atomic counter
> other than 'num_mapped_cores'. Let us call that counter 'num_current_cores'.
> The code to call the service would look like below.
> 
> 1) rte_atomic32_inc(&num_current_cores); /* this results in a full memory
> barrier */
> 2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1) { /*
> rte_atomic_read is not enough here as it does not provide the required memory
> barrier for any architecture */
> 3) 	run_service(); /* Call the service */
> 4) }
> 5) rte_atomic32_dec(&num_current_cores); /* Calling rte_atomic32_clear is not
> enough as it is not an atomic operation and does not provide the required
> memory barrier */
> 
> But the above code has race conditions in lines 1 and 2. It is possible that
> none of the cores will ever get to run the service as they all could
> simultaneously increment the counter. Hence lines 1 and 2 together need to be
> atomic, which is nothing but 'compare-exchange' operation.
> 
> BTW, the current code has a bug where it calls 'rte_atomic_clear(&s-
> >execute_lock)', it is missing memory barriers which results in clearing the
> execute_lock before the service has completed running. I suggest changing the
> 'execute_lock' to rte_spinlock_t and using rte_spinlock_try_lock and
> rte_spinlock_unlock APIs.

I don't think a spinlock is what we want here:

The idea is that a service-lcore can be mapped to multiple services.
If one service is already being run (by another thread), we do not want to
spin here waiting for it to become "free" to run by this thread; the lcore
should continue to the next service that it is mapped to.


> >
> > A unit test is required to validate a fix like this - although perhaps found
> by
> > inspection/review, a real-world test to validate would give confidence.
> Agree, need to have a test case.
> 
> >
> >
> > Thoughts on such an approach?
> >
<snip patch contents>



* Re: [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations
  2020-04-06 17:06         ` Honnappa Nagarahalli
@ 2020-04-08 19:42           ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-08 19:42 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Monday, April 6, 2020 6:06 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 12/12] service: relax barriers with C11 atomic
> operations
> 
> <snip>
> Just to get us on the same page on 'techniques to communicate data from writer
> to reader' (apologies if it is too trivial)
> 
> Let us say that the writer has 512B (key point is - cannot be written
> atomically) that needs to be communicated to the reader.
> 
> Since the data cannot be written atomically, we need a guard variable (which
> can be written atomically, can be a flag or pointer to data). So, the writer
> will store 512B in non-atomic way and write to guard variable with release
> memory order. So, if the guard variable is valid (set in the case of flag or
> not null in the case of pointer), it guarantees that 512B is written.
> 
> The reader will read the guard variable with acquire memory order and read the
> 512B data only if the guard variable is valid. So, the acquire memory order on
> the guard variable guarantees that the load of 512B does not happen before the
> guard variable is read. The validity check on the guard variable guarantees
> that 512B was written before it was read.
> 
> The store(guard_variable, RELEASE) on the writer and the load(guard_variable,
> ACQUIRE) can be said as synchronizing with each other.
> 
> (the guard variable technique applies even if we are not using C11 atomics)

Yep agreed on the above.
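
For reference, a minimal sketch of the guard-variable pattern described above,
assuming the GCC/Clang __atomic builtins. struct slot, slot_publish() and
slot_consume() are hypothetical names used only for illustration; the 512B
payload is written with plain stores and the flag carries the
release/acquire ordering.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_SIZE 512

/* Hypothetical slot: the payload is too large to be written atomically. */
struct slot {
	uint8_t payload[PAYLOAD_SIZE];
	uint32_t valid; /* guard variable: 0 = not ready, 1 = payload published */
};

/* Writer: non-atomic stores to the payload, then a release store to the guard. */
static void
slot_publish(struct slot *s, const uint8_t *src)
{
	memcpy(s->payload, src, PAYLOAD_SIZE);
	/* Release: the payload stores cannot be reordered after this store. */
	__atomic_store_n(&s->valid, 1, __ATOMIC_RELEASE);
}

/* Reader: acquire load of the guard; touch the payload only if it is valid. */
static bool
slot_consume(struct slot *s, uint8_t *dst)
{
	/* Acquire: the payload loads cannot be reordered before this load. */
	if (__atomic_load_n(&s->valid, __ATOMIC_ACQUIRE) == 0)
		return false;
	memcpy(dst, s->payload, PAYLOAD_SIZE);
	return true;
}
```

A reader that gets false simply retries later; it never looks at the payload
until the guard says it is there.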


> Let us say that the writer has 4B (key point is - can be written atomically)
> that needs to be communicated to the reader. The writer is free to write this
> atomically with no constraints on memory ordering as long as this data is not
> acting as a guard variable for any other data.
> 
> In my understanding, the sequence of APIs to call to start a service (writer)
> are as follows:
> 1) rte_service_init
> 2) rte_service_component_register
> 3) <possible configuration of the service>
> 4) rte_service_component_runstate_set (the reader is allowed at this point to
> read the information about the service - written by
> rte_service_component_register API. This API should not be called before
> rte_service_component_register)
> 5) <possible configuration of the service>
> 6) rte_service_runstate_set (the reader is allowed at this point to read the
> information about the service - written by rte_service_component_register API
> and run the service. This API can be called anytime. But, the reader should
> not attempt to run the service before this API is called)
> 7) rte_lcore_service_add (multiple of these probably, can be called before
> this, can't be called later)
> 8) rte_service_map_lcore_set (this can be called anytime. Can be called even
> if the service is not registered)
> 9) rte_service_lcore_start (again, this can be called anytime, even before the
> service is registered)

I think this can be simplified, if we look at calling threads:
 - one thread is the writer/config thread, and is allowed to call anything
 --- Any updates/changes must be atomically correct in how the other threads can read state.

 - service lcores, which fundamentally spin in run(), and call services mapped to it
 --- here we need to ensure any service mapped to it is atomic, and the service is valid to run.

 - other application threads using "run on app lcore" function
 --- similar to service lcore, check for service in valid state, and allow to run.


Services are not allowed to be unregistered e.g. while running.
I'd like to avoid the "explosion of combinations" by enforcing
simple limitations (and documenting them more clearly if/where required).


> So, there are 2 guard variables - 'comp_runstate' and 'app_runstate'. Only
> these 2 need to have RELEASE ordering in writer and ACQUIRE ordering in
> reader.
> 
> We can write test cases with different orders of these API calls to prove that
> the memory orders we use are sufficient.
> 
> Few comments are inline based on this assessment.

Sure. As per the other email thread, splitting the changes into
bugfix/cleanup/C11 would likely help keep track of the changes required per
patch; it's getting hard to follow the various topics being discussed in parallel.


> > Subject: RE: [PATCH v3 12/12] service: relax barriers with C11 atomic
> > operations
> >
> > > From: Phil Yang <phil.yang@arm.com>
> > > Sent: Tuesday, March 17, 2020 1:18 AM
> > > To: thomas@monjalon.net; Van Haaren, Harry
> > > <harry.van.haaren@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> > > maxime.coquelin@redhat.com; dev@dpdk.org
> > > Cc: david.marchand@redhat.com; jerinj@marvell.com;
> > > hemant.agrawal@nxp.com; Honnappa.Nagarahalli@arm.com;
> > > gavin.hu@arm.com; ruifeng.wang@arm.com; joyce.kong@arm.com;
> > nd@arm.com
> > > Subject: [PATCH v3 12/12] service: relax barriers with C11 atomic
> > > operations
> > >
> > > To guarantee the inter-threads visibility of the shareable domain, it
> > > uses a lot of rte_smp_r/wmb in the service library. This patch relaxed
> > > these barriers for service by using c11 atomic one-way barrier operations.
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > ---
> > >  lib/librte_eal/common/rte_service.c | 45

<snip patch review comments>


* Re: [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service
  2020-04-08 18:05           ` Van Haaren, Harry
@ 2020-04-09  1:31             ` Honnappa Nagarahalli
  2020-04-09 16:46               ` Van Haaren, Harry
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-04-09  1:31 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, stable, Honnappa Nagarahalli, nd

<snip>

> >
> > > > Subject: [PATCH v3 09/12] service: avoid race condition for MT
> > > > unsafe service
> > > >
> > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > >
> > > > It is possible that an MT unsafe service might get configured
> > > > to run on another core while the service is running currently.
> > > > This might result in the MT unsafe service running on multiple
> > > > cores simultaneously. Use 'execute_lock' always when the service
> > > > is MT unsafe.
> > > >
> > > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > >
> > > We should put "fix" in the title, once converged on an implementation.
> > Ok, will replace 'avoid' with 'fix' (once we agree on the solution)
> >
> > >
> > > Regarding Fixes and stable backport, we should consider if fixing
> > > this in
> > stable
> > > with a performance degradation, fixing with more complex solution,
> > > > or documenting a known issue is the better solution.
> > >
> > >
> > > This fix (always taking the atomic lock) will have a negative
> > > performance impact on existing code using services. We should
> > > investigate a way to fix
> > it
> > > without causing datapath performance degradation.
> > Trying to gauge the impact on the existing applications...
> > The documentation does not explicitly disallow run time mapping of
> > cores to service.
> > 1) If the applications are mapping the cores to services at run time,
> > they are running with a bug. IMO, bug fix resulting in a performance
> > drop should be acceptable.
> > 2) If the service is configured to run on single core
> > (num_mapped_cores == 1), but service is set to MT unsafe - this will
> > have a (possible) performance impact.
> > 	a) This can be solved by setting the service to MT safe and can be
> > documented. This might be a reasonable solution for applications which
> > are compiling with
> >                    future DPDK releases.
> > 	b) We can also solve this using symbol versioning - the old version
> > of this function will use the old code, the new version of this
> > function will use the code in
> >                    this patch. So, if the application is run with
> > future DPDK releases without recompiling, it will continue to use the
> > old version. If the application is compiled
> >                    with future releases, they can use solution in 2a.
> > We also should think if this is an appropriate solution as this would
> > force 1) to recompile to get the fix.
> > 3) If the service is configured to run on multiple cores
> > (num_mapped_cores > 1), then for those applications, the lock is being
> > taken already. These applications might see some improvements as this
> > patch removes few instructions.
> >
> > >
> > > I think there is a way to achieve this by moving more checks/time to
> > > the control path (lcore updating the map), and not forcing the
> > > datapath lcore to always take an atomic.
> > I think 2a above is the solution.
> 
> 2a above is e.g. the Eventdev SW routines like Rx/Tx scheduler services.
I scanned through the code briefly.
I see that the Eth RX/TX and Crypto adapters set the MT_SAFE capability, so they can be ignored.
The Timer adapter and some others do not set MT_SAFE. It seems the cores to run on are mapped at run time, but it is not clear to me whether a service can get mapped to run on multiple cores. If it can, they are running with the bug.
But these are all internal to DPDK and can be fixed.
Are there no performance tests in these components that we can run?

> We should strive to not reduce datapath performance at all here.
> 
> 
> > > In this particular case, we have a counter for number of iterations
> > > that a
> > Which counter are you thinking about?
> > All the counters I checked are not atomic operations currently. If we
> > are going to use counters they have to be atomic, which means
> > additional cycles in the data path.
> 
> I'll try to explain the concept better, take this example:
>  - One service core is mapped to a MT_UNSAFE service, like event/sw pmd
>  - Application wants to map a 2nd lcore to the same service
>  - You point out that today there is a race over the lock
>     -- the lock is not taken if (num_mapped_lcores == 1)
>     -- this avoids an atomic lock/unlock on the datapath
> 
> To achieve our desired goal;
>  - control thread doing mapping performs the following operations
>     -- write service->num_mapped_lcores++ (atomic not required, only single-
> writer allowed by APIs)
This has to be atomic because of rte_service_run_iter_on_app_lcore API. Performance should be fine as this API is not called frequently. But need to consider the implications of more than one thread updating num_mapped_cores.
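
If more than one thread can update the count, the update could be an atomic
read-modify-write rather than a plain ++. A minimal sketch, assuming the
GCC/Clang __atomic builtins; struct svc_map and the two helpers are
hypothetical names, and the relaxed ordering is an assumption that covers only
the counter itself (any handshake built on top needs its own ordering).

```c
#include <stdint.h>

/* Hypothetical holder for the mapped-core count of one service. */
struct svc_map {
	uint32_t num_mapped_cores;
};

/* Atomically add one mapped core; returns the new count. */
static uint32_t
svc_map_core_add(struct svc_map *m)
{
	return __atomic_add_fetch(&m->num_mapped_cores, 1, __ATOMIC_RELAXED);
}

/* Atomically remove one mapped core; returns the new count. */
static uint32_t
svc_map_core_del(struct svc_map *m)
{
	return __atomic_sub_fetch(&m->num_mapped_cores, 1, __ATOMIC_RELAXED);
}
```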

>     -- MFENCE (full st-ld barrier) to flush write, and force later loads to issue
> after
I am not exactly sure what MFENCE on x86 does. On Arm platforms, the full barrier (DMB ISH) just makes sure that memory operations are not re-ordered around it. It does not say anything about when that store becomes visible to other cores; it will be visible to them at some point in time.
But, I do not think we need to be worried about flushing to memory.

>     -- read the "calls_per_service" counter for each lcores, add them up.
This can be trimmed down to the single core the service is mapped to currently, no need to add all the counters.

>     ---- Wait :)
>     -- re-read the "calls_per_service", and ensure the count has changed.
Basically, polling. This causes more traffic on the interconnect between the cores. But might be ok since this API might not be called frequently.

>     ---- The fact that the calls_per_service has changed ensures the service-
> lcore
>          has seen the new "num_mapped_cores" value, and has now taken the
> lock!
>     -- *now* it is safe to map the 2nd lcore to the service
> 
> There is a caveat here that the increments to the "calls_per_service" variable
> must become globally-observable. To force this immediately would require a
> write memory barrier, which could impact datapath performance. Given the
> service is now taking a lock, the unlock() thereof would ensure the
> "calls_per_service"
> is flushed to memory.
If we increment this variable only when the lock is held, we should be fine. We could have a separate variable.

> 
> Note: we could use calls_per_service, or add a new variable to the service
> struct.
> Writes to this do not need to be atomic, as it is either mapped to a single core,
> or else there's a lock around it.
I think it is better to have a separate variable that is updated only when the lock is held.
I do not see any change in API sequence. We do this hand-shake only if the service is running (which is all controlled in the writer thread), correct?

This does not solve the problem with rte_service_run_iter_on_app_lcore getting called on multiple cores concurrently for the same service. 

> 
> 
> > > service has done. If this increments we know that the lcore running
> > > the service has re-entered the critical section, so would see an
> > > updated "needs atomic" flag.
> > >
> > > This approach may introduce a predictable branch on the datapath,
> > > however the cost of a predictable branch vs always taking an atomic
> > > is order(s?) of magnitude, so a branch is much preferred.
> > >
> > > It must be possible to avoid the datapath overhead using a scheme like
> this.
> > It
> > > will likely be more complex than your proposed change below, however
> > > if it avoids datapath performance drops I feel that a more complex
> > > solution is worth investigating at least.
> 
> > I do not completely understand the approach you are proposing, may be
> > you can elaborate more.
> 
> Expanded above, showing a possible solution that does not require additional
> atomics on the datapath.
> 
> 
> > But, it seems to be based on a counter approach. Following is my
> > assessment on what happens if we use a counter. Let us say we kept
> > track of how many cores are running the service currently. We need an
> > atomic counter other than 'num_mapped_cores'. Let us call that counter
> 'num_current_cores'.
> > The code to call the service would look like below.
> >
> > 1) rte_atomic32_inc(&num_current_cores); /* this results in a full
> > memory barrier */
> > 2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1) {
> > /* rte_atomic_read is not enough here as it does not provide the
> > required memory barrier for any architecture */
> > 3) 	run_service(); /* Call the service */
> > 4) }
> > 5) rte_atomic32_dec(&num_current_cores); /* Calling rte_atomic32_clear
> > is not enough as it is not an atomic operation and does not provide
> > the required memory barrier */
> >
> > But the above code has race conditions in lines 1 and 2. It is
> > possible that none of the cores will ever get to run the service as
> > they all could simultaneously increment the counter. Hence lines 1 and
> > 2 together need to be atomic, which is nothing but 'compare-exchange'
> operation.
> >
> > BTW, the current code has a bug where it calls 'rte_atomic_clear(&s-
> > >execute_lock)', it is missing memory barriers which results in
> > >clearing the
> > execute_lock before the service has completed running. I suggest
> > changing the 'execute_lock' to rte_spinlock_t and using
> > rte_spinlock_try_lock and rte_spinlock_unlock APIs.
> 
> I don't think a spinlock is what we want here:
> 
> The idea is that a service-lcore can be mapped to multiple services.
> If one service is already being run (by another thread), we do not want to spin
> here waiting for it to become "free" to run by this thread, it should continue
> to the next service that it is mapped to.
Agree. I am suggesting the use of 'rte_spinlock_trylock' (which does not spin); it is nothing but a 'compare-exchange'. Since the API is available, we should make use of it instead of repeating the code.
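
To illustrate the point, a minimal sketch of a try-lock built on a single
compare-exchange, with a service loop that skips busy services instead of
waiting. It assumes the GCC/Clang __atomic builtins; struct try_lock,
struct service and run_mapped_services() are hypothetical stand-ins, not the
DPDK definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a spinlock word: 0 = free, 1 = held. */
struct try_lock {
	int locked;
};

/* One compare-exchange attempt: succeeds only if the lock was free. */
static bool
try_lock_acquire(struct try_lock *l)
{
	int expected = 0;

	return __atomic_compare_exchange_n(&l->locked, &expected, 1, false,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Release store: work done under the lock completes before it looks free. */
static void
try_lock_release(struct try_lock *l)
{
	__atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}

/* Hypothetical service: a lock plus a call counter standing in for the work. */
struct service {
	struct try_lock execute_lock;
	uint64_t calls;
};

/* One pass of a service lcore: run free services, skip the busy ones. */
static unsigned int
run_mapped_services(struct service *svcs, unsigned int n)
{
	unsigned int ran = 0;

	for (unsigned int i = 0; i < n; i++) {
		if (!try_lock_acquire(&svcs[i].execute_lock))
			continue; /* busy on another core: move to the next service */
		svcs[i].calls++; /* the service callback would run here */
		try_lock_release(&svcs[i].execute_lock);
		ran++;
	}
	return ran;
}
```

The release store in try_lock_release() is what prevents the lock from being
seen as free before the service has finished running.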

> 
> 
> > >
> > > A unit test is required to validate a fix like this - although
> > > perhaps found
> > by
> > > inspection/review, a real-world test to validate would give confidence.
> > Agree, need to have a test case.
> >
> > >
> > >
> > > Thoughts on such an approach?
> > >
> <snip patch contents>



* Re: [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service
  2020-04-09  1:31             ` Honnappa Nagarahalli
@ 2020-04-09 16:46               ` Van Haaren, Harry
  2020-04-18  6:21                 ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-09 16:46 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, stable, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Thursday, April 9, 2020 2:32 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>; stable@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 09/12] service: avoid race condition for MT unsafe
> service
> 
> <snip>
> 
> > >
> > > > > Subject: [PATCH v3 09/12] service: avoid race condition for MT
> > > > > unsafe service
> > > > >
> > > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > >
> > > > > > It is possible that an MT unsafe service might get configured
> > > > > to run on another core while the service is running currently.
> > > > > This might result in the MT unsafe service running on multiple
> > > > > cores simultaneously. Use 'execute_lock' always when the service
> > > > > is MT unsafe.
> > > > >
> > > > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > > > > Cc: stable@dpdk.org
> > > > >
> > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > >
> > > > We should put "fix" in the title, once converged on an implementation.
> > > Ok, will replace 'avoid' with 'fix' (once we agree on the solution)
> > >
> > > >
> > > > Regarding Fixes and stable backport, we should consider if fixing
> > > > this in
> > > stable
> > > > with a performance degradation, fixing with more complex solution,
> > > > > or documenting a known issue is the better solution.
> > > >
> > > >
> > > > This fix (always taking the atomic lock) will have a negative
> > > > performance impact on existing code using services. We should
> > > > investigate a way to fix
> > > it
> > > > without causing datapath performance degradation.
> > > Trying to gauge the impact on the existing applications...
> > > The documentation does not explicitly disallow run time mapping of
> > > cores to service.
> > > 1) If the applications are mapping the cores to services at run time,
> > > they are running with a bug. IMO, bug fix resulting in a performance
> > > drop should be acceptable.
> > > 2) If the service is configured to run on single core
> > > (num_mapped_cores == 1), but service is set to MT unsafe - this will
> > > have a (possible) performance impact.
> > > 	a) This can be solved by setting the service to MT safe and can be
> > > documented. This might be a reasonable solution for applications which
> > > are compiling with
> > >                    future DPDK releases.
> > > 	b) We can also solve this using symbol versioning - the old version
> > > of this function will use the old code, the new version of this
> > > function will use the code in
> > >                    this patch. So, if the application is run with
> > > future DPDK releases without recompiling, it will continue to use the
> > > old version. If the application is compiled
> > >                    with future releases, they can use solution in 2a.
> > > We also should think if this is an appropriate solution as this would
> > > force 1) to recompile to get the fix.
> > > 3) If the service is configured to run on multiple cores
> > > (num_mapped_cores > 1), then for those applications, the lock is being
> > > taken already. These applications might see some improvements as this
> > > patch removes few instructions.
> > >
> > > >
> > > > I think there is a way to achieve this by moving more checks/time to
> > > > the control path (lcore updating the map), and not forcing the
> > > > datapath lcore to always take an atomic.
> > > I think 2a above is the solution.
> >
> > 2a above is e.g. the Eventdev SW routines like Rx/Tx scheduler services.
> I scanned through the code briefly
> I see that Eth RX/TX, Crypto adapters are setting the MT_SAFE capabilities,
> can be ignored.
> Timer adaptor and some others do not set MT_SAFE. Seems like the cores to run
> on are mapped during run time. But it is not clear to me if it can get mapped
> to run on multiple cores. If they are, they are running with the bug.

EAL will map each service to a single lcore. It will "round-robin" if
there are more services than service-lcores to run them on. So agree
that DPDK's default mappings will not suffer this issue.
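
As a rough illustration of that round-robin behaviour (a sketch only, not the
EAL code; map_services_round_robin() and its parameters are hypothetical):
each service gets exactly one lcore, and the lcore list wraps around when
there are more services than service-lcores.

```c
#include <stdint.h>

/*
 * Assign each service to exactly one service lcore, wrapping around the
 * lcore list when there are more services than lcores.
 * n_lcores must be greater than 0.
 */
static void
map_services_round_robin(const uint32_t *lcore_ids, uint32_t n_lcores,
		uint32_t *svc_to_lcore, uint32_t n_services)
{
	for (uint32_t i = 0; i < n_services; i++)
		svc_to_lcore[i] = lcore_ids[i % n_lcores];
}
```

With such a mapping an lcore may run several services, but no service is
mapped to more than one lcore, so num_mapped_cores stays at 1.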


> But, these are all internal to DPDK and can be fixed.
> Are there no performance tests in these components that we can run?
>
> > We should strive to not reduce datapath performance at all here.
> >
> >
> > > > In this particular case, we have a counter for number of iterations
> > > > that a
> > > Which counter are you thinking about?
> > > All the counters I checked are not atomic operations currently. If we
> > > are going to use counters they have to be atomic, which means
> > > additional cycles in the data path.
> >
> > I'll try to explain the concept better, take this example:
> >  - One service core is mapped to a MT_UNSAFE service, like event/sw pmd
> >  - Application wants to map a 2nd lcore to the same service
> >  - You point out that today there is a race over the lock
> >     -- the lock is not taken if (num_mapped_lcores == 1)
> >     -- this avoids an atomic lock/unlock on the datapath
> >
> > To achieve our desired goal;
> >  - control thread doing mapping performs the following operations
> >     -- write service->num_mapped_lcores++ (atomic not required, only single-
> > writer allowed by APIs)
> This has to be atomic because of rte_service_run_iter_on_app_lcore API.
> Performance should be fine as this API is not called frequently. But need to
> consider the implications of more than one thread updating num_mapped_cores.
> 
> >     -- MFENCE (full st-ld barrier) to flush write, and force later loads to
> issue
> > after
> I am not exactly sure what MFENCE on x86 does. On Arm platforms, the full
> barrier (DMB ISH) just makes sure that memory operations are not re-ordered
> around it. It does not say anything about when that store is visible to other
> cores. It will be visible at some point in time to cores.
> But, I do not think we need to be worried about flushing to memory.
> 
> >     -- read the "calls_per_service" counter for each lcores, add them up.
> This can be trimmed down to the single core the service is mapped to
> currently, no need to add all the counters.

Correct - however that requires figuring out which lcore is running the
service. Anyway, agree - it's an implementation detail as to exactly how
we detect it.

> 
> >     ---- Wait :)
> >     -- re-read the "calls_per_service", and ensure the count has changed.
> Basically, polling. This causes more traffic on the interconnect between the
> cores. But might be ok since this API might not be called frequently.

Agree this will not be called frequently, and that some polling here will
not be a problem.


> >     ---- The fact that the calls_per_service has changed ensures the
> service-
> > lcore
> >          has seen the new "num_mapped_cores" value, and has now taken the
> > lock!
> >     -- *now* it is safe to map the 2nd lcore to the service
> >
> > There is a caveat here that the increments to the "calls_per_service"
> variable
> > must become globally-observable. To force this immediately would require a
> > write memory barrier, which could impact datapath performance. Given the
> > service is now taking a lock, the unlock() thereof would ensure the
> > "calls_per_service"
> > is flushed to memory.
> If we increment this variable only when the lock is held, we should be fine.
> We could have a separate variable.

Sure, if a separate variable is preferred that's fine with me.


> > Note: we could use calls_per_service, or add a new variable to the service
> > struct.
> > Writes to this do not need to be atomic, as it is either mapped to a single
> core,
> > or else there's a lock around it.
> I think it is better to have a separate variable that is updated only when the
> lock is held.
> I do not see any change in API sequence. We do this hand-shake only if the
> service is running (which is all controlled in the writer thread), correct?

Yes this increment can be localized to just the branch when the unlock() occurs,
as that is the only time it could make a difference.
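
The hand-shake discussed above can be sketched in portable C11 as follows. This is only an illustration under stated assumptions: `struct svc`, `svc_run_iter`, `svc_map_begin`, `svc_map_observed` and the separate `locked_iters` counter are hypothetical names, not the rte_service implementation, and the sketch assumes the service lcore is actually running iterations.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified service state; not the real rte_service_spec_impl. */
struct svc {
	atomic_uint num_mapped_cores;      /* written by the control thread only */
	atomic_flag execute_lock;          /* taken only when > 1 core is mapped */
	atomic_uint_fast64_t locked_iters; /* bumped only while the lock is held */
	uint64_t work_done;                /* stands in for the service callback */
};

static void
svc_init(struct svc *s)
{
	atomic_init(&s->num_mapped_cores, 1);
	atomic_flag_clear(&s->execute_lock);
	atomic_init(&s->locked_iters, 0);
	s->work_done = 0;
}

/* Datapath: one iteration of an MT unsafe service on a service lcore. */
static int
svc_run_iter(struct svc *s)
{
	bool need_lock = atomic_load_explicit(&s->num_mapped_cores,
					      memory_order_acquire) > 1;

	if (need_lock && atomic_flag_test_and_set_explicit(&s->execute_lock,
							   memory_order_acquire))
		return -EBUSY; /* another core is running the service */

	s->work_done++;

	if (need_lock) {
		/* Only the locked branch pays for the extra increment. */
		atomic_fetch_add_explicit(&s->locked_iters, 1,
					  memory_order_release);
		atomic_flag_clear_explicit(&s->execute_lock,
					   memory_order_release);
	}
	return 0;
}

/* Control path, step 1: publish the new count, then snapshot the counter. */
static uint64_t
svc_map_begin(struct svc *s)
{
	atomic_fetch_add_explicit(&s->num_mapped_cores, 1,
				  memory_order_seq_cst);
	return atomic_load_explicit(&s->locked_iters, memory_order_acquire);
}

/* Control path, step 2: true once a locked iteration has been observed. */
static bool
svc_map_observed(struct svc *s, uint64_t snapshot)
{
	return atomic_load_explicit(&s->locked_iters,
				    memory_order_acquire) != snapshot;
}
```

The control thread would call svc_map_begin(), poll svc_map_observed() until it returns true, and only then let the second lcore run the service.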

> This does not solve the problem with rte_service_run_iter_on_app_lcore getting
> called on multiple cores concurrently for the same service.

Agreed. This "on_app_lcore" API was an addition required to enable unit-testing
in a sane way, to run iterations of eg Eventdev PMD.

I am in favor of documenting that the application is responsible to ensure
the service being run on a specific application lcore is not concurrently
running on another application lcore.


> > > > service has done. If this increments we know that the lcore running
> > > > the service has re-entered the critical section, so would see an
> > > > updated "needs atomic" flag.
> > > >
> > > > This approach may introduce a predictable branch on the datapath,
> > > > however the cost of a predictable branch vs always taking an atomic
> > > > is order(s?) of magnitude, so a branch is much preferred.
> > > >
> > > > It must be possible to avoid the datapath overhead using a scheme like
> > this.
> > > It
> > > > will likely be more complex than your proposed change below, however
> > > > if it avoids datapath performance drops I feel that a more complex
> > > > solution is worth investigating at least.
> >
> > > I do not completely understand the approach you are proposing, may be
> > > you can elaborate more.
> >
> > Expanded above, showing a possible solution that does not require additional
> > atomics on the datapath.
> >
> >
> > > But, it seems to be based on a counter approach. Following is my
> > > assessment on what happens if we use a counter. Let us say we kept
> > > track of how many cores are running the service currently. We need an
> > > atomic counter other than 'num_mapped_cores'. Let us call that counter
> > 'num_current_cores'.
> > > The code to call the service would look like below.
> > >
> > > 1) rte_atomic32_inc(&num_current_cores); /* this results in a full
> > > memory barrier */
> > > 2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1) {
> > > /* rte_atomic_read is not enough here as it does not provide the
> > > required memory barrier for any architecture */
> > > 3) 	run_service(); /* Call the service */
> > > 4) }
> > > 5) rte_atomic32_sub(&num_current_cores); /* Calling rte_atomic32_clear
> > > is not enough as it is not an atomic operation and does not provide
> > > the required memory barrier */
> > >
> > > But the above code has race conditions in lines 1 and 2. It is
> > > possible that none of the cores will ever get to run the service as
> > > they all could simultaneously increment the counter. Hence lines 1 and
> > > 2 together need to be atomic, which is nothing but 'compare-exchange'
> > operation.
> > >
> > > BTW, the current code has a bug where it calls 'rte_atomic_clear(&s-
> > > >execute_lock)', it is missing memory barriers which results in
> > > >clearing the
> > > execute_lock before the service has completed running. I suggest
> > > changing the 'execute_lock' to rte_spinlock_t and using
> > > rte_spinlock_try_lock and rte_spinlock_unlock APIs.
> >
> > I don't think a spinlock is what we want here:
> >
> > The idea is that a service-lcore can be mapped to multiple services.
> > If one service is already being run (by another thread), we do not want to
> spin
> > here waiting for it to become "free" to run by this thread, it should
> continue
> > to the next service that it is mapped to.
> Agree. I am suggesting to use 'rte_spinlock_try_lock' (does not spin) which is
> nothing but 'compare-exchange'. Since the API is available, we should make use
> of it instead of repeating the code.

Ah apologies, I misread the spinlock usage. Sure if the spinlock_t code
is preferred I'm ok with a change. It would be clean to have a separate
patch in the patchset to make this change, and have it later in the set
than the changes for backporting to ease integration with stable branch.
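
For reference, a try-lock built on compare-exchange, and a loop that skips a busy service instead of spinning on it, might look like the sketch below. It uses plain C11 atomics rather than the DPDK spinlock API, and `struct trylock`, `trylock_acquire`, `trylock_release` and `run_mapped_services` are hypothetical names used only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
struct trylock {
	atomic_int locked;
};

/* Single compare-exchange attempt; never spins. */
static bool
trylock_acquire(struct trylock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->locked, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Release store orders the service's writes before the unlock. */
static void
trylock_release(struct trylock *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Walk the services mapped to this lcore, skipping any that are busy. */
static unsigned int
run_mapped_services(struct trylock *locks, unsigned int n)
{
	unsigned int ran = 0;

	for (unsigned int i = 0; i < n; i++) {
		if (!trylock_acquire(&locks[i]))
			continue; /* busy on another core: go to the next one */
		ran++; /* the service callback would be invoked here */
		trylock_release(&locks[i]);
	}
	return ran;
}
```

The release store in trylock_release() is what keeps the service's writes from being reordered after the unlock, which is the barrier the plain clear was said to be missing.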


> > > > A unit test is required to validate a fix like this - although
> > > > perhaps found
> > > by
> > > > inspection/review, a real-world test to validate would give confidence.
> > > Agree, need to have a test case.
> > >
> > > >
> > > >
> > > > Thoughts on such an approach?
> > > >
> > <snip patch contents>



* Re: [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service
  2020-04-09 16:46               ` Van Haaren, Harry
@ 2020-04-18  6:21                 ` Honnappa Nagarahalli
  2020-04-21 17:43                   ` Van Haaren, Harry
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-04-18  6:21 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, stable, nd, Honnappa Nagarahalli, nd

<snip>

> > > >
> > > > > > Subject: [PATCH v3 09/12] service: avoid race condition for MT
> > > > > > unsafe service
> > > > > >
> > > > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > >
> > > > > > > It is possible that an MT unsafe service might get
> > > > > > configured to run on another core while the service is running
> currently.
> > > > > > This might result in the MT unsafe service running on multiple
> > > > > > cores simultaneously. Use 'execute_lock' always when the
> > > > > > service is MT unsafe.
> > > > > >
> > > > > > Fixes: e9139a32f6e8 ("service: add function to run on app
> > > > > > lcore")
> > > > > > Cc: stable@dpdk.org
> > > > > >
> > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > <honnappa.nagarahalli@arm.com>
> > > > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > >
> > > > > We should put "fix" in the title, once converged on an implementation.
> > > > Ok, will replace 'avoid' with 'fix' (once we agree on the
> > > > solution)
> > > >
> > > > >
> > > > > Regarding Fixes and stable backport, we should consider if
> > > > > fixing this in
> > > > stable
> > > > > with a performance degradation, fixing with more complex
> > > > > solution, or documenting a known issue a better solution.
> > > > >
> > > > >
> > > > > This fix (always taking the atomic lock) will have a negative
> > > > > performance impact on existing code using services. We should
> > > > > investigate a way to fix
> > > > it
> > > > > without causing datapath performance degradation.
> > > > Trying to gauge the impact on the existing applications...
> > > > The documentation does not explicitly disallow run time mapping of
> > > > cores to service.
> > > > 1) If the applications are mapping the cores to services at run
> > > > time, they are running with a bug. IMO, bug fix resulting in a
> > > > performance drop should be acceptable.
> > > > 2) If the service is configured to run on single core
> > > > (num_mapped_cores == 1), but service is set to MT unsafe - this
> > > > will have a (possible) performance impact.
> > > > 	a) This can be solved by setting the service to MT safe and can
> > > > be documented. This might be a reasonable solution for
> > > > applications which are compiling with
> > > >                    future DPDK releases.
> > > > 	b) We can also solve this using symbol versioning - the old
> > > > version of this function will use the old code, the new version of
> > > > this function will use the code in
> > > >                    this patch. So, if the application is run with
> > > > future DPDK releases without recompiling, it will continue to use
> > > > the old version. If the application is compiled
> > > >                    with future releases, they can use solution in 2a.
> > > > We also should think if this is an appropriate solution as this
> > > > would force 1) to recompile to get the fix.
> > > > 3) If the service is configured to run on multiple cores
> > > > (num_mapped_cores > 1), then for those applications, the lock is
> > > > being taken already. These applications might see some
> > > > improvements as this patch removes few instructions.
> > > >
> > > > >
> > > > > I think there is a way to achieve this by moving more
> > > > > checks/time to the control path (lcore updating the map), and
> > > > > not forcing the datapath lcore to always take an atomic.
> > > > I think 2a above is the solution.
> > >
> > > 2a above is e.g. the Eventdev SW routines like Rx/Tx scheduler services.
> > I scanned through the code briefly
> > I see that Eth RX/TX, Crypto adapters are setting the MT_SAFE
> > capabilities, can be ignored.
> > Timer adaptor and some others do not set MT_SAFE. Seems like the cores
> > to run on are mapped during run time. But it is not clear to me if it
> > can get mapped to run on multiple cores. If they are, they are running with
> the bug.
> 
> EAL will map each service to a single lcore. It will "round-robin" if there are
> more services than service-lcores to run them on. So agree that DPDK's
> default mappings will not suffer this issue.
> 
> 
> > But, these are all internal to DPDK and can be fixed.
> > Are there no performance tests in these components that we can run?
> >
> > > We should strive to not reduce datapath performance at all here.
> > >
> > >
> > > > > In this particular case, we have a counter for number of
> > > > > iterations that a
> > > > Which counter are you thinking about?
> > > > All the counters I checked are not atomic operations currently. If
> > > > we are going to use counters they have to be atomic, which means
> > > > additional cycles in the data path.
> > >
> > > I'll try to explain the concept better, take this example:
I tried to implement this algorithm, but there are a few issues; please see below.

> > >  - One service core is mapped to a MT_UNSAFE service, like event/sw
> > > pmd
> > >  - Application wants to map a 2nd lcore to the same service
> > >  - You point out that today there is a race over the lock
> > >     -- the lock is not taken if (num_mapped_lcores == 1)
> > >     -- this avoids an atomic lock/unlock on the datapath
> > >
> > > To achieve our desired goal;
> > >  - control thread doing mapping performs the following operations
> > >     -- write service->num_mapped_lcores++ (atomic not required, only
> > > single- writer allowed by APIs)
> > This has to be atomic because of rte_service_run_iter_on_app_lcore API.
> > Performance should be fine as this API is not called frequently. But
> > need to consider the implications of more than one thread updating
> num_mapped_cores.
> >
> > >     -- MFENCE (full st-ld barrier) to flush write, and force later
> > > loads to
> > issue
> > > after
> > I am not exactly sure what MFENCE on x86 does. On Arm platforms, the
> > full barrier (DMB ISH) just makes sure that memory operations are not
> > re-ordered around it. It does not say anything about when that store
> > is visible to other cores. It will be visible at some point in time to cores.
> > But, I do not think we need to be worried about flushing to memory.
> >
> > >     -- read the "calls_per_service" counter for each lcores, add them up.
> > This can be trimmed down to the single core the service is mapped to
> > currently, no need to add all the counters.
> 
> Correct - however that requires figuring out which lcore is running the service.
> Anyway, agree - it's an implementation detail as to exactly how we detect it.
> 
> >
> > >     ---- Wait :)
> > >     -- re-read the "calls_per_service", and ensure the count has changed.
Here, there is an assumption that the service core function is running on the service core. If the service core is not running, the code will be stuck in this polling loop.

I could not come up with a good way to check if the service core is running. Checking the app_runstate and comp_runstate is not enough as they just indicate that the service is ready to run. Using the counter 'calls_per_service' introduces race conditions.

Only way I can think of is asking the user to follow a specific sequence of APIs to ensure the service core is running before calling rte_service_map_lcore_set.


> > Basically, polling. This causes more traffic on the interconnect
> > between the cores. But might be ok since this API might not be called
> frequently.
> 
> Agree this will not be called frequently, and that some polling here will not be
> a problem.
> 
> 
> > >     ---- The fact that the calls_per_service has changed ensures the
> > service-
> > > lcore
> > >          has seen the new "num_mapped_cores" value, and has now
> > > taken the lock!
> > >     -- *now* it is safe to map the 2nd lcore to the service
> > >
> > > There is a caveat here that the increments to the "calls_per_service"
> > variable
> > > must become globally-observable. To force this immediately would
> > > require a write memory barrier, which could impact datapath
> > > performance. Given the service is now taking a lock, the unlock()
> > > thereof would ensure the "calls_per_service"
> > > is flushed to memory.
> > If we increment this variable only when the lock is held, we should be fine.
> > We could have a separate variable.
> 
> Sure, if a separate variable is preferred that's fine with me.
> 
> 
> > > Note: we could use calls_per_service, or add a new variable to the
> > > service struct.
> > > Writes to this do not need to be atomic, as it is either mapped to a
> > > single
> > core,
> > > or else there's a lock around it.
> > I think it is better to have a separate variable that is updated only
> > when the lock is held.
> > I do not see any change in API sequence. We do this hand-shake only if
> > the service is running (which is all controlled in the writer thread), correct?
> 
> Yes this increment can be localized to just the branch when the unlock()
> occurs, as that is the only time it could make a difference.
> 
> > This does not solve the problem with rte_service_run_iter_on_app_lcore
> > getting called on multiple cores concurrently for the same service.
> 
> Agreed. This "on_app_lcore" API was an addition required to enable unit-
> testing in a sane way, to run iterations of eg Eventdev PMD.
> 
> I am in favor of documenting that the application is responsible to ensure the
> service being run on a specific application lcore is not concurrently running on
> another application lcore.
> 
> 
> > > > > service has done. If this increments we know that the lcore
> > > > > running the service has re-entered the critical section, so
> > > > > would see an updated "needs atomic" flag.
> > > > >
> > > > > This approach may introduce a predictable branch on the
> > > > > datapath, however the cost of a predictable branch vs always
> > > > > taking an atomic is order(s?) of magnitude, so a branch is much
> preferred.
> > > > >
> > > > > It must be possible to avoid the datapath overhead using a
> > > > > scheme like
> > > this.
> > > > It
> > > > > will likely be more complex than your proposed change below,
> > > > > however if it avoids datapath performance drops I feel that a
> > > > > more complex solution is worth investigating at least.
> > >
> > > > I do not completely understand the approach you are proposing, may
> > > > be you can elaborate more.
> > >
> > > Expanded above, showing a possible solution that does not require
> > > additional atomics on the datapath.
> > >
> > >
> > > > But, it seems to be based on a counter approach. Following is my
> > > > assessment on what happens if we use a counter. Let us say we kept
> > > > track of how many cores are running the service currently. We need
> > > > an atomic counter other than 'num_mapped_cores'. Let us call that
> > > > counter
> > > 'num_current_cores'.
> > > > The code to call the service would look like below.
> > > >
> > > > 1) rte_atomic32_inc(&num_current_cores); /* this results in a full
> > > > memory barrier */
> > > > 2) if (__atomic_load_n(&num_current_cores, __ATOMIC_ACQUIRE) == 1)
> > > > {
> > > > /* rte_atomic_read is not enough here as it does not provide the
> > > > required memory barrier for any architecture */
> > > > 3) 	run_service(); /* Call the service */
> > > > 4) }
> > > > 5) rte_atomic32_sub(&num_current_cores); /* Calling
> > > > rte_atomic32_clear is not enough as it is not an atomic operation
> > > > and does not provide the required memory barrier */
> > > >
> > > > But the above code has race conditions in lines 1 and 2. It is
> > > > possible that none of the cores will ever get to run the service
> > > > as they all could simultaneously increment the counter. Hence
> > > > lines 1 and
> > > > 2 together need to be atomic, which is nothing but 'compare-exchange'
> > > operation.
> > > >
> > > > BTW, the current code has a bug where it calls
> > > > 'rte_atomic_clear(&s-
> > > > >execute_lock)', it is missing memory barriers which results in
> > > > >clearing the
> > > > execute_lock before the service has completed running. I suggest
> > > > changing the 'execute_lock' to rte_spinlock_t and using
> > > > rte_spinlock_try_lock and rte_spinlock_unlock APIs.
> > >
> > > I don't think a spinlock is what we want here:
> > >
> > > The idea is that a service-lcore can be mapped to multiple services.
> > > If one service is already being run (by another thread), we do not
> > > want to
> > spin
> > > here waiting for it to become "free" to run by this thread, it
> > > should
> > continue
> > > to the next service that it is mapped to.
> > Agree. I am suggesting to use 'rte_spinlock_try_lock' (does not spin)
> > which is nothing but 'compare-exchange'. Since the API is available,
> > we should make use of it instead of repeating the code.
> 
> Ah apologies, I misread the spinlock usage. Sure if the spinlock_t code is
> preferred I'm ok with a change. It would be clean to have a separate patch in
> the patchset to make this change, and have it later in the set than the changes
> for backporting to ease integration with stable branch.
> 
> 
> > > > > A unit test is required to validate a fix like this - although
> > > > > perhaps found
> > > > by
> > > > > inspection/review, a real-world test to validate would give confidence.
> > > > Agree, need to have a test case.
> > > >
> > > > >
> > > > >
> > > > > Thoughts on such an approach?
> > > > >
> > > <snip patch contents>



* Re: [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service
  2020-04-18  6:21                 ` Honnappa Nagarahalli
@ 2020-04-21 17:43                   ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-21 17:43 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, stable, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Saturday, April 18, 2020 7:22 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 09/12] service: avoid race condition for MT unsafe service
> 
> <snip>
<snip snip>
> > > > To achieve our desired goal;
> > > >  - control thread doing mapping performs the following operations
> > > >     -- write service->num_mapped_lcores++ (atomic not required, only
> > > > single- writer allowed by APIs)
> > > This has to be atomic because of rte_service_run_iter_on_app_lcore API.
> > > Performance should be fine as this API is not called frequently. But
> > > need to consider the implications of more than one thread updating
> > num_mapped_cores.
> > >
> > > >     -- MFENCE (full st-ld barrier) to flush write, and force later
> > > > loads to
> > > issue
> > > > after
> > > I am not exactly sure what MFENCE on x86 does. On Arm platforms, the
> > > full barrier (DMB ISH) just makes sure that memory operations are not
> > > re-ordered around it. It does not say anything about when that store
> > > is visible to other cores. It will be visible at some point in time to cores.
> > > But, I do not think we need to be worried about flushing to memory.
> > >
> > > >     -- read the "calls_per_service" counter for each lcores, add them up.
> > > This can be trimmed down to the single core the service is mapped to
> > > currently, no need to add all the counters.
> >
> > Correct - however that requires figuring out which lcore is running the service.
> > Anyway, agree - it's an implementation detail as to exactly how we detect it.
> >
> > >
> > > >     ---- Wait :)
> > > >     -- re-read the "calls_per_service", and ensure the count has changed.
> Here, there is an assumption that the service core function is running on the
> service core. If the service core is not running, the code will be stuck in this
> polling loop.

Right - we could add a timeout here, but that just moves the problem somewhere
else (the application), which now needs to handle error returns, and possibly
retries. It could be a possible solution. I'm not in favour of it at the moment,
but I need some more time to think about it.
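
To make the trade-off concrete, a bounded version of the polling loop could look like the sketch below. It is only an illustration: `wait_for_progress` and the `max_polls` bound are hypothetical, and a real implementation would more likely bound the wait by time than by iteration count.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Poll 'counter' until it differs from 'snapshot', giving up after
 * 'max_polls' reads. Returns 0 on progress, -ETIMEDOUT otherwise.
 */
static int
wait_for_progress(atomic_uint_fast64_t *counter, uint64_t snapshot,
		  unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (atomic_load_explicit(counter, memory_order_acquire) != snapshot)
			return 0;
	}
	return -ETIMEDOUT; /* the caller now has to handle this and retry */
}
```

The -ETIMEDOUT return is exactly the error path that the application would have to handle.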

> I could not come up with a good way to check if the service core is running.
> Checking the app_runstate and comp_runstate is not enough as they just
> indicate that the service is ready to run. Using the counter 'calls_per_service'
> introduces race conditions.
> 
> Only way I can think of is asking the user to follow a specific sequence of APIs to
> ensure the service core is running before calling rte_service_map_lcore_set.

Good point - I'm thinking about this - but haven't come to an obvious conclusion yet.
I'm considering other ways to detect the core is/isn't running, and also considering
just "hijacking" the service function pointer temporarily with a CAS, which gives
some new options on avoiding threads entering the critical section.

As above, I don't have a good solution yet.
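
Purely as an illustration of the function-pointer idea, and not a worked-out design, swapping the callback with a CAS could be sketched as below. All names (`struct svc_fn`, `svc_cb_parked`, `svc_cb_hijack`, `svc_cb_restore`, `svc_call`, `svc_cb_count`) are hypothetical, and the sketch does not wait for a call that is already in flight, which a real solution would still have to address.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef int (*svc_cb_t)(void *arg);

/* Hypothetical holder for the service callback pointer. */
struct svc_fn {
	_Atomic(svc_cb_t) cb;
};

/* Example "real" callback: counts invocations through 'arg'. */
static int
svc_cb_count(void *arg)
{
	(*(int *)arg)++;
	return 0;
}

/* Stub installed while the control thread updates the mapping. */
static int
svc_cb_parked(void *arg)
{
	(void)arg;
	return -EBUSY;
}

/* Control thread: replace 'real' with the parked stub via one CAS. */
static bool
svc_cb_hijack(struct svc_fn *s, svc_cb_t real)
{
	svc_cb_t expected = real;

	return atomic_compare_exchange_strong(&s->cb, &expected, svc_cb_parked);
}

/* Control thread: put the real callback back. */
static void
svc_cb_restore(struct svc_fn *s, svc_cb_t real)
{
	atomic_store(&s->cb, real);
}

/* Datapath: load the current pointer once, then call through it. */
static int
svc_call(struct svc_fn *s, void *arg)
{
	svc_cb_t cb = atomic_load(&s->cb);

	return cb(arg);
}
```

While the stub is installed, new calls return -EBUSY without entering the service, at the cost of an atomic load of the pointer on every call.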

<snip irrelevant to above discussion stuff>


* [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions Phil Yang
  2020-04-03 11:57       ` Van Haaren, Harry
  2020-04-05 21:35       ` Honnappa Nagarahalli
@ 2020-04-23 16:31       ` Phil Yang
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service Phil Yang
                           ` (7 more replies)
  2 siblings, 8 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-23 16:31 UTC (permalink / raw)
  To: harry.van.haaren, dev
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, Honnappa.Nagarahalli, gavin.hu, nd

The rte_atomic ops and rte_smp barriers enforce DMB barriers on aarch64.
Using c11 atomics with explicit memory ordering instead of the rte_atomic
ops and rte_smp barriers for inter-thread synchronization can improve
performance on aarch64 with no performance loss on x86.

This patchset contains:
1) fix race condition for MT unsafe service.
2) clean up redundant code.
3) use c11 atomics for service core lib to avoid unnecessary barriers.

v2:
Still waiting on Harry for the final solution on the MT unsafe race
condition issue. But I have incorporated the comments so far.
1. add 'Fixes' tag for bug-fix patches.
2. remove 'Fixes' tag for code cleanup patches.
3. remove unused parameter for service_dump_one function.
4. replace the execute_lock atomic CAS operation with spinlock_try_lock.
5. use c11 atomics with RELAXED memory ordering for num_mapped_cores.
6. relax barriers for guard variables runstate, comp_runstate and
app_runstate with c11 one-way barriers.
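
As a rough illustration of item 6, a guard variable can be published with a release store and checked with an acquire load instead of a full barrier on each side. The sketch below uses a hypothetical `struct svc_state` with a single `runstate` flag and is not the code in this series.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state: 'config' is plain data guarded by 'runstate'. */
struct svc_state {
	uint32_t config;
	atomic_uint runstate; /* 0 = stopped, 1 = running */
};

/* Writer: set up the data, then publish it with a release store. */
static void
svc_start(struct svc_state *s, uint32_t cfg)
{
	s->config = cfg;
	atomic_store_explicit(&s->runstate, 1, memory_order_release);
}

/* Reader: the acquire load makes 'config' visible once runstate is 1. */
static bool
svc_read_config(struct svc_state *s, uint32_t *out)
{
	if (atomic_load_explicit(&s->runstate, memory_order_acquire) != 1)
		return false;
	*out = s->config;
	return true;
}
```

On aarch64 this pairing typically compiles to store-release and load-acquire instructions rather than a DMB on each side.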

Honnappa Nagarahalli (2):
  service: fix race condition for MT unsafe service
  service: identify service running on another core correctly

Phil Yang (4):
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 atomics
  service: relax barriers with C11 atomics

 lib/librte_eal/common/rte_service.c | 234 +++++++++++++++++++-----------------
 lib/librte_eal/meson.build          |   4 +
 2 files changed, 130 insertions(+), 108 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
@ 2020-04-23 16:31         ` Phil Yang
  2020-04-29 16:51           ` Van Haaren, Harry
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 2/6] service: identify service running on another core correctly Phil Yang
                           ` (6 subsequent siblings)
  7 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-04-23 16:31 UTC (permalink / raw)
  To: harry.van.haaren, dev
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, Honnappa.Nagarahalli, gavin.hu, nd,
	Honnappa Nagarahalli, stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

An MT unsafe service might get configured to run on another core
while it is already running. This could result in the MT unsafe
service running on multiple cores simultaneously. Always use
'execute_lock' when the service is MT unsafe.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/common/rte_service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 70d17a5..b8c465e 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-- 
2.7.4



* [dpdk-dev] [PATCH v2 2/6] service: identify service running on another core correctly
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service Phil Yang
@ 2020-04-23 16:31         ` Phil Yang
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 3/6] service: remove rte prefix from static functions Phil Yang
                           ` (5 subsequent siblings)
  7 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-23 16:31 UTC (permalink / raw)
  To: harry.van.haaren, dev
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, Honnappa.Nagarahalli, gavin.hu, nd,
	Honnappa Nagarahalli, stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters and reduce the number of
instructions for all cases.

Cc: stable@dpdk.org
Fixes: 8d39d3e237c2 ("service: fix race in service on app lcore function")

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b8c465e..c89472b 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -360,7 +360,7 @@ rte_service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -374,7 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to indicate that the service
+	 * is running on a core.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -449,7 +439,7 @@ rte_service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.7.4
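
The control flow that this patch gives service_run() can be sketched in
isolation as below. This is only an illustration under stated assumptions:
struct demo_service, demo_callback() and demo_service_run() are hypothetical
stand-ins for the rte_service internals, and the lock uses GCC/Clang __atomic
built-ins rather than the rte_atomic32_cmpset() call the patch still uses at
this point in the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_service_spec_impl. */
struct demo_service {
	uint32_t execute_lock;	/* 0 = free, 1 = held */
	int mt_safe;		/* non-zero if the callback is MT safe */
	uint64_t calls;		/* bumped by the demo callback */
};

/* Hypothetical service callback: just counts invocations. */
static void
demo_callback(struct demo_service *s)
{
	s->calls++;
}

/* Take the lock only for an MT unsafe service when the caller asks for
 * serialization; otherwise run the callback directly.
 */
static int
demo_service_run(struct demo_service *s, uint32_t serialize_mt_unsafe)
{
	assert(s != NULL);

	if (!s->mt_safe && serialize_mt_unsafe == 1) {
		uint32_t expected = 0;

		/* Try to move the lock from 0 to 1; fail fast if it is held. */
		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected,
				1, 0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return -EBUSY;

		demo_callback(s);
		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
	} else {
		demo_callback(s);
	}

	return 0;
}
```

With serialize_mt_unsafe set to 0 the lock is never touched, so the
application is trusted to guarantee that no other core runs the service
at the same time.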



* [dpdk-dev] [PATCH v2 3/6] service: remove rte prefix from static functions
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service Phil Yang
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 2/6] service: identify service running on another core correctly Phil Yang
@ 2020-04-23 16:31         ` Phil Yang
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 4/6] service: remove redundant code Phil Yang
                           ` (4 subsequent siblings)
  7 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-23 16:31 UTC (permalink / raw)
  To: harry.van.haaren, dev
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, Honnappa.Nagarahalli, gavin.hu, nd

Remove the rte prefix from static functions.
Remove the unused parameter from the service_dump_one function.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 34 +++++++++++-----------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index c89472b..ed20702 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -340,7 +340,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -378,10 +378,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -425,14 +425,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (cs->runstate == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -693,9 +693,9 @@ rte_service_lcore_start(uint32_t lcore)
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
+	cs->runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -774,13 +774,9 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
-		     uint64_t all_cycles, uint32_t reset)
+service_dump_one(FILE *f, struct rte_service_spec_impl *s, uint32_t reset)
 {
 	/* avoid divide by zero */
-	if (all_cycles == 0)
-		all_cycles = 1;
-
 	int calls = 1;
 	if (s->calls != 0)
 		calls = s->calls;
@@ -807,7 +803,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, reset);
 	return 0;
 }
 
@@ -851,21 +847,13 @@ rte_service_dump(FILE *f, uint32_t id)
 	uint32_t i;
 	int print_one = (id != UINT32_MAX);
 
-	uint64_t total_cycles = 0;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i))
-			continue;
-		total_cycles += rte_services[i].cycles_spent;
-	}
-
 	/* print only the specified service */
 	if (print_one) {
 		struct rte_service_spec_impl *s;
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, reset);
 		return 0;
 	}
 
@@ -875,7 +863,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.7.4



* [dpdk-dev] [PATCH v2 4/6] service: remove redundant code
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
                           ` (2 preceding siblings ...)
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 3/6] service: remove rte prefix from static functions Phil Yang
@ 2020-04-23 16:31         ` Phil Yang
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 5/6] service: optimize with c11 atomics Phil Yang
                           ` (3 subsequent siblings)
  7 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-23 16:31 UTC (permalink / raw)
  To: harry.van.haaren, dev
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, Honnappa.Nagarahalli, gavin.hu, nd

The service ID validation is duplicated. Remove the redundant code
in the calling functions.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index ed20702..9c1a1d5 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -541,24 +541,12 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
-		return -EINVAL;
-
-	if (!lcore_states[lcore].is_service_core)
+	/* validate ID, or return error value */
+	if (sid >= RTE_SERVICE_NUM_MAX || !service_valid(sid) ||
+	    lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
@@ -587,19 +575,15 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.7.4
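
The consolidated validation in service_update() can be illustrated with a
small standalone sketch. All names here (the DEMO_ limits, the demo_ arrays
and demo_service_update()) are hypothetical and only mirror the shape of the
check; they are not the rte_service data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_SERVICE_NUM_MAX 64
#define DEMO_MAX_LCORE 128

/* One bit per service in a 64-bit mask, so the limit must fit. */
static_assert(DEMO_SERVICE_NUM_MAX <= 64, "service mask is 64 bits wide");

static uint8_t demo_service_registered[DEMO_SERVICE_NUM_MAX];
static uint8_t demo_is_service_core[DEMO_MAX_LCORE];
static uint64_t demo_service_mask[DEMO_MAX_LCORE];

/* Validate both IDs in one place, then update the per-lcore mask. */
static int
demo_service_update(uint32_t sid, uint32_t lcore, uint32_t enable)
{
	/* Range checks come first so the array lookups are always in bounds. */
	if (sid >= DEMO_SERVICE_NUM_MAX || !demo_service_registered[sid] ||
	    lcore >= DEMO_MAX_LCORE || !demo_is_service_core[lcore])
		return -EINVAL;

	uint64_t sid_mask = UINT64_C(1) << sid;

	if (enable)
		demo_service_mask[lcore] |= sid_mask;
	else
		demo_service_mask[lcore] &= ~sid_mask;

	return 0;
}
```

Because the short-circuit order puts each bounds check before the
corresponding array access, callers can pass an unvalidated ID straight
through, which is what lets the map_lcore_set/get wrappers drop their own
lookups.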



* [dpdk-dev] [PATCH v2 5/6] service: optimize with c11 atomics
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
                           ` (3 preceding siblings ...)
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 4/6] service: remove redundant code Phil Yang
@ 2020-04-23 16:31         ` Phil Yang
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 6/6] service: relax barriers with C11 atomics Phil Yang
                           ` (2 subsequent siblings)
  7 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-23 16:31 UTC (permalink / raw)
  To: harry.van.haaren, dev
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, Honnappa.Nagarahalli, gavin.hu, nd

The num_mapped_cores counter is used as a statistic. Use c11 atomics with
RELAXED ordering for num_mapped_cores instead of rte_atomic ops, which
enforce unnecessary barriers on aarch64.

Replace the execute_lock operations with rte_spinlock_trylock to avoid
duplicate code.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 32 ++++++++++++++++++--------------
 lib/librte_eal/meson.build          |  4 ++++
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 9c1a1d5..8cac265 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -20,6 +20,7 @@
 #include <rte_atomic.h>
 #include <rte_memory.h>
 #include <rte_malloc.h>
+#include <rte_spinlock.h>
 
 #include "eal_private.h"
 
@@ -38,11 +39,11 @@ struct rte_service_spec_impl {
 	/* public part of the struct */
 	struct rte_service_spec spec;
 
-	/* atomic lock that when set indicates a service core is currently
+	/* spin lock that when set indicates a service core is currently
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	rte_spinlock_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +55,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	uint32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -332,7 +333,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -375,11 +377,11 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		if (!rte_spinlock_trylock(&s->execute_lock))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		rte_spinlock_unlock(&s->execute_lock);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -415,11 +417,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to indicate that the service
 	 * is running on a core.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	return ret;
 }
@@ -556,19 +558,19 @@ service_update(uint32_t sid, uint32_t lcore,
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -616,7 +618,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -699,7 +702,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 0267c3b..c2d7a69 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -21,3 +21,7 @@ endif
 if cc.has_header('getopt.h')
 	cflags += ['-DHAVE_GETOPT_H', '-DHAVE_GETOPT', '-DHAVE_GETOPT_LONG']
 endif
+# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
+if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
+    ext_deps += cc.find_library('atomic')
+endif
-- 
2.7.4
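
A minimal sketch of the relaxed-counter pattern used for num_mapped_cores
follows. The demo_ names are hypothetical; only the __atomic built-ins and
the __ATOMIC_RELAXED ordering are taken from the patch. Relaxed ordering
keeps each update atomic but does not order it against surrounding memory
accesses, which is enough when the value is only a count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter standing in for num_mapped_cores. */
static uint32_t demo_num_mapped_cores;

static void
demo_map_core(void)
{
	/* Atomic increment; no ordering with other memory is requested. */
	__atomic_add_fetch(&demo_num_mapped_cores, 1, __ATOMIC_RELAXED);
}

static void
demo_unmap_core(void)
{
	/* Guard against underflow in this demo only. */
	assert(__atomic_load_n(&demo_num_mapped_cores, __ATOMIC_RELAXED) > 0);
	__atomic_sub_fetch(&demo_num_mapped_cores, 1, __ATOMIC_RELAXED);
}

static int
demo_is_mapped(void)
{
	return __atomic_load_n(&demo_num_mapped_cores, __ATOMIC_RELAXED) > 0;
}
```

If another thread needed to see data written before the increment, relaxed
ordering would not be sufficient and a release/acquire pair would be needed
instead.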



* [dpdk-dev] [PATCH v2 6/6] service: relax barriers with C11 atomics
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
                           ` (4 preceding siblings ...)
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 5/6] service: optimize with c11 atomics Phil Yang
@ 2020-04-23 16:31         ` Phil Yang
  2020-05-02  0:02         ` [dpdk-dev] [PATCH v3 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
  2020-05-05 21:17         ` [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
  7 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-23 16:31 UTC (permalink / raw)
  To: harry.van.haaren, dev
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, Honnappa.Nagarahalli, gavin.hu, nd

The runstate, comp_runstate and app_runstate are used as guard variables
in the service core lib. To guarantee the inter-thread visibility of
these guard variables, it uses rte_smp_r/wmb. This patch uses c11 atomic
built-ins to relax these barriers.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 115 ++++++++++++++++++++++++++----------
 1 file changed, 84 insertions(+), 31 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 8cac265..dbb8211 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -265,7 +265,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
 	rte_service_count++;
 
 	if (id_ptr)
@@ -282,7 +281,6 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
@@ -301,12 +299,17 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* comp_runstate acts as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in
+	 * service_run and service_runstate_get functions.
+	 */
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -316,12 +319,17 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* app_runstate acts as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in
+	 * service_run and service_runstate_get functions.
+	 */
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -330,15 +338,24 @@ rte_service_runstate_get(uint32_t id)
 {
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
-	rte_smp_rmb();
 
-	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING &&
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
+		int check_disabled = !(s->internal_flags &
+					SERVICE_F_START_CHECK);
+		int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
 					    __ATOMIC_RELAXED) > 0);
 
-	return (s->app_runstate == RUNSTATE_RUNNING) &&
-		(s->comp_runstate == RUNSTATE_RUNNING) &&
-		(check_disabled | lcore_mapped);
+		return (check_disabled | lcore_mapped);
+	} else
+		return 0;
+
 }
 
 static inline void
@@ -367,9 +384,15 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	if (!s)
 		return -EINVAL;
 
-	if (s->comp_runstate != RUNSTATE_RUNNING ||
-			s->app_runstate != RUNSTATE_RUNNING ||
-			!(service_mask & (UINT64_C(1) << i))) {
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		!(service_mask & (UINT64_C(1) << i))) {
 		cs->service_active_on_lcore[i] = 0;
 		return -ENOEXEC;
 	}
@@ -434,7 +457,12 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (cs->runstate == RUNSTATE_RUNNING) {
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	while (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -445,8 +473,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -614,15 +640,18 @@ rte_service_lcore_reset_all(void)
 		if (lcore_states[i].is_service_core) {
 			lcore_states[i].service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
-			lcore_states[i].runstate = RUNSTATE_STOPPED;
+			/* runstate acts as the guard variable. Use
+			 * store-release memory order here to synchronize
+			 * with load-acquire in runstate read functions.
+			 */
+			__atomic_store_n(&lcore_states[i].runstate,
+				RUNSTATE_STOPPED, __ATOMIC_RELEASE);
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
 		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
 				    __ATOMIC_RELAXED);
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -638,9 +667,11 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
@@ -655,7 +686,12 @@ rte_service_lcore_del(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate != RUNSTATE_STOPPED)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_STOPPED)
 		return -EBUSY;
 
 	set_lcore_state(lcore, ROLE_RTE);
@@ -674,13 +710,21 @@ rte_service_lcore_start(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate == RUNSTATE_RUNNING)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING)
 		return -EALREADY;
 
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	cs->runstate = RUNSTATE_RUNNING;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&cs->runstate, RUNSTATE_RUNNING, __ATOMIC_RELEASE);
 
 	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
@@ -693,7 +737,12 @@ rte_service_lcore_stop(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	if (lcore_states[lcore].runstate == RUNSTATE_STOPPED)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&lcore_states[lcore].runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
@@ -713,7 +762,11 @@ rte_service_lcore_stop(uint32_t lcore)
 			return -EBUSY;
 	}
 
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return 0;
 }
-- 
2.7.4
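
The guard-variable idea behind this patch can be sketched on its own as
follows. This is an illustration under assumptions, not the library code:
struct demo_core_state, demo_start() and demo_poll() are hypothetical. The
writer publishes its payload and then stores the guard with release order;
a reader that observes the guard with acquire order is guaranteed to also
see the payload written before it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_STOPPED = 0, DEMO_RUNNING = 1 };

/* Hypothetical stand-in for struct core_state. */
struct demo_core_state {
	uint64_t service_mask;	/* payload, written before the guard */
	uint8_t runstate;	/* guard variable */
};

/* Writer: set up the payload, then publish it with a store-release. */
static void
demo_start(struct demo_core_state *cs, uint64_t mask)
{
	assert(cs != NULL);
	cs->service_mask = mask;
	__atomic_store_n(&cs->runstate, DEMO_RUNNING, __ATOMIC_RELEASE);
}

/* Reader: the load-acquire pairs with the store-release above, so the
 * payload read after it cannot be reordered before the guard check.
 */
static int
demo_poll(struct demo_core_state *cs, uint64_t *mask_out)
{
	assert(cs != NULL && mask_out != NULL);
	if (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) != DEMO_RUNNING)
		return 0;

	*mask_out = cs->service_mask;
	return 1;
}
```

Compared with a plain store followed by rte_smp_wmb(), the ordering is
attached to the guard access itself, which on aarch64 allows one-way
barrier instructions instead of a standalone DMB.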



* [dpdk-dev] [PATCH v2] vhost: optimize broadcast rarp sync with c11 atomic
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 05/12] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
@ 2020-04-23 16:54       ` " Phil Yang
  2020-04-27  8:57         ` Maxime Coquelin
  2020-04-28 16:06         ` Maxime Coquelin
  0 siblings, 2 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-23 16:54 UTC (permalink / raw)
  To: maxime.coquelin, zhihong.wang, xiaolong.ye, dev
  Cc: thomas, Honnappa.Nagarahalli, gavin.hu, joyce.kong, nd

The rarp packet broadcast flag is synchronized with rte_atomic_XX APIs,
which use a full barrier, DMB, on aarch64. This patch optimizes it with
c11 atomic one-way barriers.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
---
v2:
split from the 'generic rte atomic APIs deprecate proposal' patchset.

 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/vhost_user.c |  7 +++----
 lib/librte_vhost/virtio_net.c | 16 +++++++++-------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2087d14..0e22125 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,7 +350,7 @@ struct virtio_net {
 	uint32_t		flags;
 	uint16_t		vhost_hlen;
 	/* to tell if we need broadcast rarp packet */
-	rte_atomic16_t		broadcast_rarp;
+	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
 	int			extbuf;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index bd1be01..857187d 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -2145,11 +2145,10 @@ vhost_user_send_rarp(struct virtio_net **pdev, struct VhostUserMsg *msg,
 	 * Set the flag to inject a RARP broadcast packet at
 	 * rte_vhost_dequeue_burst().
 	 *
-	 * rte_smp_wmb() is for making sure the mac is copied
-	 * before the flag is set.
+	 * __ATOMIC_RELEASE ordering is for making sure the mac is
+	 * copied before the flag is set.
 	 */
-	rte_smp_wmb();
-	rte_atomic16_set(&dev->broadcast_rarp, 1);
+	__atomic_store_n(&dev->broadcast_rarp, 1, __ATOMIC_RELEASE);
 	did = dev->vdpa_dev_id;
 	vdpa_dev = rte_vdpa_get_device(did);
 	if (vdpa_dev && vdpa_dev->ops->migration_done)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 37c47c7..fa10deb 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2203,6 +2203,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct virtio_net *dev;
 	struct rte_mbuf *rarp_mbuf = NULL;
 	struct vhost_virtqueue *vq;
+	int16_t success = 1;
 
 	dev = get_device(vid);
 	if (!dev)
@@ -2249,16 +2250,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	 *
 	 * broadcast_rarp shares a cacheline in the virtio_net structure
 	 * with some fields that are accessed during enqueue and
-	 * rte_atomic16_cmpset() causes a write if using cmpxchg. This could
-	 * result in false sharing between enqueue and dequeue.
+	 * __atomic_compare_exchange_n causes a write if the compare and
+	 * exchange is performed. This could result in false sharing between
+	 * enqueue and dequeue.
 	 *
 	 * Prevent unnecessary false sharing by reading broadcast_rarp first
-	 * and only performing cmpset if the read indicates it is likely to
-	 * be set.
+	 * and only performing compare and exchange if the read indicates it
+	 * is likely to be set.
 	 */
-	if (unlikely(rte_atomic16_read(&dev->broadcast_rarp) &&
-			rte_atomic16_cmpset((volatile uint16_t *)
-				&dev->broadcast_rarp.cnt, 1, 0))) {
+	if (unlikely(__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_ACQUIRE) &&
+			__atomic_compare_exchange_n(&dev->broadcast_rarp,
+			&success, 0, 0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))) {
 
 		rarp_mbuf = rte_net_make_rarp_packet(mbuf_pool, &dev->mac);
 		if (rarp_mbuf == NULL) {
-- 
2.7.4
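
The read-before-compare-exchange pattern described in the dequeue comment
can be sketched as below. demo_consume_flag() is a hypothetical helper, not
a vhost function; the memory orderings are copied from the patch. The cheap
load keeps the common case (flag clear) free of any write to the shared
cache line, and the compare and exchange ensures that only one caller
consumes a set flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if this caller consumed the flag (it was 1 and is now 0),
 * 0 otherwise.
 */
static int
demo_consume_flag(int16_t *flag)
{
	int16_t expected = 1;

	assert(flag != NULL);

	/* Read first: no store is issued while the flag is clear. */
	if (__atomic_load_n(flag, __ATOMIC_ACQUIRE) == 0)
		return 0;

	/* Flag looked set: try to clear it; only one caller can win. */
	return __atomic_compare_exchange_n(flag, &expected, 0, 0,
			__ATOMIC_RELEASE, __ATOMIC_RELAXED);
}
```

Note that on failure the built-in overwrites expected with the value it
found, so expected has to be re-initialized to 1 on every attempt, as the
local variable here is.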



* [dpdk-dev] [PATCH v2] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
  2020-03-23 18:48       ` Ananyev, Konstantin
@ 2020-04-23 17:16       ` " Phil Yang
  2020-04-23 17:45         ` Jerin Jacob
                           ` (2 more replies)
  1 sibling, 3 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-23 17:16 UTC (permalink / raw)
  To: konstantin.ananyev, dev
  Cc: thomas, bernard.iremonger, vladimir.medvedkin,
	Honnappa.Nagarahalli, gavin.hu, ruifeng.wang, nd

For SA outbound packets, rte_atomic64_add_return is used to generate
SQN atomically. This introduced an unnecessary full barrier by calling
the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
patch optimized it with c11 atomic and eliminated the expensive barrier
for aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
v2:
split from the "generic rte atomic APIs deprecate proposal" patchset.


 lib/librte_ipsec/ipsec_sqn.h | 3 ++-
 lib/librte_ipsec/meson.build | 5 +++++
 lib/librte_ipsec/sa.h        | 2 +-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
index 0c2f76a..e884af7 100644
--- a/lib/librte_ipsec/ipsec_sqn.h
+++ b/lib/librte_ipsec/ipsec_sqn.h
@@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
 
 	n = *num;
 	if (SQN_ATOMIC(sa))
-		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
+		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
+			__ATOMIC_RELAXED);
 	else {
 		sqn = sa->sqn.outb.raw + n;
 		sa->sqn.outb.raw = sqn;
diff --git a/lib/librte_ipsec/meson.build b/lib/librte_ipsec/meson.build
index fc69970..9335f28 100644
--- a/lib/librte_ipsec/meson.build
+++ b/lib/librte_ipsec/meson.build
@@ -6,3 +6,8 @@ sources = files('esp_inb.c', 'esp_outb.c', 'sa.c', 'ses.c', 'ipsec_sad.c')
 headers = files('rte_ipsec.h', 'rte_ipsec_group.h', 'rte_ipsec_sa.h', 'rte_ipsec_sad.h')
 
 deps += ['mbuf', 'net', 'cryptodev', 'security', 'hash']
+
+# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
+if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
+    ext_deps += cc.find_library('atomic')
+endif
diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
index d22451b..cab9a2e 100644
--- a/lib/librte_ipsec/sa.h
+++ b/lib/librte_ipsec/sa.h
@@ -120,7 +120,7 @@ struct rte_ipsec_sa {
 	 */
 	union {
 		union {
-			rte_atomic64_t atom;
+			uint64_t atom;
 			uint64_t raw;
 		} outb;
 		struct {
-- 
2.7.4
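
As a rough illustration of the change to esn_outb_update_sqn(), the sketch
below reserves a block of sequence numbers with a relaxed atomic add.
demo_reserve_sqn() is a hypothetical helper; only the __atomic_add_fetch()
call with __ATOMIC_RELAXED comes from the patch. The add stays atomic, so
two callers never receive overlapping ranges, but no barrier is implied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reserve n consecutive sequence numbers and return the last one in the
 * reserved range (the value of the counter after the add).
 */
static uint64_t
demo_reserve_sqn(uint64_t *sqn, uint32_t n)
{
	assert(sqn != NULL);
	return __atomic_add_fetch(sqn, n, __ATOMIC_RELAXED);
}
```

A caller that gets back value v after asking for n numbers owns the range
v - n + 1 through v. On 32-bit clang builds the 64-bit built-in may need
libatomic, which is what the meson.build hunk in this patch addresses.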



* Re: [dpdk-dev] [PATCH v2] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-04-23 17:16       ` [dpdk-dev] [PATCH v2] " Phil Yang
@ 2020-04-23 17:45         ` Jerin Jacob
  2020-04-24  4:49           ` Phil Yang
  2020-04-23 18:10         ` Ananyev, Konstantin
  2020-04-24  4:33         ` [dpdk-dev] [PATCH v3] " Phil Yang
  2 siblings, 1 reply; 219+ messages in thread
From: Jerin Jacob @ 2020-04-23 17:45 UTC (permalink / raw)
  To: Phil Yang
  Cc: Ananyev, Konstantin, dpdk-dev, Thomas Monjalon,
	Bernard Iremonger, Vladimir Medvedkin, Honnappa Nagarahalli,
	Gavin Hu, Ruifeng Wang (Arm Technology China),
	nd

On Thu, Apr 23, 2020 at 10:47 PM Phil Yang <phil.yang@arm.com> wrote:
>
> For SA outbound packets, rte_atomic64_add_return is used to generate
> SQN atomically. This introduced an unnecessary full barrier by calling
> the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> patch optimized it with c11 atomic and eliminated the expensive barrier
> for aarch64.
>
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>

> diff --git a/lib/librte_ipsec/meson.build b/lib/librte_ipsec/meson.build
> index fc69970..9335f28 100644
> --- a/lib/librte_ipsec/meson.build
> +++ b/lib/librte_ipsec/meson.build
> @@ -6,3 +6,8 @@ sources = files('esp_inb.c', 'esp_outb.c', 'sa.c', 'ses.c', 'ipsec_sad.c')
>  headers = files('rte_ipsec.h', 'rte_ipsec_group.h', 'rte_ipsec_sa.h', 'rte_ipsec_sad.h')
>
>  deps += ['mbuf', 'net', 'cryptodev', 'security', 'hash']
> +
> +# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
> +if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
> +    ext_deps += cc.find_library('atomic')
> +endif


The following patch has been merged in master now. You don't need this anymore.

commit da4eae278b56e698c64d0c39939a7a55c5b6abdd
Author: Pavan Nikhilesh <pbhagavatula@marvell.com>
Date:   Sun Apr 19 15:31:01 2020 +0530

    build: add global libatomic dependency for 32-bit clang

    Add libatomic as a global dependency when compiling for 32-bit using
    clang. As we need libatomic for 64-bit atomic ops.

    Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
    Acked-by: Bruce Richardson <bruce.richardson@intel.com>


* Re: [dpdk-dev] [PATCH v2] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-04-23 17:16       ` [dpdk-dev] [PATCH v2] " Phil Yang
  2020-04-23 17:45         ` Jerin Jacob
@ 2020-04-23 18:10         ` Ananyev, Konstantin
  2020-04-24  4:35           ` Phil Yang
  2020-04-24  4:33         ` [dpdk-dev] [PATCH v3] " Phil Yang
  2 siblings, 1 reply; 219+ messages in thread
From: Ananyev, Konstantin @ 2020-04-23 18:10 UTC (permalink / raw)
  To: Phil Yang, dev
  Cc: thomas, Iremonger, Bernard, Medvedkin, Vladimir,
	Honnappa.Nagarahalli, gavin.hu, ruifeng.wang, nd

> 
> For SA outbound packets, rte_atomic64_add_return is used to generate
> SQN atomically. This introduced an unnecessary full barrier by calling
> the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> patch optimized it with c11 atomic and eliminated the expensive barrier
> for aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---
> v2:
> split from the "generic rte atomic APIs deprecate proposal" patchset.
> 
> 
>  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
>  lib/librte_ipsec/meson.build | 5 +++++
>  lib/librte_ipsec/sa.h        | 2 +-
>  3 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
> index 0c2f76a..e884af7 100644
> --- a/lib/librte_ipsec/ipsec_sqn.h
> +++ b/lib/librte_ipsec/ipsec_sqn.h
> @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
> 
>  	n = *num;
>  	if (SQN_ATOMIC(sa))
> -		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
> +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> +			__ATOMIC_RELAXED);
>  	else {
>  		sqn = sa->sqn.outb.raw + n;
>  		sa->sqn.outb.raw = sqn;
> diff --git a/lib/librte_ipsec/meson.build b/lib/librte_ipsec/meson.build
> index fc69970..9335f28 100644
> --- a/lib/librte_ipsec/meson.build
> +++ b/lib/librte_ipsec/meson.build
> @@ -6,3 +6,8 @@ sources = files('esp_inb.c', 'esp_outb.c', 'sa.c', 'ses.c', 'ipsec_sad.c')
>  headers = files('rte_ipsec.h', 'rte_ipsec_group.h', 'rte_ipsec_sa.h', 'rte_ipsec_sad.h')
> 
>  deps += ['mbuf', 'net', 'cryptodev', 'security', 'hash']
> +
> +# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
> +if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
> +    ext_deps += cc.find_library('atomic')
> +endif
> diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
> index d22451b..cab9a2e 100644
> --- a/lib/librte_ipsec/sa.h
> +++ b/lib/librte_ipsec/sa.h
> @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
>  	 */
>  	union {
>  		union {
> -			rte_atomic64_t atom;
> +			uint64_t atom;
>  			uint64_t raw;
>  		} outb;
>  		struct {

It seems you missed my comments on the previous version, so I am repeating them here:

If we don't need rte_atomic64 here anymore,
then I think we can collapse the union to just:
uint64_t outb;

Konstantin


* [dpdk-dev] [PATCH v3] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-04-23 17:16       ` [dpdk-dev] [PATCH v2] " Phil Yang
  2020-04-23 17:45         ` Jerin Jacob
  2020-04-23 18:10         ` Ananyev, Konstantin
@ 2020-04-24  4:33         ` " Phil Yang
  2020-04-24 11:17           ` Ananyev, Konstantin
  2 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-04-24  4:33 UTC (permalink / raw)
  To: konstantin.ananyev, dev
  Cc: thomas, jerinj, akhil.goyal, bernard.iremonger,
	vladimir.medvedkin, Honnappa.Nagarahalli, gavin.hu, ruifeng.wang,
	nd

For SA outbound packets, rte_atomic64_add_return is used to generate
SQN atomically. Use c11 atomics with RELAXED ordering for outbound SQN
update instead of rte_atomic ops which enforce unnecessary barriers on
aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
v3:
1. Remove the now-redundant meson.build change, since the libatomic
dependency for 32-bit clang has been added globally.
2. Collapse union outb to uint64_t outb, as rte_atomic is no longer needed.

v2:
split from the "generic rte atomic APIs deprecate proposal" patchset.

 lib/librte_ipsec/ipsec_sqn.h | 6 +++---
 lib/librte_ipsec/sa.c        | 2 +-
 lib/librte_ipsec/sa.h        | 5 +----
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
index 0c2f76a..2636cb1 100644
--- a/lib/librte_ipsec/ipsec_sqn.h
+++ b/lib/librte_ipsec/ipsec_sqn.h
@@ -128,10 +128,10 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
 
 	n = *num;
 	if (SQN_ATOMIC(sa))
-		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
+		sqn = __atomic_add_fetch(&sa->sqn.outb, n, __ATOMIC_RELAXED);
 	else {
-		sqn = sa->sqn.outb.raw + n;
-		sa->sqn.outb.raw = sqn;
+		sqn = sa->sqn.outb + n;
+		sa->sqn.outb = sqn;
 	}
 
 	/* overflow */
diff --git a/lib/librte_ipsec/sa.c b/lib/librte_ipsec/sa.c
index ada195c..e59189d 100644
--- a/lib/librte_ipsec/sa.c
+++ b/lib/librte_ipsec/sa.c
@@ -283,7 +283,7 @@ esp_outb_init(struct rte_ipsec_sa *sa, uint32_t hlen)
 {
 	uint8_t algo_type;
 
-	sa->sqn.outb.raw = 1;
+	sa->sqn.outb = 1;
 
 	algo_type = sa->algo_type;
 
diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
index d22451b..508dd2b 100644
--- a/lib/librte_ipsec/sa.h
+++ b/lib/librte_ipsec/sa.h
@@ -119,10 +119,7 @@ struct rte_ipsec_sa {
 	 * place from other frequently accesed data.
 	 */
 	union {
-		union {
-			rte_atomic64_t atom;
-			uint64_t raw;
-		} outb;
+		uint64_t outb;
 		struct {
 			uint32_t rdidx; /* read index */
 			uint32_t wridx; /* write index */
-- 
2.7.4
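
The SQN update in this patch can be sketched in isolation. This is a
minimal illustration, assuming the GCC/Clang __atomic builtins; struct
sa_sketch, use_atomic, sqn_reserve() and sqn_reserve_check() are
hypothetical names for this sketch, not DPDK code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the outbound part of struct rte_ipsec_sa. */
struct sa_sketch {
	uint64_t outb;   /* plain integer; no rte_atomic64_t wrapper needed */
	int use_atomic;  /* stand-in for SQN_ATOMIC(sa) */
};

/* Reserve n sequence numbers and return the last one reserved. */
static inline uint64_t
sqn_reserve(struct sa_sketch *sa, uint32_t n)
{
	uint64_t sqn;

	if (sa->use_atomic)
		/*
		 * Atomic read-modify-write. RELAXED keeps the add itself
		 * atomic but imposes no ordering on other memory accesses.
		 */
		sqn = __atomic_add_fetch(&sa->outb, n, __ATOMIC_RELAXED);
	else {
		sqn = sa->outb + n;
		sa->outb = sqn;
	}
	return sqn;
}

/* Single-threaded check that both paths hand out the same numbers. */
static void
sqn_reserve_check(void)
{
	struct sa_sketch a = { .outb = 1, .use_atomic = 1 };
	struct sa_sketch b = { .outb = 1, .use_atomic = 0 };

	assert(sqn_reserve(&a, 4) == 5);
	assert(sqn_reserve(&b, 4) == 5);
	assert(a.outb == b.outb);
}
```

RELAXED is enough in this sketch because each caller only needs a unique
range of sequence numbers; nothing else is published through the counter.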



* Re: [dpdk-dev] [PATCH v2] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-04-23 18:10         ` Ananyev, Konstantin
@ 2020-04-24  4:35           ` Phil Yang
  0 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-24  4:35 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev
  Cc: thomas, Iremonger, Bernard, Medvedkin, Vladimir,
	Honnappa Nagarahalli, Gavin Hu, Ruifeng Wang, nd, nd

> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Friday, April 24, 2020 2:11 AM
> To: Phil Yang <Phil.Yang@arm.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; Iremonger, Bernard
> <bernard.iremonger@intel.com>; Medvedkin, Vladimir
> <vladimir.medvedkin@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu <Gavin.Hu@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2] ipsec: optimize with c11 atomic for sa outbound sqn
> update
> 
> >
> > For SA outbound packets, rte_atomic64_add_return is used to generate
> > SQN atomically. This introduced an unnecessary full barrier by calling
> > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > patch optimized it with c11 atomic and eliminated the expensive barrier
> > for aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > ---
> > v2:
> > split from the "generic rte atomic APIs deprecate proposal" patchset.
> >
> >
> >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> >  lib/librte_ipsec/meson.build | 5 +++++
> >  lib/librte_ipsec/sa.h        | 2 +-
> >  3 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
> > index 0c2f76a..e884af7 100644
> > --- a/lib/librte_ipsec/ipsec_sqn.h
> > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> uint32_t *num)
> >
> >  	n = *num;
> >  	if (SQN_ATOMIC(sa))
> > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> >sqn.outb.atom, n);
> > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > +			__ATOMIC_RELAXED);
> >  	else {
> >  		sqn = sa->sqn.outb.raw + n;
> >  		sa->sqn.outb.raw = sqn;
> > diff --git a/lib/librte_ipsec/meson.build b/lib/librte_ipsec/meson.build
> > index fc69970..9335f28 100644
> > --- a/lib/librte_ipsec/meson.build
> > +++ b/lib/librte_ipsec/meson.build
> > @@ -6,3 +6,8 @@ sources = files('esp_inb.c', 'esp_outb.c', 'sa.c', 'ses.c',
> 'ipsec_sad.c')
> >  headers = files('rte_ipsec.h', 'rte_ipsec_group.h', 'rte_ipsec_sa.h',
> 'rte_ipsec_sad.h')
> >
> >  deps += ['mbuf', 'net', 'cryptodev', 'security', 'hash']
> > +
> > +# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
> > +if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
> > +    ext_deps += cc.find_library('atomic')
> > +endif
> > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
> > index d22451b..cab9a2e 100644
> > --- a/lib/librte_ipsec/sa.h
> > +++ b/lib/librte_ipsec/sa.h
> > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> >  	 */
> >  	union {
> >  		union {
> > -			rte_atomic64_t atom;
> > +			uint64_t atom;
> >  			uint64_t raw;
> >  		} outb;
> >  		struct {
> 
> Seems  you missed my comments for previous version, so I put here:
> 
> If we don't need rte_atomic64 here anymore,
> then I think we can collapse the union to just:
> uint64_t outb;
My bad, I missed this comment.
Updated in v3.  Please review it.

Thanks,
Phil

> 
> Konstantin


* Re: [dpdk-dev] [PATCH v2] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-04-23 17:45         ` Jerin Jacob
@ 2020-04-24  4:49           ` Phil Yang
  0 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-24  4:49 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Ananyev, Konstantin, dpdk-dev, thomas, Bernard Iremonger,
	Vladimir Medvedkin, Honnappa Nagarahalli, Gavin Hu, Ruifeng Wang,
	nd, nd

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Friday, April 24, 2020 1:45 AM
> To: Phil Yang <Phil.Yang@arm.com>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dpdk-dev
> <dev@dpdk.org>; thomas@monjalon.net; Bernard Iremonger
> <bernard.iremonger@intel.com>; Vladimir Medvedkin
> <vladimir.medvedkin@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu <Gavin.Hu@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2] ipsec: optimize with c11 atomic for sa
> outbound sqn update
> 
> On Thu, Apr 23, 2020 at 10:47 PM Phil Yang <phil.yang@arm.com> wrote:
> >
> > For SA outbound packets, rte_atomic64_add_return is used to generate
> > SQN atomically. This introduced an unnecessary full barrier by calling
> > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > patch optimized it with c11 atomic and eliminated the expensive barrier
> > for aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> 
> > diff --git a/lib/librte_ipsec/meson.build b/lib/librte_ipsec/meson.build
> > index fc69970..9335f28 100644
> > --- a/lib/librte_ipsec/meson.build
> > +++ b/lib/librte_ipsec/meson.build
> > @@ -6,3 +6,8 @@ sources = files('esp_inb.c', 'esp_outb.c', 'sa.c', 'ses.c',
> 'ipsec_sad.c')
> >  headers = files('rte_ipsec.h', 'rte_ipsec_group.h', 'rte_ipsec_sa.h',
> 'rte_ipsec_sad.h')
> >
> >  deps += ['mbuf', 'net', 'cryptodev', 'security', 'hash']
> > +
> > +# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
> > +if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
> > +    ext_deps += cc.find_library('atomic')
> > +endif
> 
> 
> The following patch has been merged in master now. You don't need this
> anymore.
> 
> commit da4eae278b56e698c64d0c39939a7a55c5b6abdd
> Author: Pavan Nikhilesh <pbhagavatula@marvell.com>
> Date:   Sun Apr 19 15:31:01 2020 +0530
> 
>     build: add global libatomic dependency for 32-bit clang
> 
>     Add libatomic as a global dependency when compiling for 32-bit using
>     clang. As we need libatomic for 64-bit atomic ops.
> 
>     Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
>     Acked-by: Bruce Richardson <bruce.richardson@intel.com>

Great, we don't need to add it module by module anymore. 
Updated in v3. Thank you very much.

Thanks,
Phil




* Re: [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency for 32-bit clang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency for 32-bit clang Phil Yang
@ 2020-04-24  6:08       ` Phil Yang
  0 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-24  6:08 UTC (permalink / raw)
  To: Phil Yang, thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa Nagarahalli,
	Gavin Hu, Ruifeng Wang, Joyce Kong, nd, nd

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> Sent: Tuesday, March 17, 2020 9:18 AM
> To: thomas@monjalon.net; harry.van.haaren@intel.com;
> konstantin.ananyev@intel.com; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu <Gavin.Hu@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>
> Subject: [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency
> for 32-bit clang
> 
> When compiling with clang on 32-bit platforms, we are missing copies
> of 64-bit atomic functions. We can solve this by linking against
> libatomic for the drivers and libs which need those atomic ops.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---
>  lib/librte_eal/meson.build | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
> index 4be5118..3b10eae 100644
> --- a/lib/librte_eal/meson.build
> +++ b/lib/librte_eal/meson.build
> @@ -20,6 +20,12 @@ endif
>  if cc.has_function('getentropy', prefix : '#include <unistd.h>')
>  	cflags += '-DRTE_LIBEAL_USE_GETENTROPY'
>  endif
> +
> +# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
> +if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
> +    ext_deps += cc.find_library('atomic')
> +endif
> +

This should be unneeded since:
https://git.dpdk.org/dpdk/commit/?id=da4eae278b56e698c64d0c39939a7a55c5b6abdd

Thanks,
Phil Yang

>  sources = common_sources + env_sources
>  objs = common_objs + env_objs
>  headers = common_headers + env_headers
> --
> 2.7.4



* Re: [dpdk-dev] [PATCH v3 04/12] build: remove redundant code
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 04/12] build: remove redundant code Phil Yang
@ 2020-04-24  6:14       ` Phil Yang
  0 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-04-24  6:14 UTC (permalink / raw)
  To: Phil Yang, thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa Nagarahalli,
	Gavin Hu, Ruifeng Wang, Joyce Kong, nd, nd

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang
> Sent: Tuesday, March 17, 2020 9:18 AM
> To: thomas@monjalon.net; harry.van.haaren@intel.com;
> konstantin.ananyev@intel.com; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu <Gavin.Hu@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; nd <nd@arm.com>
> Subject: [dpdk-dev] [PATCH v3 04/12] build: remove redundant code
> 
> All these libs and drivers are built upon the eal lib. So when
> compiling with clang on 32-bit platforms linking against libatomic
> for the eal lib is sufficient. Remove the redundant code.
> 
> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  drivers/event/octeontx/meson.build  | 5 -----
>  drivers/event/octeontx2/meson.build | 5 -----
>  drivers/event/opdl/meson.build      | 5 -----
>  lib/librte_rcu/meson.build          | 5 -----
>  4 files changed, 20 deletions(-)

This should be unneeded since:
https://git.dpdk.org/dpdk/commit/?id=da4eae278b56e698c64d0c39939a7a55c5b6abdd

Abandon this patch.

Thanks,
Phil Yang

<snip>



* Re: [dpdk-dev] [PATCH v3] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-04-24  4:33         ` [dpdk-dev] [PATCH v3] " Phil Yang
@ 2020-04-24 11:17           ` Ananyev, Konstantin
  2020-05-09 21:51             ` Akhil Goyal
  0 siblings, 1 reply; 219+ messages in thread
From: Ananyev, Konstantin @ 2020-04-24 11:17 UTC (permalink / raw)
  To: Phil Yang, dev
  Cc: thomas, jerinj, akhil.goyal, Iremonger, Bernard, Medvedkin,
	Vladimir, Honnappa.Nagarahalli, gavin.hu, ruifeng.wang, nd

> 
> For SA outbound packets, rte_atomic64_add_return is used to generate
> SQN atomically. Use c11 atomics with RELAXED ordering for outbound SQN
> update instead of rte_atomic ops which enforce unnecessary barriers on
> aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.7.4



* Re: [dpdk-dev] [PATCH v2] vhost: optimize broadcast rarp sync with c11 atomic
  2020-04-23 16:54       ` [dpdk-dev] [PATCH v2] " Phil Yang
@ 2020-04-27  8:57         ` Maxime Coquelin
  2020-04-28 16:06         ` Maxime Coquelin
  1 sibling, 0 replies; 219+ messages in thread
From: Maxime Coquelin @ 2020-04-27  8:57 UTC (permalink / raw)
  To: Phil Yang, zhihong.wang, xiaolong.ye, dev
  Cc: thomas, Honnappa.Nagarahalli, gavin.hu, joyce.kong, nd



On 4/23/20 6:54 PM, Phil Yang wrote:
> The rarp packet broadcast flag is synchronized with rte_atomic_XX APIs
> which is a full barrier, DMB, on aarch64. This patch optimized it with
> c11 atomic one-way barrier.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Joyce Kong <joyce.kong@arm.com>
> ---
> v2:
> split from the 'generic rte atomic APIs deprecate proposal' patchset.
> 
>  lib/librte_vhost/vhost.h      |  2 +-
>  lib/librte_vhost/vhost_user.c |  7 +++----
>  lib/librte_vhost/virtio_net.c | 16 +++++++++-------
>  3 files changed, 13 insertions(+), 12 deletions(-)

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime
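
The "one-way barrier" idea in the commit message can be sketched as a
release/acquire flag handoff. This is only an illustration under stated
assumptions: struct dev_sketch, request_rarp(), claim_rarp() and
rarp_handoff_check() are hypothetical names, and the code is not taken
from the vhost patch itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the device state; not the vhost struct. */
struct dev_sketch {
	uint8_t mac[6];          /* payload written before the flag is raised */
	int16_t broadcast_rarp;  /* 1 means a RARP packet should be sent */
};

/* Control path: publish the payload, then raise the flag. */
static void
request_rarp(struct dev_sketch *dev, const uint8_t mac[6])
{
	for (int i = 0; i < 6; i++)
		dev->mac[i] = mac[i];
	/* RELEASE: the MAC stores above cannot move below this store. */
	__atomic_store_n(&dev->broadcast_rarp, 1, __ATOMIC_RELEASE);
}

/* Data path: returns 1 and copies the MAC if this caller claimed a request. */
static int
claim_rarp(struct dev_sketch *dev, uint8_t mac_out[6])
{
	int16_t expected = 1;

	/* Cheap early exit for the common case of no pending request. */
	if (__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_RELAXED) == 0)
		return 0;
	/* ACQUIRE on success: the MAC loads below cannot move above it. */
	if (!__atomic_compare_exchange_n(&dev->broadcast_rarp, &expected, 0,
			0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return 0;
	for (int i = 0; i < 6; i++)
		mac_out[i] = dev->mac[i];
	return 1;
}

/* Single-threaded check of the handoff logic. */
static void
rarp_handoff_check(void)
{
	struct dev_sketch dev = { .broadcast_rarp = 0 };
	const uint8_t mac[6] = { 2, 0, 0, 0, 0, 1 };
	uint8_t out[6] = { 0 };

	assert(claim_rarp(&dev, out) == 0);  /* nothing pending */
	request_rarp(&dev, mac);
	assert(claim_rarp(&dev, out) == 1);  /* claimed exactly once */
	assert(out[5] == 1);
	assert(claim_rarp(&dev, out) == 0);  /* flag was cleared */
}
```

Each side orders memory in one direction only, which is what lets aarch64
avoid the full barrier that the rte_atomic_XX calls imply.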



* Re: [dpdk-dev] [PATCH v2] vhost: optimize broadcast rarp sync with c11 atomic
  2020-04-23 16:54       ` [dpdk-dev] [PATCH v2] " Phil Yang
  2020-04-27  8:57         ` Maxime Coquelin
@ 2020-04-28 16:06         ` Maxime Coquelin
  1 sibling, 0 replies; 219+ messages in thread
From: Maxime Coquelin @ 2020-04-28 16:06 UTC (permalink / raw)
  To: Phil Yang, zhihong.wang, xiaolong.ye, dev
  Cc: thomas, Honnappa.Nagarahalli, gavin.hu, joyce.kong, nd



On 4/23/20 6:54 PM, Phil Yang wrote:
> The rarp packet broadcast flag is synchronized with rte_atomic_XX APIs
> which is a full barrier, DMB, on aarch64. This patch optimized it with
> c11 atomic one-way barrier.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Joyce Kong <joyce.kong@arm.com>
> ---
> v2:
> split from the 'generic rte atomic APIs deprecate proposal' patchset.
> 
>  lib/librte_vhost/vhost.h      |  2 +-
>  lib/librte_vhost/vhost_user.c |  7 +++----
>  lib/librte_vhost/virtio_net.c | 16 +++++++++-------
>  3 files changed, 13 insertions(+), 12 deletions(-)

Applied to dpdk-next-virtio/master

Thanks,
Maxime



* Re: [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service Phil Yang
@ 2020-04-29 16:51           ` Van Haaren, Harry
  2020-04-29 22:48             ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-04-29 16:51 UTC (permalink / raw)
  To: Phil Yang, dev
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Honnappa.Nagarahalli, gavin.hu, nd,
	Honnappa Nagarahalli, stable, Eads, Gage, Richardson, Bruce

> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Thursday, April 23, 2020 5:31 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Honnappa.Nagarahalli@arm.com;
> gavin.hu@arm.com; nd@arm.com; Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>; stable@dpdk.org
> Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe service
> 
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> The MT unsafe service might get configured to run on another core
> while the service is running currently. This might result in the
> MT unsafe service running on multiple cores simultaneously. Use
> 'execute_lock' always when the service is MT unsafe.
> 
> Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>
> ---

Thanks for spinning a new revision. Based on the previous ML discussion,
the "use service-run-count" approach to avoiding this race looks like a
complex solution.

I suggest the following:
1) Take the approach in this patch and always take the atomic, which
   fixes the race condition.
2) Add an API to service-cores that allows "committing" of mappings.
   Committing would imply that the mappings will not be changed in
   future. With runtime remapping removed from the equation, the
   existing branch-over-atomic optimization is valid again.

This would offer applications two options:
A) No application change: possible performance regression due to the
   atomic always being taken.
B) Call the "commit" API and regain the performance of previous DPDK
   versions.

Thoughts/opinions on the above?  I've flagged the rest of the patchset for review ASAP. Regards, -Harry
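
The two options above can be sketched roughly as follows. This is a
minimal illustration, assuming GCC/Clang __atomic builtins; no "commit"
API exists in this patch, so struct service_sketch, mapping_committed,
service_needs_lock(), service_run_sketch() and service_lock_check() are
all hypothetical names for this sketch only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct rte_service_spec_impl. */
struct service_sketch {
	int mt_safe;               /* service may run on several cores at once */
	int mapping_committed;     /* hypothetical: set by a "commit" call */
	uint32_t num_mapped_cores; /* cores mapped, not cores currently running */
	uint32_t execute_lock;     /* 0 = free, 1 = held */
	uint64_t calls;
};

/* Decide whether this run must take the execute lock. */
static int
service_needs_lock(const struct service_sketch *s)
{
	if (s->mt_safe)
		return 0;
	/*
	 * Option B: only once mappings are committed can the mapped-core
	 * count be trusted not to change underneath a running service.
	 */
	if (__atomic_load_n(&s->mapping_committed, __ATOMIC_ACQUIRE) &&
			s->num_mapped_cores <= 1)
		return 0;
	/* Option A: MT-unsafe service without commit always takes the lock. */
	return 1;
}

static int
service_run_sketch(struct service_sketch *s)
{
	const int locked = service_needs_lock(s);
	uint32_t expected = 0;

	if (locked && !__atomic_compare_exchange_n(&s->execute_lock,
			&expected, 1, 0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return -EBUSY;

	s->calls++; /* stands in for invoking the service callback */

	if (locked)
		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
	return 0;
}

/* Single-threaded check of the lock decision and the busy path. */
static void
service_lock_check(void)
{
	struct service_sketch s = { .num_mapped_cores = 1 };

	assert(service_needs_lock(&s) == 1);  /* not committed: lock */
	s.execute_lock = 1;                   /* pretend another core holds it */
	assert(service_run_sketch(&s) == -EBUSY);
	s.mapping_committed = 1;              /* committed, single core */
	assert(service_needs_lock(&s) == 0);
	assert(service_run_sketch(&s) == 0);
}
```

In this sketch the uncommitted path always pays for the compare-exchange,
while the committed single-core path skips it.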

>  lib/librte_eal/common/rte_service.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/common/rte_service.c
> b/lib/librte_eal/common/rte_service.c
> index 70d17a5..b8c465e 100644
> --- a/lib/librte_eal/common/rte_service.c
> +++ b/lib/librte_eal/common/rte_service.c
> @@ -50,6 +50,10 @@ struct rte_service_spec_impl {
>  	uint8_t internal_flags;
> 
>  	/* per service statistics */
> +	/* Indicates how many cores the service is mapped to run on.
> +	 * It does not indicate the number of cores the service is running
> +	 * on currently.
> +	 */
>  	rte_atomic32_t num_mapped_cores;
>  	uint64_t calls;
>  	uint64_t cycles_spent;
> @@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t
> service_mask,
> 
>  	cs->service_active_on_lcore[i] = 1;
> 
> -	/* check do we need cmpset, if MT safe or <= 1 core
> -	 * mapped, atomic ops are not required.
> -	 */
> -	const int use_atomics = (service_mt_safe(s) == 0) &&
> -				(rte_atomic32_read(&s->num_mapped_cores) > 1);
> -	if (use_atomics) {
> +	if (service_mt_safe(s) == 0) {
>  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
>  			return -EBUSY;
> 
> --
> 2.7.4



* Re: [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service
  2020-04-29 16:51           ` Van Haaren, Harry
@ 2020-04-29 22:48             ` Honnappa Nagarahalli
  2020-05-01 14:21               ` Van Haaren, Harry
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-04-29 22:48 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, dev
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Gavin Hu, nd, stable, Eads, Gage, Richardson,
	Bruce, Honnappa Nagarahalli, nd

Hi Harry,
	Thanks for getting back on this.

<snip>

> > Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe
> > service
> >
> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> > The MT unsafe service might get configured to run on another core
> > while the service is running currently. This might result in the MT
> > unsafe service running on multiple cores simultaneously. Use
> > 'execute_lock' always when the service is MT unsafe.
> >
> > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > ---
> 
> Thanks for spinning a new revision - based on ML discussion previously, it
> seems like the "use service-run-count" to avoid this race would be a complex
> solution.
> 
> Suggesting the following;
> 1) Take the approach as per this patch, to always take the atomic, fixing the
> race condition.
Ok

> 2) Add an API to service-cores, which allows "committing" of mappings.
> Committing the mapping would imply that the mappings will not be changed
> in future. With runtime-remapping being removed from the equation, the
> existing branch-over-atomic optimization is valid again.
Ok. Just to make sure I understand this:
a) On the data plane, if the commit API has been called (probably
   tracked in a new state variable) and num_mapped_cores is 1, there is
   no need to take the lock.
b) A possible implementation of the commit API would check whether
   num_mapped_cores for the service is 1 and set a variable to indicate
   that the lock is not required.

What do you think about asking the application to set the service
capability to MT_SAFE if it knows that the service will run on a single
core? This would require a documentation change but no additional code.

> 
> So this would offer applications two situations
> A) No application change: possible performance regression due to atomic
> always taken.
> B) Call "commit" API, and regain the performance as per previous DPDK
> versions.
> 
> Thoughts/opinions on the above?  I've flagged the rest of the patchset for
> review ASAP. Regards, -Harry
> 
> >  lib/librte_eal/common/rte_service.c | 11 +++++------
> >  1 file changed, 5 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/rte_service.c
> > b/lib/librte_eal/common/rte_service.c
> > index 70d17a5..b8c465e 100644
> > --- a/lib/librte_eal/common/rte_service.c
> > +++ b/lib/librte_eal/common/rte_service.c
> > @@ -50,6 +50,10 @@ struct rte_service_spec_impl {
> >  	uint8_t internal_flags;
> >
> >  	/* per service statistics */
> > +	/* Indicates how many cores the service is mapped to run on.
> > +	 * It does not indicate the number of cores the service is running
> > +	 * on currently.
> > +	 */
> >  	rte_atomic32_t num_mapped_cores;
> >  	uint64_t calls;
> >  	uint64_t cycles_spent;
> > @@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs,
> > uint64_t service_mask,
> >
> >  	cs->service_active_on_lcore[i] = 1;
> >
> > -	/* check do we need cmpset, if MT safe or <= 1 core
> > -	 * mapped, atomic ops are not required.
> > -	 */
> > -	const int use_atomics = (service_mt_safe(s) == 0) &&
> > -				(rte_atomic32_read(&s-
> >num_mapped_cores) > 1);
> > -	if (use_atomics) {
> > +	if (service_mt_safe(s) == 0) {
> >  		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
> >  			return -EBUSY;
> >
> > --
> > 2.7.4



* Re: [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service
  2020-04-29 22:48             ` Honnappa Nagarahalli
@ 2020-05-01 14:21               ` Van Haaren, Harry
  2020-05-01 14:56                 ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Van Haaren, Harry @ 2020-05-01 14:21 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, dev
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Gavin Hu, nd, stable, Eads, Gage, Richardson,
	Bruce, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Wednesday, April 29, 2020 11:49 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org; Eads, Gage <gage.eads@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/6] service: fix race condition for MT unsafe service
> 
> Hi Harry,
> 	Thanks for getting back on this.
> 
> <snip>
> 
> > > Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe
> > > service
> > >
> > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >
> > > The MT unsafe service might get configured to run on another core
> > > while the service is running currently. This might result in the MT
> > > unsafe service running on multiple cores simultaneously. Use
> > > 'execute_lock' always when the service is MT unsafe.
> > >
> > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > ---
> >
> > Thanks for spinning a new revision - based on ML discussion previously, it
> > seems like the "use service-run-count" to avoid this race would be a complex
> > solution.
> >
> > Suggesting the following;
> > 1) Take the approach as per this patch, to always take the atomic, fixing the
> > race condition.
> Ok

I've micro-benchmarked this code change inside the service cores autotest, and it
introduces around 35 cycles of overhead per service call.  This is not ideal, but it
is a bugfix, and by far the simplest method to fix this race condition. Having
discussed and investigated multiple other solutions, I believe this is the right solution.
Thanks Honnappa and Phil for identifying and driving a solution.

I suggest posting the benchmarking unit-test addition patch and integrating it
*before* the series under review here gets merged. This makes benchmarking
the "before bugfix" performance easier in future, should it be required.


> > 2) Add an API to service-cores, which allows "committing" of mappings.
> > Committing the mapping would imply that the mappings will not be changed
> > in future. With runtime-remapping being removed from the equation, the
> > existing branch-over-atomic optimization is valid again.
> Ok. Just to make sure I understand this:
> a) on the data plane, if commit API is called (probably a new state variable) and
> num_mapped_cores is set to 1, there is no need to take the lock.
> b) possible implementation of the commit API would check if
> num_mapped_cores for the service is set to 1 and set a variable to indicate that
> the lock is not required.
> 
> What do you think about asking the application to set  the service capability to
> MT_SAFE if it knows that the service will run on a single core? This would
> require us to change the documentation and does not require additional code.

That's a nice idea - I like that if applications want to micro-optimize around
the atomic, that they have a workaround/solution to do so, particularly that it
doesn't require code-changes and backporting.

Will review and send feedback on the patches themselves.
Regards, -Harry

> > So this would offer applications two situations
> > A) No application change: possible performance regression due to atomic
> > always taken.
> > B) Call "commit" API, and regain the performance as per previous DPDK
> > versions.
> >
> > Thoughts/opinions on the above?  I've flagged the rest of the patchset for
> > review ASAP. Regards, -Harry
> >
> > >  lib/librte_eal/common/rte_service.c | 11 +++++------
> > >  1 file changed, 5 insertions(+), 6 deletions(-)
<snip patch changes>


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service
  2020-05-01 14:21               ` Van Haaren, Harry
@ 2020-05-01 14:56                 ` Honnappa Nagarahalli
  2020-05-01 17:51                   ` Van Haaren, Harry
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-01 14:56 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, dev
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Gavin Hu, nd, stable, Eads, Gage, Richardson,
	Bruce, nd, Honnappa Nagarahalli, nd

<snip>
> >
> > > > Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe
> > > > service
> > > >
> > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > >
> > > > The MT unsafe service might get configured to run on another core
> > > > while the service is running currently. This might result in the
> > > > MT unsafe service running on multiple cores simultaneously. Use
> > > > 'execute_lock' always when the service is MT unsafe.
> > > >
> > > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > ---
> > >
> > > Thanks for spinning a new revision - based on ML discussion
> > > previously, it seems like the "use service-run-count" to avoid this
> > > race would be a complex solution.
> > >
> > > Suggesting the following;
> > > 1) Take the approach as per this patch, to always take the atomic,
> > > fixing the race condition.
> > Ok
> 
> I've micro-benchmarked this code change inside the service cores autotest,
> and it introduces around 35 cycles of overhead per service call.  This is not
> ideal, but given it's a bugfix, and by far the simplest method to fix this race-
> condition. Having discussed and investigated multiple other solutions, I
> believe this is the right solution.
> Thanks Honnappa and Phil for identifying and driving a solution.
You are welcome. Thank you for your timely responses.

> 
> I suggest to post the benchmarking unit-test addition patch, and integrate
> that
> *before* the series under review here gets merged? This makes
> benchmarking of the "before bugfix" performance in future easier should it be
> required.
I do not see any issues, would be happy to review. I think we still have time to catch up with RC2 (May 8th).
You had also mentioned calling out that the control plane APIs are not MT safe. Should I add that to this patch?

> 
> 
> > > 2) Add an API to service-cores, which allows "committing" of mappings.
> > > Committing the mapping would imply that the mappings will not be
> > > changed in future. With runtime-remapping being removed from the
> > > equation, the existing branch-over-atomic optimization is valid again.
> > Ok. Just to make sure I understand this:
> > a) on the data plane, if commit API is called (probably a new state
> > variable) and num_mapped_cores is set to 1, there is no need to take the
> lock.
> > b) possible implementation of the commit API would check if
> > num_mapped_cores for the service is set to 1 and set a variable to
> > indicate that the lock is not required.
> >
> > What do you think about asking the application to set  the service
> > capability to MT_SAFE if it knows that the service will run on a
> > single core? This would require us to change the documentation and does
> not require additional code.
> 
> That's a nice idea - I like that if applications want to micro-optimize around
> the atomic, that they have a workaround/solution to do so, particularly that it
> doesn't require code-changes and backporting.
> 
> Will send review and send feedback on the patches themselves.
> Regards, -Harry
> 
> > > So this would offer applications two situations
> > > A) No application change: possible performance regression due to
> > > atomic always taken.
> > > B) Call "commit" API, and regain the performance as per previous
> > > DPDK versions.
> > >
> > > Thoughts/opinions on the above?  I've flagged the rest of the
> > > patchset for review ASAP. Regards, -Harry
> > >
> > > >  lib/librte_eal/common/rte_service.c | 11 +++++------
> > > >  1 file changed, 5 insertions(+), 6 deletions(-)
> <snip patch changes>


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/6] service: fix race condition for MT unsafe service
  2020-05-01 14:56                 ` Honnappa Nagarahalli
@ 2020-05-01 17:51                   ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-05-01 17:51 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, dev
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Gavin Hu, nd, stable, Eads, Gage, Richardson,
	Bruce, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Friday, May 1, 2020 3:56 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; nd
> <nd@arm.com>; stable@dpdk.org; Eads, Gage <gage.eads@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; nd <nd@arm.com>;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/6] service: fix race condition for MT unsafe service
> 
> <snip>
> > >
> > > > > Subject: [PATCH v2 1/6] service: fix race condition for MT unsafe
> > > > > service
> > > > >
> > > > > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > >
> > > > > The MT unsafe service might get configured to run on another core
> > > > > while the service is running currently. This might result in the
> > > > > MT unsafe service running on multiple cores simultaneously. Use
> > > > > 'execute_lock' always when the service is MT unsafe.
> > > > >
> > > > > Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> > > > > Cc: stable@dpdk.org
> > > > >
> > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > Reviewed-by: Phil Yang <phil.yang@arm.com>
> > > > > ---
> > > >
> > > > Thanks for spinning a new revision - based on ML discussion
> > > > previously, it seems like the "use service-run-count" to avoid this
> > > > race would be a complex solution.
> > > >
> > > > Suggesting the following;
> > > > 1) Take the approach as per this patch, to always take the atomic,
> > > > fixing the race condition.
> > > Ok
> >
> > I've micro-benchmarked this code change inside the service cores autotest,
> > and it introduces around 35 cycles of overhead per service call.  This is not
> > ideal, but given it's a bugfix, and by far the simplest method to fix this race-
> > condition. Having discussed and investigated multiple other solutions, I
> > believe this is the right solution.
> > Thanks Honnappa and Phil for identifying and driving a solution.
> You are welcome. Thank you for your timely responses.

Perhaps not so timely after all ... I'll review C11 patches Tuesday morning,
it's a long weekend in Ireland!

> > I suggest to post the benchmarking unit-test addition patch, and integrate
> > that
> > *before* the series under review here gets merged? This makes
> > benchmarking of the "before bugfix" performance in future easier should it be
> > required.
> I do not see any issues, would be happy to review.

Thanks for volunteering, you're on CC when sent, for convenience:
http://patches.dpdk.org/patch/69651/

> I think we still have time to catch up with RC2 (May 8th).

Agree, merge into RC2 would be great.

> You had also mentioned about calling out that, the control plane APIs are not
> MT safe. Should I add that to this patch?

Yes, that'd be great.

<snip discussion details>

Cheers, -Harry

^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 0/6] use c11 atomics for service core lib
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
                           ` (5 preceding siblings ...)
  2020-04-23 16:31         ` [dpdk-dev] [PATCH v2 6/6] service: relax barriers with C11 atomics Phil Yang
@ 2020-05-02  0:02         ` Honnappa Nagarahalli
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 1/6] service: fix race condition for MT unsafe service Honnappa Nagarahalli
                             ` (5 more replies)
  2020-05-05 21:17         ` [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
  7 siblings, 6 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-02  0:02 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

The rte_atomic ops and rte_smp barriers enforce DMB barriers on aarch64.
Using c11 atomics with explicit memory ordering instead of the rte_atomic
ops and rte_smp barriers for inter-thread synchronization can improve
performance on aarch64 with no performance loss on x86.
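
For readers following the thread, here is a minimal sketch of what "explicit
memory ordering" means for a guard variable. It is not code from this series:
it uses the standard <stdatomic.h> interface as a stand-in for the GCC
__atomic built-ins, and the struct and function names (mailbox,
mailbox_publish, mailbox_consume) are hypothetical. The release store and
acquire load are one-way barriers, so the compiler is not forced to emit a
full barrier on either side.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical guard/payload pair, not taken from rte_service.c. */
struct mailbox {
	uint32_t payload;	/* plain data published under 'ready' */
	atomic_uint ready;	/* guard variable */
};

/* Writer side: the release store keeps the payload write from being
 * reordered after the store to the guard variable.
 */
static void
mailbox_publish(struct mailbox *m, uint32_t value)
{
	m->payload = value;
	atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Reader side: the acquire load keeps the payload read from being
 * reordered before the load of the guard variable.
 * Returns 1 and fills *out if a value was published, 0 otherwise.
 */
static int
mailbox_consume(struct mailbox *m, uint32_t *out)
{
	if (atomic_load_explicit(&m->ready, memory_order_acquire) == 0)
		return 0;
	*out = m->payload;
	return 1;
}
```

The same release/acquire pairing is what the v2 notes below call "c11 one-way
barriers" for runstate, comp_runstate and app_runstate.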

This patchset contains:
1) fix race condition for MT unsafe service.
2) clean up redundant code.
3) use c11 atomics for service core lib to avoid unnecessary barriers.

v2:
Still waiting on Harry for the final solution on the MT unsafe race
condition issue. But I have incorporated the comments so far.
1. add 'Fixes' tag for bug-fix patches.
2. remove 'Fixes' tag for code cleanup patches.
3. remove unused parameter for service_dump_one function.
4. replace the execute_lock atomic CAS operation with rte_spinlock_trylock.
5. use c11 atomics with RELAXED memory ordering for num_mapped_cores.
6. relax barriers for guard variables runstate, comp_runstate and
   app_runstate with c11 one-way barriers.

v3:
Sending this version since Phil is on holiday.
1. Updated the API documentation to indicate how the locking
   can be avoided.

Honnappa Nagarahalli (2):
  service: fix race condition for MT unsafe service
  service: identify service running on another core correctly

Phil Yang (4):
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 atomics
  service: relax barriers with C11 atomics

 lib/librte_eal/common/rte_service.c           | 234 ++++++++++--------
 lib/librte_eal/include/rte_service.h          |   8 +-
 .../include/rte_service_component.h           |   6 +-
 lib/librte_eal/meson.build                    |   4 +
 4 files changed, 141 insertions(+), 111 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 1/6] service: fix race condition for MT unsafe service
  2020-05-02  0:02         ` [dpdk-dev] [PATCH v3 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
@ 2020-05-02  0:02           ` Honnappa Nagarahalli
  2020-05-05 14:48             ` Van Haaren, Harry
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 2/6] service: identify service running on another core correctly Honnappa Nagarahalli
                             ` (4 subsequent siblings)
  5 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-02  0:02 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd, stable

The MT unsafe service might get configured to run on another core
while the service is running currently. This might result in the
MT unsafe service running on multiple cores simultaneously. Use
'execute_lock' always when the service is MT unsafe.

If the service is known to be mapped to a single lcore,
setting the service capability to MT safe will avoid taking
the lock and improve the performance.
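
The behaviour described above can be sketched in isolation as follows. This
is a simplified model under stated assumptions, not the code in
rte_service.c: toy_service and toy_service_run are hypothetical names, the
mt_safe field stands in for the RTE_SERVICE_CAP_MT_SAFE capability, and a C11
atomic_flag stands in for execute_lock.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical, cut-down stand-in for struct rte_service_spec_impl. */
struct toy_service {
	int mt_safe;			/* models RTE_SERVICE_CAP_MT_SAFE */
	atomic_flag execute_lock;	/* models execute_lock */
	unsigned int calls;		/* models the service callback */
};

/* Run one iteration of the service. An MT unsafe service always takes
 * the lock, regardless of how many cores it is mapped to; an MT safe
 * service never takes it.
 */
static int
toy_service_run(struct toy_service *s)
{
	if (!s->mt_safe) {
		if (atomic_flag_test_and_set_explicit(&s->execute_lock,
						      memory_order_acquire))
			return -EBUSY;	/* another core is running it */

		s->calls++;
		atomic_flag_clear_explicit(&s->execute_lock,
					   memory_order_release);
	} else {
		/* No lock: the service itself must tolerate concurrency. */
		s->calls++;
	}

	return 0;
}
```

In this model, marking a single-lcore service as MT safe only skips the
lock; it remains the application's job to guarantee that the mapping really
stays on one lcore.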

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/common/rte_service.c            | 11 +++++------
 lib/librte_eal/include/rte_service.h           |  8 ++++++--
 lib/librte_eal/include/rte_service_component.h |  6 +++++-
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 70d17a5d7..b8c465eb9 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
index d8701dd4c..3a1c735c5 100644
--- a/lib/librte_eal/include/rte_service.h
+++ b/lib/librte_eal/include/rte_service.h
@@ -104,12 +104,16 @@ int32_t rte_service_probe_capability(uint32_t id, uint32_t capability);
  * Each core can be added or removed from running a specific service. This
  * function enables or disables *lcore* to run *service_id*.
  *
- * If multiple cores are enabled on a service, an atomic is used to ensure that
- * only one cores runs the service at a time. The exception to this is when
+ * If multiple cores are enabled on a service, a lock is used to ensure that
+ * only one core runs the service at a time. The exception to this is when
  * a service indicates that it is multi-thread safe by setting the capability
  * called RTE_SERVICE_CAP_MT_SAFE. With the multi-thread safe capability set,
  * the service function can be run on multiple threads at the same time.
  *
+ * If the service is known to be mapped to a single lcore, setting the
+ * capability of the service to RTE_SERVICE_CAP_MT_SAFE can achieve
+ * better performance by avoiding the use of lock.
+ *
  * @param service_id the service to apply the lcore to
  * @param lcore The lcore that will be mapped to service
  * @param enable Zero to unmap or disable the core, non-zero to enable
diff --git a/lib/librte_eal/include/rte_service_component.h b/lib/librte_eal/include/rte_service_component.h
index 16eab79ee..b75aba11b 100644
--- a/lib/librte_eal/include/rte_service_component.h
+++ b/lib/librte_eal/include/rte_service_component.h
@@ -43,7 +43,7 @@ struct rte_service_spec {
 /**
  * Register a new service.
  *
- * A service represents a component that the requires CPU time periodically to
+ * A service represents a component that requires CPU time periodically to
  * achieve its purpose.
  *
  * For example the eventdev SW PMD requires CPU cycles to perform its
@@ -56,6 +56,10 @@ struct rte_service_spec {
  * *rte_service_component_runstate_set*, which indicates that the service
  * component is ready to be executed.
  *
+ * If the service is known to be mapped to a single lcore, setting the
+ * capability of the service to RTE_SERVICE_CAP_MT_SAFE can achieve
+ * better performance.
+ *
  * @param spec The specification of the service to register
  * @param[out] service_id A pointer to a uint32_t, which will be filled in
  *             during registration of the service. It is set to the integers
-- 
2.17.1


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 2/6] service: identify service running on another core correctly
  2020-05-02  0:02         ` [dpdk-dev] [PATCH v3 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 1/6] service: fix race condition for MT unsafe service Honnappa Nagarahalli
@ 2020-05-02  0:02           ` Honnappa Nagarahalli
  2020-05-05 14:48             ` Van Haaren, Harry
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 3/6] service: remove rte prefix from static functions Honnappa Nagarahalli
                             ` (3 subsequent siblings)
  5 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-02  0:02 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd, stable

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters, and reduce the number of
instructions for all cases.
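
As a rough illustration of the change described above, the sketch below
models the new call path: the app-lcore entry point only bumps the mapped
count around the call and leaves the decision to serialize to the run
function. The names (model_service, model_service_run,
model_run_iter_on_app_lcore) are hypothetical, and <stdatomic.h> is used in
place of the rte_atomic32 calls that the actual diff keeps at this point in
the series.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for struct rte_service_spec_impl. */
struct model_service {
	int mt_safe;
	atomic_flag execute_lock;
	atomic_uint num_mapped_cores;
	unsigned int calls;
};

/* Take the lock only for an MT unsafe service whose caller asked for
 * serialization.
 */
static int32_t
model_service_run(struct model_service *s, uint32_t serialize_mt_unsafe)
{
	if (!s->mt_safe && serialize_mt_unsafe == 1) {
		if (atomic_flag_test_and_set_explicit(&s->execute_lock,
						      memory_order_acquire))
			return -EBUSY;

		s->calls++;
		atomic_flag_clear_explicit(&s->execute_lock,
					   memory_order_release);
	} else {
		s->calls++;
	}

	return 0;
}

/* No early -EBUSY based on the mapped count: a service that is merely
 * mapped to another core, but not running there, can still run here.
 */
static int32_t
model_run_iter_on_app_lcore(struct model_service *s,
			    uint32_t serialize_mt_unsafe)
{
	atomic_fetch_add(&s->num_mapped_cores, 1);
	int32_t ret = model_service_run(s, serialize_mt_unsafe);
	atomic_fetch_sub(&s->num_mapped_cores, 1);

	return ret;
}
```

With the count-based check gone, -EBUSY is returned only when the lock is
actually held, which is the spurious-failure case the commit message refers
to.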

Cc: stable@dpdk.org
Fixes: 8d39d3e237c2 ("service: fix race in service on app lcore function")

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b8c465eb9..c89472b83 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -360,7 +360,7 @@ rte_service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -374,7 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to indicate that the service
+	 * is running on a core.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -449,7 +439,7 @@ rte_service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 3/6] service: remove rte prefix from static functions
  2020-05-02  0:02         ` [dpdk-dev] [PATCH v3 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 1/6] service: fix race condition for MT unsafe service Honnappa Nagarahalli
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 2/6] service: identify service running on another core correctly Honnappa Nagarahalli
@ 2020-05-02  0:02           ` Honnappa Nagarahalli
  2020-05-05 14:48             ` Van Haaren, Harry
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 4/6] service: remove redundant code Honnappa Nagarahalli
                             ` (2 subsequent siblings)
  5 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-02  0:02 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

From: Phil Yang <phil.yang@arm.com>

clean up rte prefix from static functions.
remove unused parameter for service_dump_one function.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 34 ++++++++++-------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index c89472b83..ed2070267 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -340,7 +340,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -378,10 +378,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -425,14 +425,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (cs->runstate == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -693,9 +693,9 @@ rte_service_lcore_start(uint32_t lcore)
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
+	cs->runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -774,13 +774,9 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
-		     uint64_t all_cycles, uint32_t reset)
+service_dump_one(FILE *f, struct rte_service_spec_impl *s, uint32_t reset)
 {
 	/* avoid divide by zero */
-	if (all_cycles == 0)
-		all_cycles = 1;
-
 	int calls = 1;
 	if (s->calls != 0)
 		calls = s->calls;
@@ -807,7 +803,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, reset);
 	return 0;
 }
 
@@ -851,21 +847,13 @@ rte_service_dump(FILE *f, uint32_t id)
 	uint32_t i;
 	int print_one = (id != UINT32_MAX);
 
-	uint64_t total_cycles = 0;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i))
-			continue;
-		total_cycles += rte_services[i].cycles_spent;
-	}
-
 	/* print only the specified service */
 	if (print_one) {
 		struct rte_service_spec_impl *s;
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, reset);
 		return 0;
 	}
 
@@ -875,7 +863,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.17.1


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 4/6] service: remove redundant code
  2020-05-02  0:02         ` [dpdk-dev] [PATCH v3 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
                             ` (2 preceding siblings ...)
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 3/6] service: remove rte prefix from static functions Honnappa Nagarahalli
@ 2020-05-02  0:02           ` Honnappa Nagarahalli
  2020-05-05 14:48             ` Van Haaren, Harry
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 5/6] service: optimize with c11 atomics Honnappa Nagarahalli
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 6/6] service: relax barriers with C11 atomics Honnappa Nagarahalli
  5 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-02  0:02 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

From: Phil Yang <phil.yang@arm.com>

The service id validation is duplicated, remove the redundant code
in the calling functions.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index ed2070267..9c1a1d5cd 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -541,24 +541,12 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
-		return -EINVAL;
-
-	if (!lcore_states[lcore].is_service_core)
+	/* validate ID, or return error value */
+	if (sid >= RTE_SERVICE_NUM_MAX || !service_valid(sid) ||
+	    lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
@@ -587,19 +575,15 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 219+ messages in thread

* [dpdk-dev] [PATCH v3 5/6] service: optimize with c11 atomics
  2020-05-02  0:02         ` [dpdk-dev] [PATCH v3 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
                             ` (3 preceding siblings ...)
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 4/6] service: remove redundant code Honnappa Nagarahalli
@ 2020-05-02  0:02           ` Honnappa Nagarahalli
  2020-05-05 14:48             ` Van Haaren, Harry
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 6/6] service: relax barriers with C11 atomics Honnappa Nagarahalli
  5 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-02  0:02 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

From: Phil Yang <phil.yang@arm.com>

num_mapped_cores is used as a statistic. Use c11 atomics with
RELAXED ordering for num_mapped_cores instead of rte_atomic ops, which
enforce unnecessary barriers on aarch64.

Replace the execute_lock operations with rte_spinlock_trylock to avoid
duplicate code.
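
The two ideas in this commit message can be sketched independently of DPDK as
shown below. This is an illustrative model only: the patch itself uses the
GCC __atomic built-ins and rte_spinlock_trylock, whereas the sketch uses
<stdatomic.h>, and the names mapped_stat, stat_map, stat_unmap,
stat_is_mapped, toy_lock, toy_trylock and toy_unlock are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* A counter used purely as a statistic: no other data is published
 * through it, so relaxed ordering is sufficient.
 */
struct mapped_stat {
	_Atomic uint32_t num_mapped_cores;
};

static void
stat_map(struct mapped_stat *s)
{
	atomic_fetch_add_explicit(&s->num_mapped_cores, 1,
				  memory_order_relaxed);
}

static void
stat_unmap(struct mapped_stat *s)
{
	atomic_fetch_sub_explicit(&s->num_mapped_cores, 1,
				  memory_order_relaxed);
}

static int
stat_is_mapped(struct mapped_stat *s)
{
	return atomic_load_explicit(&s->num_mapped_cores,
				    memory_order_relaxed) > 0;
}

/* A toy try-lock in the spirit of a spinlock trylock: acquire on
 * success, release on unlock, and no spinning on failure.
 */
struct toy_lock {
	atomic_int locked;
};

/* Returns 1 if the lock was taken, 0 if it was already held. */
static int
toy_trylock(struct toy_lock *l)
{
	return atomic_exchange_explicit(&l->locked, 1,
					memory_order_acquire) == 0;
}

static void
toy_unlock(struct toy_lock *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The counter needs only atomicity, while the lock needs acquire/release
ordering because it protects the service callback's data.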

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 32 ++++++++++++++++-------------
 lib/librte_eal/meson.build          |  4 ++++
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 9c1a1d5cd..8cac265c9 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -20,6 +20,7 @@
 #include <rte_atomic.h>
 #include <rte_memory.h>
 #include <rte_malloc.h>
+#include <rte_spinlock.h>
 
 #include "eal_private.h"
 
@@ -38,11 +39,11 @@ struct rte_service_spec_impl {
 	/* public part of the struct */
 	struct rte_service_spec spec;
 
-	/* atomic lock that when set indicates a service core is currently
+	/* spin lock that when set indicates a service core is currently
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	rte_spinlock_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +55,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	uint32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -332,7 +333,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -375,11 +377,11 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		if (!rte_spinlock_trylock(&s->execute_lock))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		rte_spinlock_unlock(&s->execute_lock);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -415,11 +417,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to indicate that the service
 	 * is running on a core.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	return ret;
 }
@@ -556,19 +558,19 @@ service_update(uint32_t sid, uint32_t lcore,
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -616,7 +618,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -699,7 +702,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 0267c3b9d..c2d7a6954 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -21,3 +21,7 @@ endif
 if cc.has_header('getopt.h')
 	cflags += ['-DHAVE_GETOPT_H', '-DHAVE_GETOPT', '-DHAVE_GETOPT_LONG']
 endif
+# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
+if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
+    ext_deps += cc.find_library('atomic')
+endif
-- 
2.17.1


* [dpdk-dev] [PATCH v3 6/6] service: relax barriers with C11 atomics
  2020-05-02  0:02         ` [dpdk-dev] [PATCH v3 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
                             ` (4 preceding siblings ...)
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 5/6] service: optimize with c11 atomics Honnappa Nagarahalli
@ 2020-05-02  0:02           ` Honnappa Nagarahalli
  2020-05-05 14:48             ` Van Haaren, Harry
  5 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-02  0:02 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

From: Phil Yang <phil.yang@arm.com>

The runstate, comp_runstate and app_runstate are used as guard variables
in the service core lib. To guarantee inter-thread visibility of these
guard variables, the library uses rte_smp_rmb/rte_smp_wmb. This patch
uses C11 atomic built-ins to relax these barriers.
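
The guard-variable pattern can be sketched roughly as follows. This is an
illustration under assumptions, not the library code: struct demo_service,
its config field, and the demo_* functions are hypothetical. The writer
fills in the payload and then publishes it with a store-release; a reader
whose load-acquire observes RUNSTATE_RUNNING is guaranteed to also see the
payload written before that store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { RUNSTATE_STOPPED = 0, RUNSTATE_RUNNING = 1 };

/* Hypothetical service: 'config' is the payload, 'runstate' guards it. */
struct demo_service {
	uint64_t config;
	int8_t runstate;
};

/* Writer side: write the payload first, then publish with store-release. */
static void
demo_start(struct demo_service *s, uint64_t config)
{
	assert(s != NULL);
	s->config = config;
	__atomic_store_n(&s->runstate, RUNSTATE_RUNNING, __ATOMIC_RELEASE);
}

/* Reader side: returns 1 and copies the payload if the service is
 * running, 0 otherwise. The load-acquire pairs with the store-release
 * in demo_start().
 */
static int
demo_poll(struct demo_service *s, uint64_t *config_out)
{
	assert(s != NULL && config_out != NULL);
	if (__atomic_load_n(&s->runstate,
			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING)
		return 0;
	*config_out = s->config;
	return 1;
}
```

Unlike a full rte_smp_wmb/rte_smp_rmb pair, the ordering here is one-way
and attached to the guard variable itself.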

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 115 ++++++++++++++++++++--------
 1 file changed, 84 insertions(+), 31 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 8cac265c9..dbb821139 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -265,7 +265,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
 	rte_service_count++;
 
 	if (id_ptr)
@@ -282,7 +281,6 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
@@ -301,12 +299,17 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* comp_runstate acts as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in the
+	 * service_run and rte_service_runstate_get functions.
+	 */
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -316,12 +319,17 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* app_runstate acts as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in the
+	 * service_run and rte_service_runstate_get functions.
+	 */
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -330,15 +338,24 @@ rte_service_runstate_get(uint32_t id)
 {
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
-	rte_smp_rmb();
 
-	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING &&
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
+		int check_disabled = !(s->internal_flags &
+					SERVICE_F_START_CHECK);
+		int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
 					    __ATOMIC_RELAXED) > 0);
 
-	return (s->app_runstate == RUNSTATE_RUNNING) &&
-		(s->comp_runstate == RUNSTATE_RUNNING) &&
-		(check_disabled | lcore_mapped);
+		return (check_disabled | lcore_mapped);
+	} else
+		return 0;
+
 }
 
 static inline void
@@ -367,9 +384,15 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	if (!s)
 		return -EINVAL;
 
-	if (s->comp_runstate != RUNSTATE_RUNNING ||
-			s->app_runstate != RUNSTATE_RUNNING ||
-			!(service_mask & (UINT64_C(1) << i))) {
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		!(service_mask & (UINT64_C(1) << i))) {
 		cs->service_active_on_lcore[i] = 0;
 		return -ENOEXEC;
 	}
@@ -434,7 +457,12 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (cs->runstate == RUNSTATE_RUNNING) {
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	while (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -445,8 +473,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -614,15 +640,18 @@ rte_service_lcore_reset_all(void)
 		if (lcore_states[i].is_service_core) {
 			lcore_states[i].service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
-			lcore_states[i].runstate = RUNSTATE_STOPPED;
+			/* runstate acts as the guard variable. Use
+			 * store-release memory order here to synchronize
+			 * with load-acquire in runstate read functions.
+			 */
+			__atomic_store_n(&lcore_states[i].runstate,
+				RUNSTATE_STOPPED, __ATOMIC_RELEASE);
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
 		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
 				    __ATOMIC_RELAXED);
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -638,9 +667,11 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
@@ -655,7 +686,12 @@ rte_service_lcore_del(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate != RUNSTATE_STOPPED)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_STOPPED)
 		return -EBUSY;
 
 	set_lcore_state(lcore, ROLE_RTE);
@@ -674,13 +710,21 @@ rte_service_lcore_start(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate == RUNSTATE_RUNNING)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING)
 		return -EALREADY;
 
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	cs->runstate = RUNSTATE_RUNNING;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&cs->runstate, RUNSTATE_RUNNING, __ATOMIC_RELEASE);
 
 	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
@@ -693,7 +737,12 @@ rte_service_lcore_stop(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	if (lcore_states[lcore].runstate == RUNSTATE_STOPPED)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&lcore_states[lcore].runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
@@ -713,7 +762,11 @@ rte_service_lcore_stop(uint32_t lcore)
 			return -EBUSY;
 	}
 
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return 0;
 }
-- 
2.17.1


* Re: [dpdk-dev] [PATCH v3 1/6] service: fix race condition for MT unsafe service
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 1/6] service: fix race condition for MT unsafe service Honnappa Nagarahalli
@ 2020-05-05 14:48             ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-05-05 14:48 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, phil.yang
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Eads, Gage, Richardson, Bruce, nd, stable

> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Saturday, May 2, 2020 1:03 AM
> To: dev@dpdk.org; phil.yang@arm.com; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Eads, Gage <gage.eads@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com;
> nd@arm.com; stable@dpdk.org
> Subject: [PATCH v3 1/6] service: fix race condition for MT unsafe service
> 
> The MT unsafe service might get configured to run on another core
> while the service is running currently. This might result in the
> MT unsafe service running on multiple cores simultaneously. Use
> 'execute_lock' always when the service is MT unsafe.
> 
> If the service is known to be mmapped on a single lcore,

mmapped is a typo? Just mapped.

> setting the service capability to MT safe will avoid taking
> the lock and improve the performance.
>
> Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>

Acked-by: Harry van Haaren <harry.van.haaren@intel.com>

<snip diff>

* Re: [dpdk-dev] [PATCH v3 2/6] service: identify service running on another core correctly
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 2/6] service: identify service running on another core correctly Honnappa Nagarahalli
@ 2020-05-05 14:48             ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-05-05 14:48 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, phil.yang
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Eads, Gage, Richardson, Bruce, nd, stable

> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Saturday, May 2, 2020 1:03 AM
> To: dev@dpdk.org; phil.yang@arm.com; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Eads, Gage <gage.eads@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com;
> nd@arm.com; stable@dpdk.org
> Subject: [PATCH v3 2/6] service: identify service running on another core
> correctly
>
> The logic to identify if the MT unsafe service is running on another
> core can return -EBUSY spuriously. In such cases, running the service
> becomes costlier than using atomic operations. Assume that the
> application passes the right parameters and reduces the number of
> instructions for all cases.
> 
> Cc: stable@dpdk.org
> Fixes: 8d39d3e237c2 ("service: fix race in service on app lcore function")

Add "fix" to the title, suggestion:
service: fix identification of service running on other lcore

> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Phil Yang <phil.yang@arm.com>

I believe there may be some optimizations we can apply after this patchset,
as the "num_mapped_cores" variable is no longer used in a significant way
for the atomic selection; however, let's leave that optimization outside
of the 20.05 scope.

With title (see above) & comment (see below) updated.
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>

> ---
<snip some diff>
> @@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id,
> uint32_t serialize_mt_unsafe)
> 
>  	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
> 
> -	/* Atomically add this core to the mapped cores first, then examine if
> -	 * we can run the service. This avoids a race condition between
> -	 * checking the value, and atomically adding to the mapped count.
> +	/* Increment num_mapped_cores to indicate that the service
> +	 * is running on a core.
>  	 */
> -	if (serialize_mt_unsafe)
> -		rte_atomic32_inc(&s->num_mapped_cores);
> +	rte_atomic32_inc(&s->num_mapped_cores);

The comment on the added lines here is a little confusing to me:
"num_mapped_cores" does not indicate that the service "is running on a core";
it indicates the number of lcores mapped to that service. Suggestion below?

/* Increment num_mapped_cores to reflect that this core is
 * now mapped capable of running the service.
 */

> -	if (service_mt_safe(s) == 0 &&
> -			rte_atomic32_read(&s->num_mapped_cores) > 1) {
> -		if (serialize_mt_unsafe)
> -			rte_atomic32_dec(&s->num_mapped_cores);
> -		return -EBUSY;
> -	}
> -
> -	int ret = service_run(id, cs, UINT64_MAX, s);
> +	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);

<snip rest of diff>


* Re: [dpdk-dev] [PATCH v3 3/6] service: remove rte prefix from static functions
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 3/6] service: remove rte prefix from static functions Honnappa Nagarahalli
@ 2020-05-05 14:48             ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-05-05 14:48 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, phil.yang
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Eads, Gage, Richardson, Bruce, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Saturday, May 2, 2020 1:03 AM
> To: dev@dpdk.org; phil.yang@arm.com; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Eads, Gage <gage.eads@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com;
> nd@arm.com
> Subject: [PATCH v3 3/6] service: remove rte prefix from static functions
> 
> From: Phil Yang <phil.yang@arm.com>
> 
> clean up rte prefix from static functions.
> remove unused parameter for service_dump_one function.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Harry van Haaren <harry.van.haaren@intel.com>

* Re: [dpdk-dev] [PATCH v3 4/6] service: remove redundant code
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 4/6] service: remove redundant code Honnappa Nagarahalli
@ 2020-05-05 14:48             ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-05-05 14:48 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, phil.yang
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Eads, Gage, Richardson, Bruce, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Saturday, May 2, 2020 1:03 AM
> To: dev@dpdk.org; phil.yang@arm.com; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Eads, Gage <gage.eads@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com;
> nd@arm.com
> Subject: [PATCH v3 4/6] service: remove redundant code
> 
> From: Phil Yang <phil.yang@arm.com>
> 
> The service id validation is duplicated, remove the redundant code
> in the calling functions.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Harry van Haaren <harry.van.haaren@intel.com>

<snip diff>

* Re: [dpdk-dev] [PATCH v3 5/6] service: optimize with c11 atomics
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 5/6] service: optimize with c11 atomics Honnappa Nagarahalli
@ 2020-05-05 14:48             ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-05-05 14:48 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, phil.yang
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Eads, Gage, Richardson, Bruce, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Saturday, May 2, 2020 1:03 AM
> To: dev@dpdk.org; phil.yang@arm.com; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Eads, Gage <gage.eads@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com;
> nd@arm.com
> Subject: [PATCH v3 5/6] service: optimize with c11 atomics
> 
> From: Phil Yang <phil.yang@arm.com>
> 
> num_mapped_cores is used only as a statistic. Use C11 atomics with
> RELAXED ordering for num_mapped_cores instead of the rte_atomic ops,
> which enforce unnecessary barriers on aarch64.
> 
> Replace the execute_lock operations with rte_spinlock_trylock to avoid
> duplicate code.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Harry van Haaren <harry.van.haaren@intel.com>

<snip diff>

* Re: [dpdk-dev] [PATCH v3 6/6] service: relax barriers with C11 atomics
  2020-05-02  0:02           ` [dpdk-dev] [PATCH v3 6/6] service: relax barriers with C11 atomics Honnappa Nagarahalli
@ 2020-05-05 14:48             ` Van Haaren, Harry
  0 siblings, 0 replies; 219+ messages in thread
From: Van Haaren, Harry @ 2020-05-05 14:48 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, phil.yang
  Cc: thomas, david.marchand, Ananyev, Konstantin, jerinj,
	hemant.agrawal, Eads, Gage, Richardson, Bruce, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Saturday, May 2, 2020 1:03 AM
> To: dev@dpdk.org; phil.yang@arm.com; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: thomas@monjalon.net; david.marchand@redhat.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Eads, Gage <gage.eads@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com;
> nd@arm.com
> Subject: [PATCH v3 6/6] service: relax barriers with C11 atomics
> 
> From: Phil Yang <phil.yang@arm.com>
> 
> The runstate, comp_runstate and app_runstate are used as guard variables
> in the service core lib. To guarantee inter-thread visibility of these
> guard variables, the library uses rte_smp_rmb/rte_smp_wmb. This patch
> uses C11 atomic built-ins to relax these barriers.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Harry van Haaren <harry.van.haaren@intel.com>

<snip diff>

* [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib
  2020-04-23 16:31       ` [dpdk-dev] [PATCH v2 0/6] use c11 atomics for service core lib Phil Yang
                           ` (6 preceding siblings ...)
  2020-05-02  0:02         ` [dpdk-dev] [PATCH v3 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
@ 2020-05-05 21:17         ` Honnappa Nagarahalli
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 1/6] service: fix race condition for MT unsafe service Honnappa Nagarahalli
                             ` (6 more replies)
  7 siblings, 7 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-05 21:17 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

The rte_atomic ops and rte_smp barriers enforce DMB barriers on aarch64.
Using C11 atomics with explicit memory ordering instead of the rte_atomic
ops and rte_smp barriers for inter-thread synchronization can improve
performance on aarch64 with no performance loss on x86.

This patchset contains:
1) fix race condition for MT unsafe service.
2) clean up redundant code.
3) use c11 atomics for service core lib to avoid unnecessary barriers.

v2:
Still waiting on Harry for the final solution on the MT unsafe race
condition issue. But I have incorporated the comments so far.
1. add 'Fixes' tag for bug-fix patches.
2. remove 'Fixes' tag for code cleanup patches.
3. remove unused parameter for service_dump_one function.
4. replace the execute_lock atomic CAS operation with rte_spinlock_trylock.
5. use c11 atomics with RELAXED memory ordering for num_mapped_cores.
6. relax barriers for guard variables runstate, comp_runstate and
   app_runstate with c11 one-way barriers.

v3:
Sending this version since Phil is on holiday.
1. Updated the API documentation to indicate how the locking
   can be avoided.

v4:
1. Fix the nits in 2/6 commit message and comments in code.

Honnappa Nagarahalli (2):
  service: fix race condition for MT unsafe service
  service: fix identification of service running on other lcore

Phil Yang (4):
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 atomics
  service: relax barriers with C11 atomics

 lib/librte_eal/common/rte_service.c           | 234 ++++++++++--------
 lib/librte_eal/include/rte_service.h          |   8 +-
 .../include/rte_service_component.h           |   6 +-
 lib/librte_eal/meson.build                    |   4 +
 4 files changed, 141 insertions(+), 111 deletions(-)

-- 
2.17.1


* [dpdk-dev] [PATCH v4 1/6] service: fix race condition for MT unsafe service
  2020-05-05 21:17         ` [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
@ 2020-05-05 21:17           ` Honnappa Nagarahalli
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 2/6] service: fix identification of service running on other lcore Honnappa Nagarahalli
                             ` (5 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-05 21:17 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd, stable

The MT unsafe service might get configured to run on another core
while the service is running currently. This might result in the
MT unsafe service running on multiple cores simultaneously. Use
'execute_lock' always when the service is MT unsafe.

If the service is known to be mapped on a single lcore,
setting the service capability to MT safe will avoid taking
the lock and improve the performance.
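
The locking rule described above might look roughly like this in isolation.
This is a sketch under assumptions rather than the rte_service.c code: the
demo_* names and the struct layout are hypothetical, and the lock is built
from the compiler's __atomic_test_and_set/__atomic_clear built-ins. An MT
unsafe service always goes through the try-lock and reports -EBUSY when
another core holds it; an MT safe service skips the lock entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical service descriptor; field names are illustrative only. */
struct demo_service {
	int mt_safe;        /* non-zero: callback may run concurrently */
	bool execute_lock;  /* false = free, true = held */
	int (*callback)(void *);
	void *arg;
};

/* Example callback: counts how many times it has been invoked. */
static int
demo_count_calls(void *arg)
{
	++*(int *)arg;
	return 0;
}

static int
demo_service_run(struct demo_service *s)
{
	int ret;

	assert(s != NULL && s->callback != NULL);

	/* MT safe services never take the lock. */
	if (s->mt_safe)
		return s->callback(s->arg);

	/* test-and-set returns the previous value: true means the lock
	 * is already held, so back off instead of spinning.
	 */
	if (__atomic_test_and_set(&s->execute_lock, __ATOMIC_ACQUIRE))
		return -EBUSY;

	ret = s->callback(s->arg);
	__atomic_clear(&s->execute_lock, __ATOMIC_RELEASE);
	return ret;
}
```

Returning -EBUSY rather than spinning lets the calling core move on to
other work while the lock is held elsewhere.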

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c            | 11 +++++------
 lib/librte_eal/include/rte_service.h           |  8 ++++++--
 lib/librte_eal/include/rte_service_component.h |  6 +++++-
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 70d17a5d7..b8c465eb9 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
index d8701dd4c..3a1c735c5 100644
--- a/lib/librte_eal/include/rte_service.h
+++ b/lib/librte_eal/include/rte_service.h
@@ -104,12 +104,16 @@ int32_t rte_service_probe_capability(uint32_t id, uint32_t capability);
  * Each core can be added or removed from running a specific service. This
  * function enables or disables *lcore* to run *service_id*.
  *
- * If multiple cores are enabled on a service, an atomic is used to ensure that
- * only one cores runs the service at a time. The exception to this is when
+ * If multiple cores are enabled on a service, a lock is used to ensure that
+ * only one core runs the service at a time. The exception to this is when
  * a service indicates that it is multi-thread safe by setting the capability
  * called RTE_SERVICE_CAP_MT_SAFE. With the multi-thread safe capability set,
  * the service function can be run on multiple threads at the same time.
  *
+ * If the service is known to be mapped to a single lcore, setting the
+ * capability of the service to RTE_SERVICE_CAP_MT_SAFE can achieve
+ * better performance by avoiding the use of the lock.
+ *
  * @param service_id the service to apply the lcore to
  * @param lcore The lcore that will be mapped to service
  * @param enable Zero to unmap or disable the core, non-zero to enable
diff --git a/lib/librte_eal/include/rte_service_component.h b/lib/librte_eal/include/rte_service_component.h
index 16eab79ee..b75aba11b 100644
--- a/lib/librte_eal/include/rte_service_component.h
+++ b/lib/librte_eal/include/rte_service_component.h
@@ -43,7 +43,7 @@ struct rte_service_spec {
 /**
  * Register a new service.
  *
- * A service represents a component that the requires CPU time periodically to
+ * A service represents a component that requires CPU time periodically to
  * achieve its purpose.
  *
  * For example the eventdev SW PMD requires CPU cycles to perform its
@@ -56,6 +56,10 @@ struct rte_service_spec {
  * *rte_service_component_runstate_set*, which indicates that the service
  * component is ready to be executed.
  *
+ * If the service is known to be mapped to a single lcore, setting the
+ * capability of the service to RTE_SERVICE_CAP_MT_SAFE can achieve
+ * better performance.
+ *
  * @param spec The specification of the service to register
  * @param[out] service_id A pointer to a uint32_t, which will be filled in
  *             during registration of the service. It is set to the integers
-- 
2.17.1



* [dpdk-dev] [PATCH v4 2/6] service: fix identification of service running on other lcore
  2020-05-05 21:17         ` [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 1/6] service: fix race condition for MT unsafe service Honnappa Nagarahalli
@ 2020-05-05 21:17           ` Honnappa Nagarahalli
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 3/6] service: remove rte prefix from static functions Honnappa Nagarahalli
                             ` (4 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-05 21:17 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd, stable

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters and reduce the number of
instructions for all cases.
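
As a rough illustration of the resulting control flow (not the code in this
patch), the sketch below takes the execute lock only when the service is MT
unsafe and the caller requests serialization. The struct demo_service type and
the demo_service_run function are hypothetical names, and the GCC __atomic
compare-exchange built-in stands in for rte_atomic32_cmpset.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-service state; not the DPDK struct. */
struct demo_service {
	int mt_safe;            /* non-zero if the callback is MT safe */
	uint32_t execute_lock;  /* 0 = free, 1 = held */
	uint64_t calls;         /* stands in for running the callback */
};

/* Run one iteration of the service. The lock is taken only when the
 * service is MT unsafe and the caller asked for serialization.
 */
static int
demo_service_run(struct demo_service *s, uint32_t serialize_mt_unsafe)
{
	if (!s->mt_safe && serialize_mt_unsafe == 1) {
		uint32_t expected = 0;

		/* Try to take the lock; give up at once if another core
		 * already holds it.
		 */
		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected,
				1, 0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return -EBUSY;

		s->calls++;
		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
	} else {
		/* No lock: the service is MT safe, or the application
		 * guarantees that only one core runs it.
		 */
		s->calls++;
	}

	return 0;
}
```

With serialize_mt_unsafe set to 0 the call never returns -EBUSY, which is the
case where the application is assumed to pass the right parameters.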

Cc: stable@dpdk.org
Fixes: 8d39d3e237c2 ("service: fix race in service on app lcore function")

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b8c465eb9..c283408cf 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -360,7 +360,7 @@ rte_service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -374,7 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to reflect that this core is
+	 * now mapped capable of running the service.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -449,7 +439,7 @@ rte_service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.17.1



* [dpdk-dev] [PATCH v4 3/6] service: remove rte prefix from static functions
  2020-05-05 21:17         ` [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 1/6] service: fix race condition for MT unsafe service Honnappa Nagarahalli
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 2/6] service: fix identification of service running on other lcore Honnappa Nagarahalli
@ 2020-05-05 21:17           ` Honnappa Nagarahalli
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 4/6] service: remove redundant code Honnappa Nagarahalli
                             ` (3 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-05 21:17 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

From: Phil Yang <phil.yang@arm.com>

Clean up the rte prefix from static functions.
Remove the unused parameter from the service_dump_one function.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 34 ++++++++++-------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index c283408cf..62ea9cbd6 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -340,7 +340,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -378,10 +378,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -425,14 +425,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (cs->runstate == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -693,9 +693,9 @@ rte_service_lcore_start(uint32_t lcore)
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
+	cs->runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -774,13 +774,9 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
-		     uint64_t all_cycles, uint32_t reset)
+service_dump_one(FILE *f, struct rte_service_spec_impl *s, uint32_t reset)
 {
 	/* avoid divide by zero */
-	if (all_cycles == 0)
-		all_cycles = 1;
-
 	int calls = 1;
 	if (s->calls != 0)
 		calls = s->calls;
@@ -807,7 +803,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, reset);
 	return 0;
 }
 
@@ -851,21 +847,13 @@ rte_service_dump(FILE *f, uint32_t id)
 	uint32_t i;
 	int print_one = (id != UINT32_MAX);
 
-	uint64_t total_cycles = 0;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i))
-			continue;
-		total_cycles += rte_services[i].cycles_spent;
-	}
-
 	/* print only the specified service */
 	if (print_one) {
 		struct rte_service_spec_impl *s;
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, reset);
 		return 0;
 	}
 
@@ -875,7 +863,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.17.1



* [dpdk-dev] [PATCH v4 4/6] service: remove redundant code
  2020-05-05 21:17         ` [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
                             ` (2 preceding siblings ...)
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 3/6] service: remove rte prefix from static functions Honnappa Nagarahalli
@ 2020-05-05 21:17           ` Honnappa Nagarahalli
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 5/6] service: optimize with c11 atomics Honnappa Nagarahalli
                             ` (2 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-05 21:17 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

From: Phil Yang <phil.yang@arm.com>

The service ID validation is duplicated. Remove the redundant code
in the calling functions.
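
A minimal sketch of the idea, assuming simplified hypothetical tables
(demo_service_registered, demo_is_service_core, demo_service_mask) in place
of the real service and lcore state: the ID and lcore checks live in one
helper, so the public wrappers do not need to repeat them.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits and tables, for illustration only. */
#define DEMO_SERVICE_NUM_MAX 64
#define DEMO_MAX_LCORE 128

static uint8_t demo_service_registered[DEMO_SERVICE_NUM_MAX];
static uint8_t demo_is_service_core[DEMO_MAX_LCORE];
static uint64_t demo_service_mask[DEMO_MAX_LCORE];

/* Validate the service ID and lcore once, then optionally update and/or
 * report the mapping of service 'sid' on 'lcore'.
 */
static int
demo_service_update(uint32_t sid, uint32_t lcore,
		const uint32_t *set, uint32_t *enabled)
{
	if (sid >= DEMO_SERVICE_NUM_MAX || !demo_service_registered[sid] ||
	    lcore >= DEMO_MAX_LCORE || !demo_is_service_core[lcore])
		return -EINVAL;

	uint64_t sid_mask = UINT64_C(1) << sid;

	if (set) {
		if (*set)
			demo_service_mask[lcore] |= sid_mask;
		else
			demo_service_mask[lcore] &= ~sid_mask;
	}

	if (enabled)
		*enabled = !!(demo_service_mask[lcore] & sid_mask);

	return 0;
}
```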

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 62ea9cbd6..37c16c4bc 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -541,24 +541,12 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
-		return -EINVAL;
-
-	if (!lcore_states[lcore].is_service_core)
+	/* validate ID, or return error value */
+	if (sid >= RTE_SERVICE_NUM_MAX || !service_valid(sid) ||
+	    lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
@@ -587,19 +575,15 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.17.1



* [dpdk-dev] [PATCH v4 5/6] service: optimize with c11 atomics
  2020-05-05 21:17         ` [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
                             ` (3 preceding siblings ...)
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 4/6] service: remove redundant code Honnappa Nagarahalli
@ 2020-05-05 21:17           ` Honnappa Nagarahalli
  2020-05-06 10:20             ` Phil Yang
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 6/6] service: relax barriers with C11 atomics Honnappa Nagarahalli
  2020-05-06 10:24           ` [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib Phil Yang
  6 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-05 21:17 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

From: Phil Yang <phil.yang@arm.com>

The num_mapped_cores counter is used as a statistic. Use C11 atomics with
RELAXED ordering for num_mapped_cores instead of rte_atomic ops, which
enforce unnecessary barriers on aarch64.

Replace the execute_lock operations with rte_spinlock_trylock to avoid
duplicate code.
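
For readers unfamiliar with the pattern, here is a small sketch of a counter
maintained with RELAXED ordering. The demo_* names are hypothetical; only the
GCC __atomic built-ins are the ones used in this patch. Each update is still
atomic, but no ordering is imposed on surrounding memory accesses, which is
sufficient for a value that is only read as a count.

```c
#include <stdint.h>

/* Hypothetical counter standing in for num_mapped_cores. */
static uint32_t demo_num_mapped_cores;

/* Atomically increment the counter without any memory barrier. */
static void
demo_map_core(void)
{
	__atomic_add_fetch(&demo_num_mapped_cores, 1, __ATOMIC_RELAXED);
}

/* Atomically decrement the counter without any memory barrier. */
static void
demo_unmap_core(void)
{
	__atomic_sub_fetch(&demo_num_mapped_cores, 1, __ATOMIC_RELAXED);
}

/* Read the current count; the value may already be stale on return. */
static uint32_t
demo_mapped_cores(void)
{
	return __atomic_load_n(&demo_num_mapped_cores, __ATOMIC_RELAXED);
}
```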

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 32 ++++++++++++++++-------------
 lib/librte_eal/meson.build          |  4 ++++
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 37c16c4bc..5d35f8a8d 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -20,6 +20,7 @@
 #include <rte_atomic.h>
 #include <rte_memory.h>
 #include <rte_malloc.h>
+#include <rte_spinlock.h>
 
 #include "eal_private.h"
 
@@ -38,11 +39,11 @@ struct rte_service_spec_impl {
 	/* public part of the struct */
 	struct rte_service_spec spec;
 
-	/* atomic lock that when set indicates a service core is currently
+	/* spin lock that when set indicates a service core is currently
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	rte_spinlock_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +55,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	uint32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -332,7 +333,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -375,11 +377,11 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		if (!rte_spinlock_trylock(&s->execute_lock))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		rte_spinlock_unlock(&s->execute_lock);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -415,11 +417,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to reflect that this core is
 	 * now mapped capable of running the service.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	return ret;
 }
@@ -556,19 +558,19 @@ service_update(uint32_t sid, uint32_t lcore,
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -616,7 +618,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -699,7 +702,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 0267c3b9d..c2d7a6954 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -21,3 +21,7 @@ endif
 if cc.has_header('getopt.h')
 	cflags += ['-DHAVE_GETOPT_H', '-DHAVE_GETOPT', '-DHAVE_GETOPT_LONG']
 endif
+# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
+if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
+    ext_deps += cc.find_library('atomic')
+endif
-- 
2.17.1



* [dpdk-dev] [PATCH v4 6/6] service: relax barriers with C11 atomics
  2020-05-05 21:17         ` [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
                             ` (4 preceding siblings ...)
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 5/6] service: optimize with c11 atomics Honnappa Nagarahalli
@ 2020-05-05 21:17           ` Honnappa Nagarahalli
  2020-05-06 10:24           ` [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib Phil Yang
  6 siblings, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-05 21:17 UTC (permalink / raw)
  To: dev, phil.yang, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	honnappa.nagarahalli, nd

From: Phil Yang <phil.yang@arm.com>

The runstate, comp_runstate and app_runstate are used as guard variables
in the service core lib. To guarantee the inter-thread visibility of
these guard variables, the library uses rte_smp_rmb/rte_smp_wmb. This
patch uses C11 atomic built-ins to relax these barriers.
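
The guard-variable pattern can be sketched as follows. This is an
illustration under assumed names (struct demo_state, demo_start, demo_poll),
not the library code: the writer fills in plain data and then publishes it
with a store-release on the guard, and the reader touches the data only after
a load-acquire has observed the guard.

```c
#include <stdint.h>

enum { DEMO_STOPPED = 0, DEMO_RUNNING = 1 };

/* Hypothetical state: 'config' is plain data, 'runstate' is the guard. */
struct demo_state {
	uint64_t config;
	int8_t runstate;
};

/* Writer: the store-release guarantees that the write to 'config' is
 * visible to any thread that observes DEMO_RUNNING with a load-acquire.
 */
static void
demo_start(struct demo_state *st, uint64_t config)
{
	st->config = config;
	__atomic_store_n(&st->runstate, DEMO_RUNNING, __ATOMIC_RELEASE);
}

/* Reader: returns 1 and copies 'config' if the guard is set, else 0. */
static int
demo_poll(struct demo_state *st, uint64_t *config_out)
{
	if (__atomic_load_n(&st->runstate, __ATOMIC_ACQUIRE) != DEMO_RUNNING)
		return 0;

	*config_out = st->config;
	return 1;
}
```

Unlike a full barrier, the release and acquire orderings are one-way: they
only constrain accesses on one side of the guard access.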

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 115 ++++++++++++++++++++--------
 1 file changed, 84 insertions(+), 31 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 5d35f8a8d..3bae7d66d 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -265,7 +265,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
 	rte_service_count++;
 
 	if (id_ptr)
@@ -282,7 +281,6 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
@@ -301,12 +299,17 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* comp_runstate acts as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in
+	 * the service_run and rte_service_runstate_get functions.
+	 */
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -316,12 +319,17 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* app_runstate acts as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in
+	 * the service_run and rte_service_runstate_get functions.
+	 */
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -330,15 +338,24 @@ rte_service_runstate_get(uint32_t id)
 {
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
-	rte_smp_rmb();
 
-	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING &&
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
+		int check_disabled = !(s->internal_flags &
+					SERVICE_F_START_CHECK);
+		int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
 					    __ATOMIC_RELAXED) > 0);
 
-	return (s->app_runstate == RUNSTATE_RUNNING) &&
-		(s->comp_runstate == RUNSTATE_RUNNING) &&
-		(check_disabled | lcore_mapped);
+		return (check_disabled | lcore_mapped);
+	} else
+		return 0;
+
 }
 
 static inline void
@@ -367,9 +384,15 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	if (!s)
 		return -EINVAL;
 
-	if (s->comp_runstate != RUNSTATE_RUNNING ||
-			s->app_runstate != RUNSTATE_RUNNING ||
-			!(service_mask & (UINT64_C(1) << i))) {
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		!(service_mask & (UINT64_C(1) << i))) {
 		cs->service_active_on_lcore[i] = 0;
 		return -ENOEXEC;
 	}
@@ -434,7 +457,12 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (cs->runstate == RUNSTATE_RUNNING) {
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	while (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -445,8 +473,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -614,15 +640,18 @@ rte_service_lcore_reset_all(void)
 		if (lcore_states[i].is_service_core) {
 			lcore_states[i].service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
-			lcore_states[i].runstate = RUNSTATE_STOPPED;
+			/* runstate acts as the guard variable. Use
+			 * store-release memory order here to synchronize
+			 * with load-acquire in runstate read functions.
+			 */
+			__atomic_store_n(&lcore_states[i].runstate,
+				RUNSTATE_STOPPED, __ATOMIC_RELEASE);
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
 		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
 				    __ATOMIC_RELAXED);
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -638,9 +667,11 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
@@ -655,7 +686,12 @@ rte_service_lcore_del(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate != RUNSTATE_STOPPED)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_STOPPED)
 		return -EBUSY;
 
 	set_lcore_state(lcore, ROLE_RTE);
@@ -674,13 +710,21 @@ rte_service_lcore_start(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate == RUNSTATE_RUNNING)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING)
 		return -EALREADY;
 
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	cs->runstate = RUNSTATE_RUNNING;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&cs->runstate, RUNSTATE_RUNNING, __ATOMIC_RELEASE);
 
 	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
@@ -693,7 +737,12 @@ rte_service_lcore_stop(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	if (lcore_states[lcore].runstate == RUNSTATE_STOPPED)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&lcore_states[lcore].runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
@@ -713,7 +762,11 @@ rte_service_lcore_stop(uint32_t lcore)
 			return -EBUSY;
 	}
 
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return 0;
 }
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v4 5/6] service: optimize with c11 atomics
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 5/6] service: optimize with c11 atomics Honnappa Nagarahalli
@ 2020-05-06 10:20             ` Phil Yang
  0 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 10:20 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa Nagarahalli, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Wednesday, May 6, 2020 5:18 AM
> To: dev@dpdk.org; Phil Yang <Phil.Yang@arm.com>;
> harry.van.haaren@intel.com
> Cc: thomas@monjalon.net; david.marchand@redhat.com;
> konstantin.ananyev@intel.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; gage.eads@intel.com;
> bruce.richardson@intel.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: [PATCH v4 5/6] service: optimize with c11 atomics
> 
> From: Phil Yang <phil.yang@arm.com>
> 
> The num_mapped_cores is used as a statistics. Use c11 atomics with
> RELAXED ordering for num_mapped_cores instead of rte_atomic ops which
> enforce unnessary barriers on aarch64.
> 
> Replace execute_lock operations to spinlock_try_lock to avoid duplicate
> code.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
> ---
>  lib/librte_eal/common/rte_service.c | 32 ++++++++++++++++-------------
>  lib/librte_eal/meson.build          |  4 ++++
>  2 files changed, 22 insertions(+), 14 deletions(-)
> 
<snip>

> diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
> index 0267c3b9d..c2d7a6954 100644
> --- a/lib/librte_eal/meson.build
> +++ b/lib/librte_eal/meson.build
> @@ -21,3 +21,7 @@ endif
>  if cc.has_header('getopt.h')
>  	cflags += ['-DHAVE_GETOPT_H', '-DHAVE_GETOPT', '-
> DHAVE_GETOPT_LONG']
>  endif
> +# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
> +if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
> +    ext_deps += cc.find_library('atomic')
> +endif

We can remove this, as it has already been added globally by:
"da4eae278b56 - build: add global libatomic dependency for 32-bit clang"

I've updated it in v5.

Thanks,
Phil
> --
> 2.17.1



* [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib
  2020-05-05 21:17         ` [dpdk-dev] [PATCH v4 0/6] use c11 atomics for service core lib Honnappa Nagarahalli
                             ` (5 preceding siblings ...)
  2020-05-05 21:17           ` [dpdk-dev] [PATCH v4 6/6] service: relax barriers with C11 atomics Honnappa Nagarahalli
@ 2020-05-06 10:24           ` Phil Yang
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 1/6] service: fix race condition for MT unsafe service Phil Yang
                               ` (6 more replies)
  6 siblings, 7 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 10:24 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

The rte_atomic ops and rte_smp barriers enforce DMB barriers on aarch64.
Using C11 atomics with explicit memory ordering instead of the rte_atomic
ops and rte_smp barriers for inter-thread synchronization can improve
performance on aarch64 with no performance loss on x86.

This patchset contains:
1) fix race condition for MT unsafe service.
2) clean up redundant code.
3) use c11 atomics for service core lib to avoid unnecessary barriers.

v2:
Still waiting on Harry for the final solution on the MT unsafe race
condition issue. But I have incorporated the comments so far.
1. add 'Fixes' tag for bug-fix patches.
2. remove 'Fixes' tag for code cleanup patches.
3. remove unused parameter for service_dump_one function.
4. replace the execute_lock atomic CAS operation with rte_spinlock_trylock.
5. use c11 atomics with RELAXED memory ordering for num_mapped_cores.
6. relax barriers for guard variables runstate, comp_runstate and
   app_runstate with c11 one-way barriers.

v3:
Sending this version since Phil is on holiday.
1. Updated the API documentation to indicate how the locking
   can be avoided.

v4:
1. Fix the nits in 2/6 commit message and comments in code.

v5:
1. Remove the redundant libatomic dependency for clang from the EAL
   meson.build, as it has already been added globally (commit da4eae278b56).

Honnappa Nagarahalli (2):
  service: fix race condition for MT unsafe service
  service: fix identification of service running on other lcore

Phil Yang (4):
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 atomics
  service: relax barriers with C11 atomics

 lib/librte_eal/common/rte_service.c            | 234 +++++++++++++------------
 lib/librte_eal/include/rte_service.h           |   8 +-
 lib/librte_eal/include/rte_service_component.h |   6 +-
 3 files changed, 137 insertions(+), 111 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH v5 1/6] service: fix race condition for MT unsafe service
  2020-05-06 10:24           ` [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib Phil Yang
@ 2020-05-06 10:24             ` Phil Yang
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 2/6] service: fix identification of service running on other lcore Phil Yang
                               ` (5 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 10:24 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd, Honnappa Nagarahalli, stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The MT unsafe service might get configured to run on another core
while the service is running currently. This might result in the
MT unsafe service running on multiple cores simultaneously. Use
'execute_lock' always when the service is MT unsafe.

If the service is known to be mapped on a single lcore,
setting the service capability to MT safe will avoid taking
the lock and improve the performance.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c            | 11 +++++------
 lib/librte_eal/include/rte_service.h           |  8 ++++++--
 lib/librte_eal/include/rte_service_component.h |  6 +++++-
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 70d17a5..b8c465e 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
index d8701dd..3a1c735 100644
--- a/lib/librte_eal/include/rte_service.h
+++ b/lib/librte_eal/include/rte_service.h
@@ -104,12 +104,16 @@ int32_t rte_service_probe_capability(uint32_t id, uint32_t capability);
  * Each core can be added or removed from running a specific service. This
  * function enables or disables *lcore* to run *service_id*.
  *
- * If multiple cores are enabled on a service, an atomic is used to ensure that
- * only one cores runs the service at a time. The exception to this is when
+ * If multiple cores are enabled on a service, a lock is used to ensure that
+ * only one core runs the service at a time. The exception to this is when
  * a service indicates that it is multi-thread safe by setting the capability
  * called RTE_SERVICE_CAP_MT_SAFE. With the multi-thread safe capability set,
  * the service function can be run on multiple threads at the same time.
  *
+ * If the service is known to be mapped to a single lcore, setting the
+ * capability of the service to RTE_SERVICE_CAP_MT_SAFE can achieve
+ * better performance by avoiding the use of lock.
+ *
  * @param service_id the service to apply the lcore to
  * @param lcore The lcore that will be mapped to service
  * @param enable Zero to unmap or disable the core, non-zero to enable
diff --git a/lib/librte_eal/include/rte_service_component.h b/lib/librte_eal/include/rte_service_component.h
index 16eab79..b75aba1 100644
--- a/lib/librte_eal/include/rte_service_component.h
+++ b/lib/librte_eal/include/rte_service_component.h
@@ -43,7 +43,7 @@ struct rte_service_spec {
 /**
  * Register a new service.
  *
- * A service represents a component that the requires CPU time periodically to
+ * A service represents a component that requires CPU time periodically to
  * achieve its purpose.
  *
  * For example the eventdev SW PMD requires CPU cycles to perform its
@@ -56,6 +56,10 @@ struct rte_service_spec {
  * *rte_service_component_runstate_set*, which indicates that the service
  * component is ready to be executed.
  *
+ * If the service is known to be mapped to a single lcore, setting the
+ * capability of the service to RTE_SERVICE_CAP_MT_SAFE can achieve
+ * better performance.
+ *
  * @param spec The specification of the service to register
  * @param[out] service_id A pointer to a uint32_t, which will be filled in
  *             during registration of the service. It is set to the integers
-- 
2.7.4



* [dpdk-dev] [PATCH v5 2/6] service: fix identification of service running on other lcore
  2020-05-06 10:24           ` [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib Phil Yang
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 1/6] service: fix race condition for MT unsafe service Phil Yang
@ 2020-05-06 10:24             ` Phil Yang
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 3/6] service: remove rte prefix from static functions Phil Yang
                               ` (4 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 10:24 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd, Honnappa Nagarahalli, stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters and reduce the number of
instructions for all cases.

Cc: stable@dpdk.org
Fixes: 8d39d3e237c2 ("service: fix race in service on app lcore function")

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b8c465e..c283408 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -360,7 +360,7 @@ rte_service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -374,7 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to reflect that this core is
+	 * now mapped capable of running the service.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -449,7 +439,7 @@ rte_service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.7.4
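
As a rough illustration of the behaviour after patches 1/6 and 2/6, the
serialization decision in service_run() can be sketched outside DPDK as
below. This is only a sketch under stated assumptions: struct
sketch_service and the sketch_* functions are hypothetical stand-ins for
rte_service_spec_impl and service_run(), and a C11 atomic_flag replaces
the execute_lock used in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_service_spec_impl. */
struct sketch_service {
	atomic_flag execute_lock; /* set while a core runs the callback */
	int mt_safe;              /* non-zero: callback may run concurrently */
	uint64_t calls;           /* plain counter, as in the original code */
};

static void
sketch_do_callback(struct sketch_service *s)
{
	assert(s != NULL);
	s->calls++;
}

/* Run one iteration. The lock is taken only when the service is MT unsafe
 * and the caller asked for serialization; the number of mapped cores no
 * longer plays a part in the decision.
 */
static int
sketch_service_run(struct sketch_service *s, uint32_t serialize_mt_unsafe)
{
	if (s == NULL)
		return -EINVAL;

	if (!s->mt_safe && serialize_mt_unsafe) {
		/* trylock: test-and-set returns the previous flag value */
		if (atomic_flag_test_and_set_explicit(&s->execute_lock,
				memory_order_acquire))
			return -EBUSY;

		sketch_do_callback(s);
		atomic_flag_clear_explicit(&s->execute_lock,
				memory_order_release);
	} else {
		sketch_do_callback(s);
	}

	return 0;
}
```

A caller that passes serialize_mt_unsafe as 0 takes responsibility for
making sure no other core runs the same MT unsafe service at that time;
the sketch, like the patch, does not check this.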



* [dpdk-dev] [PATCH v5 3/6] service: remove rte prefix from static functions
  2020-05-06 10:24           ` [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib Phil Yang
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 1/6] service: fix race condition for MT unsafe service Phil Yang
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 2/6] service: fix identification of service running on other lcore Phil Yang
@ 2020-05-06 10:24             ` Phil Yang
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 4/6] service: remove redundant code Phil Yang
                               ` (3 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 10:24 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

Remove the rte prefix from static functions.
Remove the unused parameter of the service_dump_one function.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 34 +++++++++++-----------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index c283408..62ea9cb 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -340,7 +340,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -378,10 +378,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -425,14 +425,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (cs->runstate == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -693,9 +693,9 @@ rte_service_lcore_start(uint32_t lcore)
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
+	cs->runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -774,13 +774,9 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
-		     uint64_t all_cycles, uint32_t reset)
+service_dump_one(FILE *f, struct rte_service_spec_impl *s, uint32_t reset)
 {
 	/* avoid divide by zero */
-	if (all_cycles == 0)
-		all_cycles = 1;
-
 	int calls = 1;
 	if (s->calls != 0)
 		calls = s->calls;
@@ -807,7 +803,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, reset);
 	return 0;
 }
 
@@ -851,21 +847,13 @@ rte_service_dump(FILE *f, uint32_t id)
 	uint32_t i;
 	int print_one = (id != UINT32_MAX);
 
-	uint64_t total_cycles = 0;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i))
-			continue;
-		total_cycles += rte_services[i].cycles_spent;
-	}
-
 	/* print only the specified service */
 	if (print_one) {
 		struct rte_service_spec_impl *s;
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, reset);
 		return 0;
 	}
 
@@ -875,7 +863,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.7.4



* [dpdk-dev] [PATCH v5 4/6] service: remove redundant code
  2020-05-06 10:24           ` [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib Phil Yang
                               ` (2 preceding siblings ...)
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 3/6] service: remove rte prefix from static functions Phil Yang
@ 2020-05-06 10:24             ` Phil Yang
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 5/6] service: optimize with c11 atomics Phil Yang
                               ` (2 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 10:24 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

The service id validation is duplicated. Remove the redundant code
in the calling functions.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 62ea9cb..37c16c4 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -541,24 +541,12 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
-		return -EINVAL;
-
-	if (!lcore_states[lcore].is_service_core)
+	/* validate ID, or return error value */
+	if (sid >= RTE_SERVICE_NUM_MAX || !service_valid(sid) ||
+	    lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
@@ -587,19 +575,15 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.7.4
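
For readers following along without the DPDK tree, the single validation
point that service_update() ends up with can be sketched as below. This
is a minimal sketch, not the library code: the SKETCH_* limits and the
sketch_* tables are hypothetical stand-ins for RTE_SERVICE_NUM_MAX,
RTE_MAX_LCORE, rte_services[] and lcore_states[], and the
num_mapped_cores bookkeeping is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limits and tables for illustration only. */
#define SKETCH_SERVICE_NUM_MAX 64
#define SKETCH_MAX_LCORE 128

static uint8_t sketch_service_registered[SKETCH_SERVICE_NUM_MAX];
static uint8_t sketch_is_service_core[SKETCH_MAX_LCORE];
static uint64_t sketch_lcore_mask[SKETCH_MAX_LCORE];

/* Set (when 'set' is non-NULL) and/or query (when 'enabled' is non-NULL)
 * the mapping of service 'sid' onto 'lcore'.
 */
static int32_t
sketch_service_update(uint32_t sid, uint32_t lcore,
		const uint32_t *set, uint32_t *enabled)
{
	/* validate both IDs once, before indexing any table */
	if (sid >= SKETCH_SERVICE_NUM_MAX || !sketch_service_registered[sid] ||
	    lcore >= SKETCH_MAX_LCORE || !sketch_is_service_core[lcore])
		return -EINVAL;

	assert(set != NULL || enabled != NULL);

	const uint64_t sid_mask = UINT64_C(1) << sid;

	if (set != NULL) {
		if (*set)
			sketch_lcore_mask[lcore] |= sid_mask;
		else
			sketch_lcore_mask[lcore] &= ~sid_mask;
	}

	if (enabled != NULL)
		*enabled = !!(sketch_lcore_mask[lcore] & sid_mask);

	return 0;
}
```

Because the ID is checked inside the helper, the public set/get wrappers
can pass the raw id straight through, which is what the diff above does.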



* [dpdk-dev] [PATCH v5 5/6] service: optimize with c11 atomics
  2020-05-06 10:24           ` [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib Phil Yang
                               ` (3 preceding siblings ...)
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 4/6] service: remove redundant code Phil Yang
@ 2020-05-06 10:24             ` Phil Yang
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 6/6] service: relax barriers with C11 atomics Phil Yang
  2020-05-06 15:27             ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib Phil Yang
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 10:24 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

The num_mapped_cores is used as a statistic. Use c11 atomics with
RELAXED ordering for num_mapped_cores instead of rte_atomic ops, which
enforce unnecessary barriers on aarch64.

Replace the execute_lock operations with rte_spinlock_trylock to avoid
duplicate code.

Change-Id: I2edf1feb64c9192fc7577be741865ea61c8680cd
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 37c16c4..5d35f8a 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -20,6 +20,7 @@
 #include <rte_atomic.h>
 #include <rte_memory.h>
 #include <rte_malloc.h>
+#include <rte_spinlock.h>
 
 #include "eal_private.h"
 
@@ -38,11 +39,11 @@ struct rte_service_spec_impl {
 	/* public part of the struct */
 	struct rte_service_spec spec;
 
-	/* atomic lock that when set indicates a service core is currently
+	/* spin lock that when set indicates a service core is currently
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	rte_spinlock_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +55,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	uint32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -332,7 +333,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -375,11 +377,11 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		if (!rte_spinlock_trylock(&s->execute_lock))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		rte_spinlock_unlock(&s->execute_lock);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -415,11 +417,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to reflect that this core is
 	 * now mapped capable of running the service.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	return ret;
 }
@@ -556,19 +558,19 @@ service_update(uint32_t sid, uint32_t lcore,
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -616,7 +618,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -699,7 +702,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
-- 
2.7.4
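
The counter handling in this patch can be tried in isolation with the
sketch below. It assumes GCC or clang (the __atomic built-ins are
compiler extensions), and struct sketch_counter with its sketch_*
helpers is a hypothetical stand-in for the num_mapped_cores field and
its call sites, not DPDK code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the num_mapped_cores field. */
struct sketch_counter {
	uint32_t num_mapped_cores;
};

/* RELAXED keeps each update atomic but orders nothing around it, which
 * is enough when the value is only a count and guards no other data.
 */
static inline void
sketch_map_core(struct sketch_counter *c)
{
	assert(c != NULL);
	__atomic_add_fetch(&c->num_mapped_cores, 1, __ATOMIC_RELAXED);
}

static inline void
sketch_unmap_core(struct sketch_counter *c)
{
	assert(c != NULL);
	__atomic_sub_fetch(&c->num_mapped_cores, 1, __ATOMIC_RELAXED);
}

static inline int
sketch_is_mapped(struct sketch_counter *c)
{
	assert(c != NULL);
	return __atomic_load_n(&c->num_mapped_cores, __ATOMIC_RELAXED) > 0;
}

static inline void
sketch_reset(struct sketch_counter *c)
{
	assert(c != NULL);
	__atomic_store_n(&c->num_mapped_cores, 0, __ATOMIC_RELAXED);
}
```

If some other data had to become visible together with the count, RELAXED
would not be sufficient and a release/acquire pair would be needed
instead.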



* [dpdk-dev] [PATCH v5 6/6] service: relax barriers with C11 atomics
  2020-05-06 10:24           ` [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib Phil Yang
                               ` (4 preceding siblings ...)
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 5/6] service: optimize with c11 atomics Phil Yang
@ 2020-05-06 10:24             ` Phil Yang
  2020-05-06 15:27             ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib Phil Yang
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 10:24 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

The runstate, comp_runstate and app_runstate are used as guard variables
in the service core lib. To guarantee the inter-thread visibility of
these guard variables, it uses rte_smp_r/wmb. This patch uses c11 atomic
built-ins to relax these barriers.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 115 ++++++++++++++++++++++++++----------
 1 file changed, 84 insertions(+), 31 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 5d35f8a..3bae7d6 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -265,7 +265,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
 	rte_service_count++;
 
 	if (id_ptr)
@@ -282,7 +281,6 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
@@ -301,12 +299,17 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* comp_runstate acts as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in
+	 * service_run and service_runstate_get functions.
+	 */
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -316,12 +319,17 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* app_runstate acts as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in
+	 * service_run and service_runstate_get functions.
+	 */
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -330,15 +338,24 @@ rte_service_runstate_get(uint32_t id)
 {
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
-	rte_smp_rmb();
 
-	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING &&
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
+		int check_disabled = !(s->internal_flags &
+					SERVICE_F_START_CHECK);
+		int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
 					    __ATOMIC_RELAXED) > 0);
 
-	return (s->app_runstate == RUNSTATE_RUNNING) &&
-		(s->comp_runstate == RUNSTATE_RUNNING) &&
-		(check_disabled | lcore_mapped);
+		return (check_disabled | lcore_mapped);
+	} else
+		return 0;
+
 }
 
 static inline void
@@ -367,9 +384,15 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	if (!s)
 		return -EINVAL;
 
-	if (s->comp_runstate != RUNSTATE_RUNNING ||
-			s->app_runstate != RUNSTATE_RUNNING ||
-			!(service_mask & (UINT64_C(1) << i))) {
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		!(service_mask & (UINT64_C(1) << i))) {
 		cs->service_active_on_lcore[i] = 0;
 		return -ENOEXEC;
 	}
@@ -434,7 +457,12 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (cs->runstate == RUNSTATE_RUNNING) {
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	while (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -445,8 +473,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -614,15 +640,18 @@ rte_service_lcore_reset_all(void)
 		if (lcore_states[i].is_service_core) {
 			lcore_states[i].service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
-			lcore_states[i].runstate = RUNSTATE_STOPPED;
+			/* runstate acts as the guard variable. Use
+			 * store-release memory order here to synchronize
+			 * with load-acquire in runstate read functions.
+			 */
+			__atomic_store_n(&lcore_states[i].runstate,
+				RUNSTATE_STOPPED, __ATOMIC_RELEASE);
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
 		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
 				    __ATOMIC_RELAXED);
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -638,9 +667,11 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
@@ -655,7 +686,12 @@ rte_service_lcore_del(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate != RUNSTATE_STOPPED)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_STOPPED)
 		return -EBUSY;
 
 	set_lcore_state(lcore, ROLE_RTE);
@@ -674,13 +710,21 @@ rte_service_lcore_start(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate == RUNSTATE_RUNNING)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING)
 		return -EALREADY;
 
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	cs->runstate = RUNSTATE_RUNNING;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&cs->runstate, RUNSTATE_RUNNING, __ATOMIC_RELEASE);
 
 	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
@@ -693,7 +737,12 @@ rte_service_lcore_stop(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	if (lcore_states[lcore].runstate == RUNSTATE_STOPPED)
+	/* runstate act as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&lcore_states[lcore].runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
@@ -713,7 +762,11 @@ rte_service_lcore_stop(uint32_t lcore)
 			return -EBUSY;
 	}
 
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return 0;
 }
-- 
2.7.4



* [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib
  2020-05-06 10:24           ` [dpdk-dev] [PATCH v5 0/6] use c11 atomics for service core lib Phil Yang
                               ` (5 preceding siblings ...)
  2020-05-06 10:24             ` [dpdk-dev] [PATCH v5 6/6] service: relax barriers with C11 atomics Phil Yang
@ 2020-05-06 15:27             ` Phil Yang
  2020-05-06 15:27               ` [dpdk-dev] [PATCH v6 1/6] service: fix race condition for MT unsafe service Phil Yang
                                 ` (6 more replies)
  6 siblings, 7 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 15:27 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

The rte_atomic ops and rte_smp barriers enforce DMB barriers on aarch64.
Using c11 atomics with explicit memory ordering instead of the rte_atomic
ops and rte_smp barriers for inter-thread synchronization can improve
performance on aarch64 with no performance loss on x86.

This patchset contains:
1) fix race condition for MT unsafe service.
2) clean up redundant code.
3) use c11 atomics for service core lib to avoid unnecessary barriers.

v2:
Still waiting on Harry for the final solution on the MT unsafe race
condition issue. But I have incorporated the comments so far.
1. add 'Fixes' tag for bug-fix patches.
2. remove 'Fixes' tag for code cleanup patches.
3. remove unused parameter for service_dump_one function.
4. replace the execute_lock atomic CAS operation with rte_spinlock_trylock.
5. use c11 atomics with RELAXED memory ordering for num_mapped_cores.
6. relax barriers for guard variables runstate, comp_runstate and
   app_runstate with c11 one-way barriers.

v3:
Sending this version since Phil is on holiday.
1. Updated the API documentation to indicate how the locking
   can be avoided.

v4:
1. Fix the nits in 2/6 commit message and comments in code.

v5:
1. Remove the redundant libatomic dependency declaration for clang, as it
has been added globally (commit da4eae278b56).

v6:
1. Fix coding style issue. Remove the invalid Change-Id tag in patch 5/6.

Honnappa Nagarahalli (2):
  service: fix race condition for MT unsafe service
  service: fix identification of service running on other lcore

Phil Yang (4):
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 atomics
  service: relax barriers with C11 atomics

 lib/librte_eal/common/rte_service.c            | 234 +++++++++++++------------
 lib/librte_eal/include/rte_service.h           |   8 +-
 lib/librte_eal/include/rte_service_component.h |   6 +-
 3 files changed, 137 insertions(+), 111 deletions(-)

-- 
2.7.4



* [dpdk-dev] [PATCH v6 1/6] service: fix race condition for MT unsafe service
  2020-05-06 15:27             ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib Phil Yang
@ 2020-05-06 15:27               ` Phil Yang
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 2/6] service: fix identification of service running on other lcore Phil Yang
                                 ` (5 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 15:27 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd, Honnappa Nagarahalli, stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

An MT unsafe service might be configured to run on another core
while it is already running. This can result in the MT unsafe
service running on multiple cores simultaneously. Always use
'execute_lock' when the service is MT unsafe.

If the service is known to be mapped to a single lcore,
setting the service capability to MT safe will avoid taking
the lock and improve the performance.
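
For illustration only, the serialization rule described above can be
sketched outside of DPDK as follows. This is a minimal sketch, not the
code in this patch: struct demo_service, demo_callback and
demo_service_run are hypothetical names, and the GCC/Clang __atomic
compare-exchange built-in stands in for the execute_lock operation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a service; not the DPDK structure. */
struct demo_service {
	uint32_t execute_lock; /* 0 = free, 1 = held */
	bool mt_safe;          /* callback may run on several cores at once */
	uint64_t calls;        /* incremented by the callback */
};

static void
demo_callback(struct demo_service *s)
{
	s->calls++;
}

/* Run the callback once. An MT unsafe service always takes the lock,
 * regardless of how many cores are currently mapped to it.
 */
static int
demo_service_run(struct demo_service *s)
{
	if (!s->mt_safe) {
		uint32_t expected = 0;

		/* try-lock: fail with -EBUSY instead of spinning */
		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected,
				1, false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return -EBUSY;

		demo_callback(s);
		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
	} else {
		demo_callback(s);
	}

	return 0;
}
```

In this sketch, a service marked MT safe skips the lock entirely, which
is the performance trade-off the second paragraph refers to.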

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c            | 11 +++++------
 lib/librte_eal/include/rte_service.h           |  8 ++++++--
 lib/librte_eal/include/rte_service_component.h |  6 +++++-
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 70d17a5..b8c465e 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
diff --git a/lib/librte_eal/include/rte_service.h b/lib/librte_eal/include/rte_service.h
index d8701dd..3a1c735 100644
--- a/lib/librte_eal/include/rte_service.h
+++ b/lib/librte_eal/include/rte_service.h
@@ -104,12 +104,16 @@ int32_t rte_service_probe_capability(uint32_t id, uint32_t capability);
  * Each core can be added or removed from running a specific service. This
  * function enables or disables *lcore* to run *service_id*.
  *
- * If multiple cores are enabled on a service, an atomic is used to ensure that
- * only one cores runs the service at a time. The exception to this is when
+ * If multiple cores are enabled on a service, a lock is used to ensure that
+ * only one core runs the service at a time. The exception to this is when
  * a service indicates that it is multi-thread safe by setting the capability
  * called RTE_SERVICE_CAP_MT_SAFE. With the multi-thread safe capability set,
  * the service function can be run on multiple threads at the same time.
  *
+ * If the service is known to be mapped to a single lcore, setting the
+ * capability of the service to RTE_SERVICE_CAP_MT_SAFE can achieve
+ * better performance by avoiding the use of lock.
+ *
  * @param service_id the service to apply the lcore to
  * @param lcore The lcore that will be mapped to service
  * @param enable Zero to unmap or disable the core, non-zero to enable
diff --git a/lib/librte_eal/include/rte_service_component.h b/lib/librte_eal/include/rte_service_component.h
index 16eab79..b75aba1 100644
--- a/lib/librte_eal/include/rte_service_component.h
+++ b/lib/librte_eal/include/rte_service_component.h
@@ -43,7 +43,7 @@ struct rte_service_spec {
 /**
  * Register a new service.
  *
- * A service represents a component that the requires CPU time periodically to
+ * A service represents a component that requires CPU time periodically to
  * achieve its purpose.
  *
  * For example the eventdev SW PMD requires CPU cycles to perform its
@@ -56,6 +56,10 @@ struct rte_service_spec {
  * *rte_service_component_runstate_set*, which indicates that the service
  * component is ready to be executed.
  *
+ * If the service is known to be mapped to a single lcore, setting the
+ * capability of the service to RTE_SERVICE_CAP_MT_SAFE can achieve
+ * better performance.
+ *
  * @param spec The specification of the service to register
  * @param[out] service_id A pointer to a uint32_t, which will be filled in
  *             during registration of the service. It is set to the integers
-- 
2.7.4



* [dpdk-dev] [PATCH v6 2/6] service: fix identification of service running on other lcore
  2020-05-06 15:27             ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib Phil Yang
  2020-05-06 15:27               ` [dpdk-dev] [PATCH v6 1/6] service: fix race condition for MT unsafe service Phil Yang
@ 2020-05-06 15:28               ` Phil Yang
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 3/6] service: remove rte prefix from static functions Phil Yang
                                 ` (4 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 15:28 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd, Honnappa Nagarahalli, stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters and reduce the number of
instructions for all cases.

Cc: stable@dpdk.org
Fixes: 8d39d3e237c2 ("service: fix race in service on app lcore function")

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b8c465e..c283408 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -360,7 +360,7 @@ rte_service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -374,7 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to reflect that this core is
+	 * now mapped capable of running the service.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -449,7 +439,7 @@ rte_service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.7.4



* [dpdk-dev] [PATCH v6 3/6] service: remove rte prefix from static functions
  2020-05-06 15:27             ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib Phil Yang
  2020-05-06 15:27               ` [dpdk-dev] [PATCH v6 1/6] service: fix race condition for MT unsafe service Phil Yang
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 2/6] service: fix identification of service running on other lcore Phil Yang
@ 2020-05-06 15:28               ` Phil Yang
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 4/6] service: remove redundant code Phil Yang
                                 ` (3 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 15:28 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

Remove the rte prefix from static functions.
Remove the unused parameter of the service_dump_one function.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 34 +++++++++++-----------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index c283408..62ea9cb 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -340,7 +340,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -378,10 +378,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -425,14 +425,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (cs->runstate == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -693,9 +693,9 @@ rte_service_lcore_start(uint32_t lcore)
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
+	cs->runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -774,13 +774,9 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
-		     uint64_t all_cycles, uint32_t reset)
+service_dump_one(FILE *f, struct rte_service_spec_impl *s, uint32_t reset)
 {
 	/* avoid divide by zero */
-	if (all_cycles == 0)
-		all_cycles = 1;
-
 	int calls = 1;
 	if (s->calls != 0)
 		calls = s->calls;
@@ -807,7 +803,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, reset);
 	return 0;
 }
 
@@ -851,21 +847,13 @@ rte_service_dump(FILE *f, uint32_t id)
 	uint32_t i;
 	int print_one = (id != UINT32_MAX);
 
-	uint64_t total_cycles = 0;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if (!service_valid(i))
-			continue;
-		total_cycles += rte_services[i].cycles_spent;
-	}
-
 	/* print only the specified service */
 	if (print_one) {
 		struct rte_service_spec_impl *s;
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, reset);
 		return 0;
 	}
 
@@ -875,7 +863,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.7.4



* [dpdk-dev] [PATCH v6 4/6] service: remove redundant code
  2020-05-06 15:27             ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib Phil Yang
                                 ` (2 preceding siblings ...)
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 3/6] service: remove rte prefix from static functions Phil Yang
@ 2020-05-06 15:28               ` Phil Yang
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 5/6] service: optimize with c11 atomics Phil Yang
                                 ` (2 subsequent siblings)
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 15:28 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

The service ID validation is duplicated. Remove the redundant code
in the calling functions.
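
As a rough illustration of validating the ID once, inside the helper
that takes the numeric ID, consider the sketch below. It is not the
DPDK code: demo_service_update, the demo_* arrays and the DEMO_* limits
are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_SERVICE_NUM_MAX 64
#define DEMO_MAX_LCORE 128

/* Hypothetical per-lcore service masks and per-service valid flags. */
static uint64_t demo_service_mask[DEMO_MAX_LCORE];
static uint8_t demo_service_valid[DEMO_SERVICE_NUM_MAX];

/* Map or unmap service 'sid' on 'lcore'. All argument checks live
 * here, so callers do not need to look the service up beforehand.
 */
static int32_t
demo_service_update(uint32_t sid, uint32_t lcore, uint32_t enable)
{
	if (sid >= DEMO_SERVICE_NUM_MAX || !demo_service_valid[sid] ||
	    lcore >= DEMO_MAX_LCORE)
		return -EINVAL;

	uint64_t sid_mask = UINT64_C(1) << sid;

	if (enable)
		demo_service_mask[lcore] |= sid_mask;
	else
		demo_service_mask[lcore] &= ~sid_mask;

	return 0;
}
```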

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 62ea9cb..37c16c4 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -541,24 +541,12 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
-		return -EINVAL;
-
-	if (!lcore_states[lcore].is_service_core)
+	/* validate ID, or return error value */
+	if (sid >= RTE_SERVICE_NUM_MAX || !service_valid(sid) ||
+	    lcore >= RTE_MAX_LCORE || !lcore_states[lcore].is_service_core)
 		return -EINVAL;
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
@@ -587,19 +575,15 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.7.4



* [dpdk-dev] [PATCH v6 5/6] service: optimize with c11 atomics
  2020-05-06 15:27             ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib Phil Yang
                                 ` (3 preceding siblings ...)
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 4/6] service: remove redundant code Phil Yang
@ 2020-05-06 15:28               ` Phil Yang
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 6/6] service: relax barriers with C11 atomics Phil Yang
  2020-05-11 11:21               ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib David Marchand
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 15:28 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

The num_mapped_cores is used as a statistic. Use c11 atomics with
RELAXED ordering for num_mapped_cores instead of rte_atomic ops which
enforce unnecessary barriers on aarch64.

Replace the execute_lock operations with rte_spinlock_trylock to avoid
duplicate code.
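
A minimal sketch of a relaxed counter, assuming GCC/Clang __atomic
built-ins. The demo_* names are hypothetical. The updates are atomic,
but no ordering is imposed on surrounding memory accesses, which is
sufficient when the value is only read as a count.

```c
#include <stdint.h>

/* Hypothetical counter of cores mapped to one service. */
static uint32_t demo_num_mapped_cores;

/* Atomic increment with no ordering constraint; returns the new value. */
static uint32_t
demo_map_core(void)
{
	return __atomic_add_fetch(&demo_num_mapped_cores, 1,
			__ATOMIC_RELAXED);
}

/* Atomic decrement with no ordering constraint; returns the new value. */
static uint32_t
demo_unmap_core(void)
{
	return __atomic_sub_fetch(&demo_num_mapped_cores, 1,
			__ATOMIC_RELAXED);
}

/* Relaxed read: only the value matters, not what was written before it. */
static int
demo_is_mapped(void)
{
	return __atomic_load_n(&demo_num_mapped_cores,
			__ATOMIC_RELAXED) > 0;
}
```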

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 37c16c4..5d35f8a 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -20,6 +20,7 @@
 #include <rte_atomic.h>
 #include <rte_memory.h>
 #include <rte_malloc.h>
+#include <rte_spinlock.h>
 
 #include "eal_private.h"
 
@@ -38,11 +39,11 @@ struct rte_service_spec_impl {
 	/* public part of the struct */
 	struct rte_service_spec spec;
 
-	/* atomic lock that when set indicates a service core is currently
+	/* spin lock that when set indicates a service core is currently
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	rte_spinlock_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +55,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	uint32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -332,7 +333,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -375,11 +377,11 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		if (!rte_spinlock_trylock(&s->execute_lock))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		rte_spinlock_unlock(&s->execute_lock);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -415,11 +417,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to reflect that this core is
 	 * now mapped capable of running the service.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELAXED);
 
 	return ret;
 }
@@ -556,19 +558,19 @@ service_update(uint32_t sid, uint32_t lcore,
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELAXED);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -616,7 +618,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -699,7 +702,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
-- 
2.7.4



* [dpdk-dev] [PATCH v6 6/6] service: relax barriers with C11 atomics
  2020-05-06 15:27             ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib Phil Yang
                                 ` (4 preceding siblings ...)
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 5/6] service: optimize with c11 atomics Phil Yang
@ 2020-05-06 15:28               ` Phil Yang
  2020-05-11 11:21               ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib David Marchand
  6 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-06 15:28 UTC (permalink / raw)
  To: dev, harry.van.haaren
  Cc: thomas, david.marchand, konstantin.ananyev, jerinj,
	hemant.agrawal, gage.eads, bruce.richardson,
	Honnappa.Nagarahalli, nd

The runstate, comp_runstate and app_runstate are used as guard variables
in the service core lib. To guarantee the inter-thread visibility of
these guard variables, the lib uses rte_smp_rmb/rte_smp_wmb. This patch
uses c11 atomic built-ins to relax these barriers.
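
The guard variable pattern can be sketched as below. This is an
illustration under assumptions, not the DPDK code: struct demo_state,
demo_start and demo_poll are hypothetical names. The writer fills in
the payload and then publishes it with a store-release on the guard.
The reader does a load-acquire on the guard before touching the
payload.

```c
#include <stdint.h>

enum { DEMO_STOPPED = 0, DEMO_RUNNING = 1 };

/* Hypothetical per-core state; not the DPDK core_state structure. */
struct demo_state {
	uint64_t service_mask; /* payload, written before the guard */
	int8_t runstate;       /* guard variable */
};

/* Writer: the store-release keeps the payload write ahead of the guard. */
static void
demo_start(struct demo_state *st, uint64_t mask)
{
	st->service_mask = mask;
	__atomic_store_n(&st->runstate, DEMO_RUNNING, __ATOMIC_RELEASE);
}

/* Reader: the load-acquire keeps the payload read behind the guard.
 * Returns 1 and fills *mask_out if running, 0 otherwise.
 */
static int
demo_poll(struct demo_state *st, uint64_t *mask_out)
{
	if (__atomic_load_n(&st->runstate, __ATOMIC_ACQUIRE) != DEMO_RUNNING)
		return 0;

	*mask_out = st->service_mask;
	return 1;
}
```

Only one-way ordering is required on each side in this sketch, which is
why a full barrier after the store, or before the load, is not needed.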

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>
---
 lib/librte_eal/common/rte_service.c | 115 ++++++++++++++++++++++++++----------
 1 file changed, 84 insertions(+), 31 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 5d35f8a..3bae7d6 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -265,7 +265,6 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
 	rte_service_count++;
 
 	if (id_ptr)
@@ -282,7 +281,6 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
 	s->internal_flags &= ~(SERVICE_F_REGISTERED);
 
@@ -301,12 +299,17 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* comp_runstate act as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in
+	 * service_run and service_runstate_get function.
+	 */
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -316,12 +319,17 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
+	/* app_runstate act as the guard variable. Use store-release
+	 * memory order. This synchronizes with load-acquire in
+	 * service_run and rte_service_runstate_get functions.
+	 */
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -330,15 +338,24 @@ rte_service_runstate_get(uint32_t id)
 {
 	struct rte_service_spec_impl *s;
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
-	rte_smp_rmb();
 
-	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING &&
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
+		int check_disabled = !(s->internal_flags &
+					SERVICE_F_START_CHECK);
+		int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
 					    __ATOMIC_RELAXED) > 0);
 
-	return (s->app_runstate == RUNSTATE_RUNNING) &&
-		(s->comp_runstate == RUNSTATE_RUNNING) &&
-		(check_disabled | lcore_mapped);
+		return (check_disabled | lcore_mapped);
+	} else
+		return 0;
+
 }
 
 static inline void
@@ -367,9 +384,15 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	if (!s)
 		return -EINVAL;
 
-	if (s->comp_runstate != RUNSTATE_RUNNING ||
-			s->app_runstate != RUNSTATE_RUNNING ||
-			!(service_mask & (UINT64_C(1) << i))) {
+	/* comp_runstate and app_runstate act as the guard variables.
+	 * Use load-acquire memory order. This synchronizes with
+	 * store-release in service state set functions.
+	 */
+	if (__atomic_load_n(&s->comp_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		 __atomic_load_n(&s->app_runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_RUNNING ||
+		!(service_mask & (UINT64_C(1) << i))) {
 		cs->service_active_on_lcore[i] = 0;
 		return -ENOEXEC;
 	}
@@ -434,7 +457,12 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (cs->runstate == RUNSTATE_RUNNING) {
+	/* runstate act as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	while (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -445,8 +473,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -614,15 +640,18 @@ rte_service_lcore_reset_all(void)
 		if (lcore_states[i].is_service_core) {
 			lcore_states[i].service_mask = 0;
 			set_lcore_state(i, ROLE_RTE);
-			lcore_states[i].runstate = RUNSTATE_STOPPED;
+			/* runstate acts as the guard variable. Use
+			 * store-release memory order here to synchronize
+			 * with load-acquire in runstate read functions.
+			 */
+			__atomic_store_n(&lcore_states[i].runstate,
+				RUNSTATE_STOPPED, __ATOMIC_RELEASE);
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
 		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
 				    __ATOMIC_RELAXED);
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -638,9 +667,11 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
@@ -655,7 +686,12 @@ rte_service_lcore_del(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate != RUNSTATE_STOPPED)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) != RUNSTATE_STOPPED)
 		return -EBUSY;
 
 	set_lcore_state(lcore, ROLE_RTE);
@@ -674,13 +710,21 @@ rte_service_lcore_start(uint32_t lcore)
 	if (!cs->is_service_core)
 		return -EINVAL;
 
-	if (cs->runstate == RUNSTATE_RUNNING)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&cs->runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_RUNNING)
 		return -EALREADY;
 
 	/* set core to run state first, and then launch otherwise it will
 	 * return immediately as runstate keeps it in the service poll loop
 	 */
-	cs->runstate = RUNSTATE_RUNNING;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&cs->runstate, RUNSTATE_RUNNING, __ATOMIC_RELEASE);
 
 	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
@@ -693,7 +737,12 @@ rte_service_lcore_stop(uint32_t lcore)
 	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
-	if (lcore_states[lcore].runstate == RUNSTATE_STOPPED)
+	/* runstate acts as the guard variable. Use load-acquire
+	 * memory order here to synchronize with store-release
+	 * in runstate update functions.
+	 */
+	if (__atomic_load_n(&lcore_states[lcore].runstate,
+			__ATOMIC_ACQUIRE) == RUNSTATE_STOPPED)
 		return -EALREADY;
 
 	uint32_t i;
@@ -713,7 +762,11 @@ rte_service_lcore_stop(uint32_t lcore)
 			return -EBUSY;
 	}
 
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
+	/* Use store-release memory order here to synchronize with
+	 * load-acquire in runstate read functions.
+	 */
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+		__ATOMIC_RELEASE);
 
 	return 0;
 }
-- 
2.7.4
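
The guard-variable pattern used throughout the patch above can be sketched
in isolation as follows. This is only a minimal illustration, not the
service core code: struct core_state_sketch, core_start() and
core_poll_once() are hypothetical names. The writer fills in the payload
and then publishes it with a store-release on the guard; the reader checks
the guard with a load-acquire before touching the payload.

```c
#include <stdint.h>

enum { RUNSTATE_STOPPED = 0, RUNSTATE_RUNNING = 1 };

/* Hypothetical, simplified stand-in for the per-lcore state. */
struct core_state_sketch {
	uint64_t service_mask;	/* payload, written before the guard */
	uint8_t runstate;	/* guard variable */
};

/* Writer: set the payload, then publish it with store-release. */
static inline void
core_start(struct core_state_sketch *cs, uint64_t mask)
{
	cs->service_mask = mask;
	__atomic_store_n(&cs->runstate, RUNSTATE_RUNNING, __ATOMIC_RELEASE);
}

/* Reader: load-acquire the guard; read the payload only if it is set.
 * Returns 1 and fills *mask_out when the core is running, 0 otherwise.
 */
static inline int
core_poll_once(struct core_state_sketch *cs, uint64_t *mask_out)
{
	if (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) !=
			RUNSTATE_RUNNING)
		return 0;
	*mask_out = cs->service_mask;
	return 1;
}
```

Because the acquire load pairs with the release store, a reader that
observes RUNSTATE_RUNNING also observes the mask written before it, and no
separate rte_smp_wmb()/rte_smp_rmb() is needed.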



* Re: [dpdk-dev] [PATCH v3] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-04-24 11:17           ` Ananyev, Konstantin
@ 2020-05-09 21:51             ` Akhil Goyal
  0 siblings, 0 replies; 219+ messages in thread
From: Akhil Goyal @ 2020-05-09 21:51 UTC (permalink / raw)
  To: Ananyev, Konstantin, Phil Yang, dev
  Cc: thomas, jerinj, Iremonger, Bernard, Medvedkin, Vladimir,
	Honnappa.Nagarahalli, gavin.hu, ruifeng.wang, nd


> >
> > For SA outbound packets, rte_atomic64_add_return is used to generate
> > SQN atomically. Use c11 atomics with RELAXED ordering for outbound SQN
> > update instead of rte_atomic ops which enforce unnecessary barriers on
> > aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > ---
> 
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 
Applied to dpdk-next-crypto

Thanks.



* Re: [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib
  2020-05-06 15:27             ` [dpdk-dev] [PATCH v6 0/6] use c11 atomics for service core lib Phil Yang
                                 ` (5 preceding siblings ...)
  2020-05-06 15:28               ` [dpdk-dev] [PATCH v6 6/6] service: relax barriers with C11 atomics Phil Yang
@ 2020-05-11 11:21               ` David Marchand
  6 siblings, 0 replies; 219+ messages in thread
From: David Marchand @ 2020-05-11 11:21 UTC (permalink / raw)
  To: Phil Yang
  Cc: dev, Van Haaren Harry, Thomas Monjalon, Ananyev, Konstantin,
	Jerin Jacob Kollanukkaran, Hemant Agrawal, Gage Eads,
	Bruce Richardson, Honnappa Nagarahalli, nd

On Wed, May 6, 2020 at 5:28 PM Phil Yang <phil.yang@arm.com> wrote:
>
> The rte_atomic ops and rte_smp barriers enforce DMB barriers on aarch64.
> Using c11 atomics with explicit memory ordering instead of the rte_atomic
> ops and rte_smp barriers for inter-threads synchronization can uplift the
> performance on aarch64 and no performance loss on x86.
>
> This patchset contains:
> 1) fix race condition for MT unsafe service.
> 2) clean up redundant code.
> 3) use c11 atomics for service core lib to avoid unnecessary barriers.

Series applied, thanks for the cleanup and fixes.


-- 
David Marchand



* [dpdk-dev] [PATCH v4 0/4] generic rte atomic APIs deprecate proposal
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (13 preceding siblings ...)
  2020-04-03  7:23     ` Mattias Rönnblom
@ 2020-05-12  8:03     ` " Phil Yang
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 1/4] doc: add generic atomic deprecation section Phil Yang
                         ` (5 more replies)
  14 siblings, 6 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-12  8:03 UTC (permalink / raw)
  To: thomas, dev
  Cc: bruce.richardson, ferruh.yigit, hemant.agrawal,
	honnappa.nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, stephen, mb, mattias.ronnblom,
	harry.van.haaren, erik.g.carrillo, phil.yang, nd

DPDK provides generic rte_atomic APIs to do several atomic operations.
These APIs are using the deprecated __sync built-ins and enforce full
memory barriers on aarch64. However, full barriers are not necessary
in many use cases. In order to address such use cases, C language offers
C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
by making use of the memory ordering parameter provided by the user.
Various patches submitted in the past [2] and the patches in this series
indicate significant performance gains on multiple aarch64 CPUs and no
performance loss on x86.

But the existing rte_atomic API implementations cannot be changed as the
APIs do not take the memory ordering parameter. The only choice available
is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
order to make this change, the following steps are proposed:

[1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
APIs (a script is added to flag the usages).
[2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
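
To illustrate why the ordering parameter matters, here is a small sketch
contrasting the two styles. The function names are made up for this
example; only the __sync and __atomic built-ins themselves are real
compiler interfaces.

```c
#include <stdint.h>

/* Legacy style: no ordering parameter. The __sync built-ins are
 * documented as full barriers, so the caller cannot ask for less.
 */
static inline uint64_t
counter_inc_full_barrier(uint64_t *c)
{
	return __sync_add_and_fetch(c, 1);
}

/* C11 style: the caller states the ordering it needs. A statistics
 * counter that nothing else depends on only needs atomicity.
 */
static inline uint64_t
counter_inc_relaxed(uint64_t *c)
{
	return __atomic_add_fetch(c, 1, __ATOMIC_RELAXED);
}
```

Both functions return the incremented value; they differ only in the
ordering constraints placed on the surrounding memory accesses.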

This patchset contains:
1) changes to programmer guide describing writing efficient code for aarch64.
2) the checkpatch script changes to flag rte_atomicNN_xxx API usage in patches.
3) wraps up compiler __atomic built-ins with explicit memory ordering parameter.

v4:
1. add a description of the reader-writer concurrency case.
2. claim maintainership of c11 atomics code for each platform.
3. flag rte_atomicNN_xxx in new patches for modules that have been converted to
c11 style.
4. flag __sync_xxx built-ins in new patches.
5. wrap up compiler atomic built-ins.
6. move the changes to libraries which make use of c11 atomic APIs out of this
patchset.

v3:
add libatomic dependency for 32-bit clang

v2:
1. fix Clang '-Wincompatible-pointer-types' WARNING.
2. fix typos.


Phil Yang (4):
  doc: add generic atomic deprecation section
  maintainers: claim maintainers of c11 atomics code
  devtools: prevent use of rte atomic APIs in future patches
  eal/atomic: add wrapper for c11 atomics

 MAINTAINERS                                      |   4 +
 devtools/checkpatches.sh                         |  23 ++++
 doc/guides/prog_guide/writing_efficient_code.rst | 139 ++++++++++++++++++++++-
 lib/librte_eal/include/generic/rte_atomic_c11.h  | 139 +++++++++++++++++++++++
 lib/librte_eal/include/meson.build               |   1 +
 5 files changed, 305 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_eal/include/generic/rte_atomic_c11.h

-- 
2.7.4



* [dpdk-dev] [PATCH v4 1/4] doc: add generic atomic deprecation section
  2020-05-12  8:03     ` [dpdk-dev] [PATCH v4 0/4] " Phil Yang
@ 2020-05-12  8:03       ` Phil Yang
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 2/4] maintainers: claim maintainers of c11 atomics code Phil Yang
                         ` (4 subsequent siblings)
  5 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-12  8:03 UTC (permalink / raw)
  To: thomas, dev
  Cc: bruce.richardson, ferruh.yigit, hemant.agrawal,
	honnappa.nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, stephen, mb, mattias.ronnblom,
	harry.van.haaren, erik.g.carrillo, phil.yang, nd

Document the deprecation of the generic rte_atomic_xx APIs in favor of
c11 atomic built-ins, with guidance and examples.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 139 ++++++++++++++++++++++-
 1 file changed, 138 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..3bd2601 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,137 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone; the ordering of the memory operations
+does not matter. An example is the packet statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+It only updates the number of transmitted packets, and no subsequent logic depends
+on these counters, so RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+        ...
+        ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
+            ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move into the
+critical section, but the memory operations in the critical section cannot move
+above the lock. In this case, the full memory barrier in the CAS operation can
+be replaced with ACQUIRE. On the other hand, the memory operations after the
+`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move into the critical section, but the memory operations in the
+critical section cannot move below the unlock. So the full barrier in the STORE
+operation can be replaced with RELEASE.
+
+Reader-Writer Concurrency
+^^^^^^^^^^^^^^^^^^^^^^^^^
+Lock-free reader-writer concurrency is one of the common use cases in DPDK.
+
+The payload, or the data that the writer wants to communicate to the reader,
+can be written with RELAXED memory order. However, the guard variable should
+be written with RELEASE memory order. This ensures that the store to the guard
+variable is observable only after the store to the payload is observable.
+Refer to `rte_hash insert <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L737>`_ for an example.
+
+.. code-block:: c
+
+    static inline int32_t
+    rte_hash_cuckoo_insert_mw(const struct rte_hash *h,
+        ...
+        int32_t *ret_val)
+    {
+        ...
+        ...
+
+        /* Insert new entry if there is room in the primary
+         * bucket.
+         */
+        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+                /* Check if slot is available */
+                if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
+                        prim_bkt->sig_current[i] = sig;
+                        /* Store to signature and key should not
+                         * leak after the store to key_idx. i.e.
+                         * key_idx is the guard variable for signature
+                         * and key.
+                         */
+                        __atomic_store_n(&prim_bkt->key_idx[i],
+                                         new_idx,
+                                         __ATOMIC_RELEASE);
+                        break;
+                }
+        }
+
+        ...
+    }
+
+Correspondingly, on the reader side, the guard variable should be read
+with ACQUIRE memory order. The payload, or the data the writer communicated,
+can be read with RELAXED memory order. This ensures that, if the store to the
+guard variable is observable, the store to the payload is also observable. Refer to `rte_hash lookup <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_hash/rte_cuckoo_hash.c#L1215>`_ for an example.
+
+.. code-block:: c
+
+    static inline int32_t
+    search_one_bucket_lf(const struct rte_hash *h, const void *key, uint16_t sig,
+        void **data, const struct rte_hash_bucket *bkt)
+    {
+        ...
+
+        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
+            ....
+            if (bkt->sig_current[i] == sig) {
+                key_idx = __atomic_load_n(&bkt->key_idx[i],
+                                        __ATOMIC_ACQUIRE);
+                if (key_idx != EMPTY_SLOT) {
+                    k = (struct rte_hash_key *) ((char *)keys +
+                        key_idx * h->key_entry_size);
+
+                    if (rte_hash_cmp_eq(key, k->key, h) == 0) {
+                        if (data != NULL) {
+                            *data = __atomic_load_n(&k->pdata,
+                                            __ATOMIC_ACQUIRE);
+                        }
+
+                        /*
+                         * Return index where key is stored,
+                         * subtracting the first dummy index
+                         */
+                        return key_idx - 1;
+                    }
+            ...
+    }
+
 Coding Considerations
 ---------------------
 
-- 
2.7.4
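
The "One-way Barrier" section in the patch above describes the spinlock
case in prose only. A minimal sketch of that idea is given below. It is
not the rte_spinlock implementation; struct sketch_spinlock and the
sketch_spinlock_*() helpers are hypothetical names, and only the __atomic
built-ins are real compiler interfaces.

```c
#include <stdint.h>

/* Hypothetical minimal spinlock: 0 = free, 1 = held. */
struct sketch_spinlock {
	int locked;
};

/* Try once. ACQUIRE on success keeps the critical section from moving
 * above the lock; earlier accesses may still move into it.
 */
static inline int
sketch_spinlock_trylock(struct sketch_spinlock *sl)
{
	int exp = 0;

	return __atomic_compare_exchange_n(&sl->locked, &exp, 1, 0,
			__ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static inline void
sketch_spinlock_lock(struct sketch_spinlock *sl)
{
	while (!sketch_spinlock_trylock(sl)) {
		/* Spin with relaxed loads until the lock looks free. */
		while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
			;
	}
}

/* RELEASE keeps the critical section from moving below the unlock;
 * later accesses may still move into it.
 */
static inline void
sketch_spinlock_unlock(struct sketch_spinlock *sl)
{
	__atomic_store_n(&sl->locked, 0, __ATOMIC_RELEASE);
}
```

Neither operation needs a full barrier: the CAS only has to order what
follows it, and the store only has to order what precedes it.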



* [dpdk-dev] [PATCH v4 2/4] maintainers: claim maintainers of c11 atomics code
  2020-05-12  8:03     ` [dpdk-dev] [PATCH v4 0/4] " Phil Yang
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 1/4] doc: add generic atomic deprecation section Phil Yang
@ 2020-05-12  8:03       ` Phil Yang
  2020-05-24 23:11         ` Thomas Monjalon
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 3/4] devtools: prevent use of rte atomic APIs in future patches Phil Yang
                         ` (3 subsequent siblings)
  5 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-05-12  8:03 UTC (permalink / raw)
  To: thomas, dev
  Cc: bruce.richardson, ferruh.yigit, hemant.agrawal,
	honnappa.nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, stephen, mb, mattias.ronnblom,
	harry.van.haaren, erik.g.carrillo, phil.yang, nd

Add the maintainership of c11 atomics code.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 MAINTAINERS | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 6a14622..4435ae5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -266,6 +266,10 @@ F: lib/librte_eal/include/rte_random.h
 F: lib/librte_eal/common/rte_random.c
 F: app/test/test_rand_perf.c
 
+C11 Code Maintainer
+M: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
+M: David Christensen <drc@linux.vnet.ibm.com>
+
 ARM v7
 M: Jan Viktorin <viktorin@rehivetech.com>
 M: Ruifeng Wang <ruifeng.wang@arm.com>
-- 
2.7.4



* [dpdk-dev] [PATCH v4 3/4] devtools: prevent use of rte atomic APIs in future patches
  2020-05-12  8:03     ` [dpdk-dev] [PATCH v4 0/4] " Phil Yang
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 1/4] doc: add generic atomic deprecation section Phil Yang
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 2/4] maintainers: claim maintainers of c11 atomics code Phil Yang
@ 2020-05-12  8:03       ` Phil Yang
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics Phil Yang
                         ` (2 subsequent siblings)
  5 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-12  8:03 UTC (permalink / raw)
  To: thomas, dev
  Cc: bruce.richardson, ferruh.yigit, hemant.agrawal,
	honnappa.nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, stephen, mb, mattias.ronnblom,
	harry.van.haaren, erik.g.carrillo, phil.yang, nd

In order to deprecate the rte_atomic APIs, prevent new patches from
using rte_atomic APIs in the converted modules and the compiler's __sync
built-ins in all modules.

The converted modules:
lib/librte_distributor
lib/librte_hash
lib/librte_kni
lib/librte_lpm
lib/librte_rcu
lib/librte_ring
lib/librte_stack
lib/librte_vhost
lib/librte_timer
lib/librte_ipsec
drivers/event/octeontx
drivers/event/octeontx2
drivers/event/opdl
drivers/net/bnx2x
drivers/net/hinic
drivers/net/hns3
drivers/net/memif
drivers/net/thunderx
drivers/net/virtio
examples/l2fwd-event

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 devtools/checkpatches.sh | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/devtools/checkpatches.sh b/devtools/checkpatches.sh
index 42b833e..002586d 100755
--- a/devtools/checkpatches.sh
+++ b/devtools/checkpatches.sh
@@ -69,6 +69,29 @@ check_forbidden_additions() { # <patch>
 		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
 		"$1" || res=1
 
+	# refrain from new additions of 16/32/64 bits rte_atomic_xxx()
+	# multiple folders and expressions are separated by spaces
+	awk -v FOLDERS="lib/librte_distributor lib/librte_hash lib/librte_kni
+			lib/librte_lpm lib/librte_rcu lib/librte_ring
+			lib/librte_stack lib/librte_vhost drivers/event/octeontx
+			drivers/event/octeontx2 drivers/event/opdl
+			drivers/net/bnx2x drivers/net/hinic drivers/net/hns3
+			drivers/net/memif drivers/net/thunderx
+			drivers/net/virtio examples/l2fwd-event" \
+		-v EXPRESSIONS="rte_atomic[0-9][0-9]_.*\\\(" \
+		-v RET_ON_FAIL=1 \
+		-v MESSAGE='Use of rte_atomicNN_xxx APIs not allowed, use rte_atomic_xxx APIs' \
+		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
+		"$1" || res=1
+
+	# refrain from using compiler __sync built-ins
+	awk -v FOLDERS="lib drivers app examples" \
+		-v EXPRESSIONS="__sync_.*\\\(" \
+		-v RET_ON_FAIL=1 \
+		-v MESSAGE='Use of __sync_xxx built-ins not allowed, use rte_atomic_xxx APIs' \
+		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
+		"$1" || res=1
+
 	# svg figures must be included with wildcard extension
 	# because of png conversion for pdf docs
 	awk -v FOLDERS='doc' \
-- 
2.7.4



* [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-12  8:03     ` [dpdk-dev] [PATCH v4 0/4] " Phil Yang
                         ` (2 preceding siblings ...)
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 3/4] devtools: prevent use of rte atomic APIs in future patches Phil Yang
@ 2020-05-12  8:03       ` Phil Yang
  2020-05-12 11:18         ` Morten Brørup
  2020-05-12 18:20         ` Stephen Hemminger
  2020-05-12  8:18       ` [dpdk-dev] [PATCH v4 0/4] generic rte atomic APIs deprecate proposal Phil Yang
  2020-05-26  9:01       ` [dpdk-dev] [PATCH v5 " Phil Yang
  5 siblings, 2 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-12  8:03 UTC (permalink / raw)
  To: thomas, dev
  Cc: bruce.richardson, ferruh.yigit, hemant.agrawal,
	honnappa.nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, stephen, mb, mattias.ronnblom,
	harry.van.haaren, erik.g.carrillo, phil.yang, nd

Wrap the compiler's c11 atomic built-ins with an explicit memory ordering
parameter.

Signed-off-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/include/generic/rte_atomic_c11.h | 139 ++++++++++++++++++++++++
 lib/librte_eal/include/meson.build              |   1 +
 2 files changed, 140 insertions(+)
 create mode 100644 lib/librte_eal/include/generic/rte_atomic_c11.h

diff --git a/lib/librte_eal/include/generic/rte_atomic_c11.h b/lib/librte_eal/include/generic/rte_atomic_c11.h
new file mode 100644
index 0000000..20490f4
--- /dev/null
+++ b/lib/librte_eal/include/generic/rte_atomic_c11.h
@@ -0,0 +1,139 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Arm Limited
+ */
+
+#ifndef _RTE_ATOMIC_C11_H_
+#define _RTE_ATOMIC_C11_H_
+
+#include <rte_common.h>
+
+/**
+ * @file
+ * c11 atomic operations
+ *
+ * This file wraps up compiler (GCC) c11 atomic built-ins.
+ * https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
+ */
+
+#define memory_order_relaxed __ATOMIC_RELAXED
+#define memory_order_consume __ATOMIC_CONSUME
+#define memory_order_acquire __ATOMIC_ACQUIRE
+#define memory_order_release __ATOMIC_RELEASE
+#define memory_order_acq_rel __ATOMIC_ACQ_REL
+#define memory_order_seq_cst __ATOMIC_SEQ_CST
+
+/* Generic atomic load.
+ * It returns the contents of *PTR.
+ *
+ * The valid memory order variants are:
+ * memory_order_relaxed
+ * memory_order_consume
+ * memory_order_acquire
+ * memory_order_seq_cst
+ */
+#define rte_atomic_load(PTR, MO)			\
+	(__extension__ ({				\
+		typeof(PTR) _ptr = (PTR);		\
+		typeof(*_ptr) _ret;			\
+		__atomic_load(_ptr, &_ret, (MO));	\
+		_ret;					\
+	}))
+
+/* Generic atomic store.
+ * It stores the value of VAL into *PTR.
+ *
+ * The valid memory order variants are:
+ * memory_order_relaxed
+ * memory_order_release
+ * memory_order_seq_cst
+ */
+#define rte_atomic_store(PTR, VAL, MO)			\
+	(__extension__ ({				\
+		typeof(PTR) _ptr = (PTR);		\
+		typeof(*_ptr) _val = (VAL);		\
+		__atomic_store(_ptr, &_val, (MO));	\
+	}))
+
+/* Generic atomic exchange.
+ * It stores the value of VAL into *PTR.
+ * It returns the original value of *PTR.
+ *
+ * The valid memory order variants are:
+ * memory_order_relaxed
+ * memory_order_acquire
+ * memory_order_release
+ * memory_order_acq_rel
+ * memory_order_seq_cst
+ */
+#define rte_atomic_exchange(PTR, VAL, MO)			\
+	(__extension__ ({					\
+		typeof(PTR) _ptr = (PTR);			\
+		typeof(*_ptr) _val = (VAL);			\
+		typeof(*_ptr) _ret;				\
+		__atomic_exchange(_ptr, &_val, &_ret, (MO));	\
+		_ret;						\
+	}))
+
+/* Generic atomic compare and exchange.
+ * It compares the contents of *PTR with the contents of *EXP.
+ * If equal, the operation is a read-modify-write operation that
+ * writes DES into *PTR.
+ * If they are not equal, the operation is a read and the current
+ * contents of *PTR are written into *EXP.
+ *
+ * The weak compare_exchange may fail spuriously and the strong
+ * variation will never fail spuriously.
+ *
+ * If DES is written into *PTR then true is returned and memory is
+ * affected according to the memory order specified by SUC_MO.
+ * There are no restrictions on what memory order can be used here.
+ *
+ * Otherwise, false is returned and memory is affected according to
+ * FAIL_MO. This memory order cannot be memory_order_release nor
+ * memory_order_acq_rel. It also cannot be a stronger order than that
+ * specified by SUC_MO.
+ */
+#define rte_atomic_compare_exchange_weak(PTR, EXP, DES, SUC_MO, FAIL_MO)    \
+	(__extension__ ({						    \
+		typeof(PTR) _ptr = (PTR);				    \
+		typeof(*_ptr) _des = (DES);				    \
+		__atomic_compare_exchange(_ptr, (EXP), &_des, 1,	    \
+				 (SUC_MO), (FAIL_MO));			    \
+	}))
+
+#define rte_atomic_compare_exchange_strong(PTR, EXP, DES, SUC_MO, FAIL_MO)  \
+	(__extension__ ({						    \
+		typeof(PTR) _ptr = (PTR);				    \
+		typeof(*_ptr) _des = (DES);				    \
+		__atomic_compare_exchange(_ptr, (EXP), &_des, 0,	    \
+				 (SUC_MO), (FAIL_MO));			    \
+	}))
+
+#define rte_atomic_fetch_add(PTR, VAL, MO)		\
+	__atomic_fetch_add((PTR), (VAL), (MO))
+#define rte_atomic_fetch_sub(PTR, VAL, MO)		\
+	__atomic_fetch_sub((PTR), (VAL), (MO))
+#define rte_atomic_fetch_or(PTR, VAL, MO)		\
+	__atomic_fetch_or((PTR), (VAL), (MO))
+#define rte_atomic_fetch_xor(PTR, VAL, MO)		\
+	__atomic_fetch_xor((PTR), (VAL), (MO))
+#define rte_atomic_fetch_and(PTR, VAL, MO)		\
+	__atomic_fetch_and((PTR), (VAL), (MO))
+
+#define rte_atomic_add_fetch(PTR, VAL, MO)		\
+	__atomic_add_fetch((PTR), (VAL), (MO))
+#define rte_atomic_sub_fetch(PTR, VAL, MO)		\
+	__atomic_sub_fetch((PTR), (VAL), (MO))
+#define rte_atomic_or_fetch(PTR, VAL, MO)		\
+	__atomic_or_fetch((PTR), (VAL), (MO))
+#define rte_atomic_xor_fetch(PTR, VAL, MO)		\
+	__atomic_xor_fetch((PTR), (VAL), (MO))
+#define rte_atomic_and_fetch(PTR, VAL, MO)		\
+	__atomic_and_fetch((PTR), (VAL), (MO))
+
+/* Synchronization fence between threads based on
+ * the specified memory order.
+ */
+#define rte_atomic_thread_fence(MO) __atomic_thread_fence((MO))
+
+#endif /* _RTE_ATOMIC_C11_H_ */
diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
index bc73ec2..dac1aac 100644
--- a/lib/librte_eal/include/meson.build
+++ b/lib/librte_eal/include/meson.build
@@ -51,6 +51,7 @@ headers += files(
 # special case install the generic headers, since they go in a subdir
 generic_headers = files(
 	'generic/rte_atomic.h',
+	'generic/rte_atomic_c11.h',
 	'generic/rte_byteorder.h',
 	'generic/rte_cpuflags.h',
 	'generic/rte_cycles.h',
-- 
2.7.4
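
As a usage sketch for the wrappers proposed above, the snippet below
reproduces two of the macros from the patch (written with __typeof__
rather than typeof so that it also builds in strict ISO modes, which is an
adaptation made for this example) and uses them in a typical
compare-exchange retry loop. token_take() is a hypothetical helper, not
part of the patch.

```c
#include <stdint.h>

#define memory_order_relaxed __ATOMIC_RELAXED
#define memory_order_acquire __ATOMIC_ACQUIRE

/* Copied from the patch, with typeof spelled __typeof__. */
#define rte_atomic_load(PTR, MO)			\
	(__extension__ ({				\
		__typeof__(PTR) _ptr = (PTR);		\
		__typeof__(*_ptr) _ret;			\
		__atomic_load(_ptr, &_ret, (MO));	\
		_ret;					\
	}))

#define rte_atomic_compare_exchange_weak(PTR, EXP, DES, SUC_MO, FAIL_MO) \
	(__extension__ ({						 \
		__typeof__(PTR) _ptr = (PTR);				 \
		__typeof__(*_ptr) _des = (DES);				 \
		__atomic_compare_exchange(_ptr, (EXP), &_des, 1,	 \
				 (SUC_MO), (FAIL_MO));			 \
	}))

/* Hypothetical helper: take one token if any remain.
 * Returns 1 on success, 0 when the pool is empty.
 */
static inline int
token_take(uint32_t *tokens)
{
	uint32_t cur = rte_atomic_load(tokens, memory_order_relaxed);

	do {
		if (cur == 0)
			return 0;
		/* On failure (including a spurious one) cur is refreshed
		 * with the current value of *tokens, so simply retry.
		 */
	} while (!rte_atomic_compare_exchange_weak(tokens, &cur, cur - 1,
			memory_order_acquire, memory_order_relaxed));

	return 1;
}
```

The weak variant is the natural choice inside a loop, since a spurious
failure only costs one more iteration. The failure order is relaxed, which
respects the rule in the patch comment that it must not be stronger than
the success order.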



* Re: [dpdk-dev] [PATCH v4 0/4] generic rte atomic APIs deprecate proposal
  2020-05-12  8:03     ` [dpdk-dev] [PATCH v4 0/4] " Phil Yang
                         ` (3 preceding siblings ...)
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics Phil Yang
@ 2020-05-12  8:18       ` Phil Yang
  2020-05-26  9:01       ` [dpdk-dev] [PATCH v5 " Phil Yang
  5 siblings, 0 replies; 219+ messages in thread
From: Phil Yang @ 2020-05-12  8:18 UTC (permalink / raw)
  To: Phil Yang, thomas, dev
  Cc: bruce.richardson, ferruh.yigit, hemant.agrawal,
	Honnappa Nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, stephen, mb, mattias.ronnblom,
	harry.van.haaren, erik.g.carrillo, nd, David Christensen, nd

+ David Christensen

The PPC c11 atomics maintainer.

My apologies, I forgot to cc you on this email.

Thanks,
Phil Yang

> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, May 12, 2020 4:03 PM
> To: thomas@monjalon.net; dev@dpdk.org
> Cc: bruce.richardson@intel.com; ferruh.yigit@intel.com;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com;
> ktraynor@redhat.com; konstantin.ananyev@intel.com;
> maxime.coquelin@redhat.com; olivier.matz@6wind.com;
> stephen@networkplumber.org; mb@smartsharesystems.com;
> mattias.ronnblom@ericsson.com; harry.van.haaren@intel.com;
> erik.g.carrillo@intel.com; Phil Yang <Phil.Yang@arm.com>; nd <nd@arm.com>
> Subject: [PATCH v4 0/4] generic rte atomic APIs deprecate proposal
> 
> DPDK provides generic rte_atomic APIs to do several atomic operations.
> These APIs are using the deprecated __sync built-ins and enforce full
> memory barriers on aarch64. However, full barriers are not necessary
> in many use cases. In order to address such use cases, C language offers
> C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> by making use of the memory ordering parameter provided by the user.
> Various patches submitted in the past [2] and the patches in this series
> indicate significant performance gains on multiple aarch64 CPUs and no
> performance loss on x86.
> 
> But the existing rte_atomic API implementations cannot be changed as the
> APIs do not take the memory ordering parameter. The only choice available
> is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> order to make this change, the following steps are proposed:
> 
> [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> APIs (a script is added to flag the usages).
> [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> 
> This patchset contains:
> 1) changes to programmer guide describing writing efficient code for aarch64.
> 2) the checkpatch script changes to flag rte_atomicNN_xxx API usage in
> patches.
> 3) wraps up compiler __atomic built-ins with explicit memory ordering
> parameter.
> 
> v4:
> 1. add reader-writer concurrency case describing.
> 2. claim maintainership of c11 atomics code for each platforms.
> 3. flag rte_atomicNN_xxx in new patches for modules that have been
> converted to
> c11 style.
> 4. flag __sync_xxx built-ins in new patches.
> 5. wraps up compiler atomic built-ins
> 6. move the changes of libraries which make use of c11 atomic APIs out of
> this
> patchset.
> 
> v3:
> add libatomic dependency for 32-bit clang
> 
> v2:
> 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> 2. fix typos.
> 
> 
> Phil Yang (4):
>   doc: add generic atomic deprecation section
>   maintainers: claim maintainers of c11 atomics code
>   devtools: prevent use of rte atomic APIs in future patches
>   eal/atomic: add wrapper for c11 atomics
> 
>  MAINTAINERS                                      |   4 +
>  devtools/checkpatches.sh                         |  23 ++++
>  doc/guides/prog_guide/writing_efficient_code.rst | 139
> ++++++++++++++++++++++-
>  lib/librte_eal/include/generic/rte_atomic_c11.h  | 139
> +++++++++++++++++++++++
>  lib/librte_eal/include/meson.build               |   1 +
>  5 files changed, 305 insertions(+), 1 deletion(-)
>  create mode 100644 lib/librte_eal/include/generic/rte_atomic_c11.h
> 
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics Phil Yang
@ 2020-05-12 11:18         ` Morten Brørup
  2020-05-13  9:40           ` Phil Yang
  2020-05-12 18:20         ` Stephen Hemminger
  1 sibling, 1 reply; 219+ messages in thread
From: Morten Brørup @ 2020-05-12 11:18 UTC (permalink / raw)
  To: Phil Yang, thomas, dev
  Cc: bruce.richardson, ferruh.yigit, hemant.agrawal,
	Honnappa Nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, stephen, mattias.ronnblom,
	harry.van.haaren, erik.g.carrillo, nd, David Christensen, nd

> From: Phil Yang [mailto:phil.yang@arm.com]
> Sent: Tuesday, May 12, 2020 10:03 AM
> 
> Wraps up compiler c11 atomic built-ins with explicit memory ordering
> parameter.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> ---
>  lib/librte_eal/include/generic/rte_atomic_c11.h | 139
> ++++++++++++++++++++++++
>  lib/librte_eal/include/meson.build              |   1 +
>  2 files changed, 140 insertions(+)
>  create mode 100644 lib/librte_eal/include/generic/rte_atomic_c11.h
> 
> diff --git a/lib/librte_eal/include/generic/rte_atomic_c11.h
> b/lib/librte_eal/include/generic/rte_atomic_c11.h
> new file mode 100644
> index 0000000..20490f4
> --- /dev/null
> +++ b/lib/librte_eal/include/generic/rte_atomic_c11.h
> @@ -0,0 +1,139 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2020 Arm Limited
> + */
> +
> +#ifndef _RTE_ATOMIC_C11_H_
> +#define _RTE_ATOMIC_C11_H_
> +
> +#include <rte_common.h>
> +
> +/**
> + * @file
> + * c11 atomic operations
> + *
> + * This file wraps up compiler (GCC) c11 atomic built-ins.
> + * https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> + */
> +
> +#define memory_order_relaxed __ATOMIC_RELAXED
> +#define memory_order_consume __ATOMIC_CONSUME
> +#define memory_order_acquire __ATOMIC_ACQUIRE
> +#define memory_order_release __ATOMIC_RELEASE
> +#define memory_order_acq_rel __ATOMIC_ACQ_REL
> +#define memory_order_seq_cst __ATOMIC_SEQ_CST

Why redefine these instead of using the original names?

If we need to redefine them, they should be upper case and RTE_ prefixed.

> +
> +/* Generic atomic load.
> + * It returns the contents of *PTR.
> + *
> + * The valid memory order variants are:
> + * memory_order_relaxed
> + * memory_order_consume
> + * memory_order_acquire
> + * memory_order_seq_cst
> + */
> +#define rte_atomic_load(PTR, MO)			\
> +	(__extension__ ({				\
> +		typeof(PTR) _ptr = (PTR);		\
> +		typeof(*_ptr) _ret;			\
> +		__atomic_load(_ptr, &_ret, (MO));	\
> +		_ret;					\
> +	}))
> +
> +/* Generic atomic store.
> + * It stores the value of VAL into *PTR.
> + *
> + * The valid memory order variants are:
> + * memory_order_relaxed
> + * memory_order_release
> + * memory_order_seq_cst
> + */
> +#define rte_atomic_store(PTR, VAL, MO)			\
> +	(__extension__ ({				\
> +		typeof(PTR) _ptr = (PTR);		\
> +		typeof(*_ptr) _val = (VAL);		\
> +		__atomic_store(_ptr, &_val, (MO));	\
> +	}))
> +
> +/* Generic atomic exchange.
> + * It stores the value of VAL into *PTR.
> + * It returns the original value of *PTR.
> + *
> + * The valid memory order variants are:
> + * memory_order_relaxed
> + * memory_order_acquire
> + * memory_order_release
> + * memory_order_acq_rel
> + * memory_order_seq_cst
> + */
> +#define rte_atomic_exchange(PTR, VAL, MO)			\
> +	(__extension__ ({					\
> +		typeof(PTR) _ptr = (PTR);			\
> +		typeof(*_ptr) _val = (VAL);			\
> +		typeof(*_ptr) _ret;				\
> +		__atomic_exchange(_ptr, &_val, &_ret, (MO));	\
> +		_ret;						\
> +	}))
> +
> +/* Generic atomic compare and exchange.
> + * It compares the contents of *PTR with the contents of *EXP.
> + * If equal, the operation is a read-modify-write operation that
> + * writes DES into *PTR.
> + * If they are not equal, the operation is a read and the current
> + * contents of *PTR are written into *EXP.
> + *
> + * The weak compare_exchange may fail spuriously and the strong
> + * variation will never fails spuriously.

"will never fails spuriously" -> "will never fail" / "never fails".

And I suggest that you elaborate what "fail" means here,
i.e. what exactly can happen when it fails.

> + *
> + * If DES is written into *PTR then true is returned and memory is
> + * affected according to the memory order specified by SUC_MO.
> + * There are no restrictions on what memory order can be used here.
> + *
> + * Otherwise, false is returned and memory is affected according to
> + * FAIL_MO. This memory order cannot be memory_order_release nor
> + * memory_order_acq_rel. It also cannot be a stronger order than that
> + * specified by SUC_MO.
> + */
> +#define rte_atomic_compare_exchange_weak(PTR, EXP, DES, SUC_MO, FAIL_MO)    \
> +	(__extension__ ({						    \
> +		typeof(PTR) _ptr = (PTR);				    \
> +		typeof(*_ptr) _des = (DES);				    \
> +		__atomic_compare_exchange(_ptr, (EXP), &_des, 1,	    \
> +				 (SUC_MO), (FAIL_MO));			    \
> +	}))
> +
> +#define rte_atomic_compare_exchange_strong(PTR, EXP, DES, SUC_MO, FAIL_MO)  \
> +	(__extension__ ({						    \
> +		typeof(PTR) _ptr = (PTR);				    \
> +		typeof(*_ptr) _des = (DES);				    \
> +		__atomic_compare_exchange(_ptr, (EXP), &_des, 0,	    \
> +				 (SUC_MO), (FAIL_MO));			    \
> +	}))
> +
> +#define rte_atomic_fetch_add(PTR, VAL, MO)		\
> +	__atomic_fetch_add((PTR), (VAL), (MO))
> +#define rte_atomic_fetch_sub(PTR, VAL, MO)		\
> +	__atomic_fetch_sub((PTR), (VAL), (MO))
> +#define rte_atomic_fetch_or(PTR, VAL, MO)		\
> +	__atomic_fetch_or((PTR), (VAL), (MO))
> +#define rte_atomic_fetch_xor(PTR, VAL, MO)		\
> +	__atomic_fetch_xor((PTR), (VAL), (MO))
> +#define rte_atomic_fetch_and(PTR, VAL, MO)		\
> +	__atomic_fetch_and((PTR), (VAL), (MO))
> +
> +#define rte_atomic_add_fetch(PTR, VAL, MO)		\
> +	__atomic_add_fetch((PTR), (VAL), (MO))
> +#define rte_atomic_sub_fetch(PTR, VAL, MO)		\
> +	__atomic_sub_fetch((PTR), (VAL), (MO))
> +#define rte_atomic_or_fetch(PTR, VAL, MO)		\
> +	__atomic_or_fetch((PTR), (VAL), (MO))
> +#define rte_atomic_xor_fetch(PTR, VAL, MO)		\
> +	__atomic_xor_fetch((PTR), (VAL), (MO))
> +#define rte_atomic_and_fetch(PTR, VAL, MO)		\
> +	__atomic_and_fetch((PTR), (VAL), (MO))
> +
> +/* Synchronization fence between threads based on
> + * the specified memory order.
> + */
> +#define rte_atomic_thread_fence(MO) __atomic_thread_fence((MO))
> +
> +#endif /* _RTE_ATOMIC_C11_H_ */
> diff --git a/lib/librte_eal/include/meson.build
> b/lib/librte_eal/include/meson.build
> index bc73ec2..dac1aac 100644
> --- a/lib/librte_eal/include/meson.build
> +++ b/lib/librte_eal/include/meson.build
> @@ -51,6 +51,7 @@ headers += files(
>  # special case install the generic headers, since they go in a subdir
>  generic_headers = files(
>  	'generic/rte_atomic.h',
> +	'generic/rte_atomic_c11.h',
>  	'generic/rte_byteorder.h',
>  	'generic/rte_cpuflags.h',
>  	'generic/rte_cycles.h',
> --
> 2.7.4
> 

Thumbs up for the good function documentation. :-)


Med venlig hilsen / kind regards
- Morten Brørup




^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-12  8:03       ` [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics Phil Yang
  2020-05-12 11:18         ` Morten Brørup
@ 2020-05-12 18:20         ` Stephen Hemminger
  2020-05-12 19:23           ` Honnappa Nagarahalli
  2020-05-13 19:25           ` Mattias Rönnblom
  1 sibling, 2 replies; 219+ messages in thread
From: Stephen Hemminger @ 2020-05-12 18:20 UTC (permalink / raw)
  To: Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	honnappa.nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, mb, mattias.ronnblom,
	harry.van.haaren, erik.g.carrillo, nd

On Tue, May 12, 2020 at 4:03 pm, Phil Yang <phil.yang@arm.com> wrote:
> parameter.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>


What is the purpose of having rte_atomic at all?
Is this level of indirection really helping?



^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-12 18:20         ` Stephen Hemminger
@ 2020-05-12 19:23           ` Honnappa Nagarahalli
  2020-05-13  8:57             ` Morten Brørup
  2020-05-13 11:53             ` Ananyev, Konstantin
  2020-05-13 19:25           ` Mattias Rönnblom
  1 sibling, 2 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-12 19:23 UTC (permalink / raw)
  To: Stephen Hemminger, Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	jerinj, ktraynor, konstantin.ananyev, maxime.coquelin,
	olivier.matz, mb, mattias.ronnblom, harry.van.haaren,
	erik.g.carrillo, nd, Honnappa Nagarahalli, nd

<snip>

Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics

> On Tue, May 12, 2020 at 4:03 pm, Phil Yang <phil.yang@arm.com> wrote:
> > parameter.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
>
> What is the purpose of having rte_atomic at all?
> Is this level of indirection really helping?
[HONNAPPA] (Not sure why this email is in HTML format; converted to text.)
I believe you meant: why not use the __atomic_xxx built-ins directly? The only reason for now is the handling of __atomic_thread_fence(__ATOMIC_SEQ_CST) on x86. It is equivalent to rte_smp_mb, which has an optimized implementation for x86. According to Konstantin, the compiler does not generate optimal code for it. Wrapping that one built-in alone would be confusing.

The wrappers also allow us to provide our own implementation using inline assembly for compiler versions that do not support the C11 atomic built-ins. But I do not know whether there is a need to support those versions.

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-12 19:23           ` Honnappa Nagarahalli
@ 2020-05-13  8:57             ` Morten Brørup
  2020-05-13 15:30               ` Honnappa Nagarahalli
  2020-05-13 19:04               ` Mattias Rönnblom
  2020-05-13 11:53             ` Ananyev, Konstantin
  1 sibling, 2 replies; 219+ messages in thread
From: Morten Brørup @ 2020-05-13  8:57 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Stephen Hemminger, Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	jerinj, ktraynor, konstantin.ananyev, maxime.coquelin,
	olivier.matz, mattias.ronnblom, harry.van.haaren,
	erik.g.carrillo, nd, David Christensen

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Tuesday, May 12, 2020 9:24 PM
> 
> <snip>
> 
> Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
> 
> On Tue, May 12, 2020 at 4:03 pm, Phil Yang <mailto:phil.yang@arm.com>
> wrote:
> 
> parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
> 
> 
> What is the purpose of having rte_atomic at all?
> Is this level of indirection really helping?
> [HONNAPPA] (not sure why this email has html format, converted to text
> format)
> I believe you meant, why not use the __atomic_xxx built-ins directly?
> The only reason for now is handling of
> __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is equivalent to
> rte_smp_mb which has an optimized implementation for x86. According to
> Konstantin, the compiler does not generate optimal code. Wrapping that
> built-in alone is going to be confusing.
> 
> The wrappers also allow us to have our own implementation using inline
> assembly for compilers versions that do not support C11 atomic built-
> ins. But, I do not know if there is a need to support those versions.

If I recall correctly, someone mentioned that one (or more) of the aging enterprise Linux distributions don't include a compiler with C11 atomics.

I think Stephen is onto something here...

It is silly to add wrappers like this, if the only purpose is to support compilers and distributions that don't properly support an official C standard which is nearly a decade old. The quality and quantity of the DPDK documentation for these functions (including examples, discussions on Stack Overflow, etc.) will be inferior to the documentation of the standard C11 atomics, which increases the probability of incorrect use.

And if some compiler generates code that is suboptimal for a user, then it should be the choice of the user to either accept it or use a better compiler. Using a suboptimal compiler will not only affect the user's DPDK applications, but all applications developed by the user. And if he accepts it for his other applications, he will also accept it for his DPDK applications.

We could introduce some sort of marker or standardized comment to indicate when functions only exist for backwards compatibility with ancient compilers and similar, with a reference to documentation describing why. And when the documented preconditions are no longer relevant, e.g. when those particular enterprise Linux distributions become obsolete, these functions become obsolete too, and should be removed. However, getting rid of obsolete cruft will break the ABI. In other words: Added cruft will never be removed again, so think twice before adding.


Med venlig hilsen / kind regards
- Morten Brørup




^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-12 11:18         ` Morten Brørup
@ 2020-05-13  9:40           ` Phil Yang
  2020-05-13 15:32             ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Phil Yang @ 2020-05-13  9:40 UTC (permalink / raw)
  To: Morten Brørup, thomas, dev
  Cc: bruce.richardson, ferruh.yigit, hemant.agrawal,
	Honnappa Nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, stephen, mattias.ronnblom,
	harry.van.haaren, erik.g.carrillo, nd, David Christensen, nd

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Tuesday, May 12, 2020 7:18 PM
> To: Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; dev@dpdk.org
> Cc: bruce.richardson@intel.com; ferruh.yigit@intel.com;
> hemant.agrawal@nxp.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; jerinj@marvell.com;
> ktraynor@redhat.com; konstantin.ananyev@intel.com;
> maxime.coquelin@redhat.com; olivier.matz@6wind.com;
> stephen@networkplumber.org; mattias.ronnblom@ericsson.com;
> harry.van.haaren@intel.com; erik.g.carrillo@intel.com; nd <nd@arm.com>;
> David Christensen <drc@linux.vnet.ibm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
> 
> > From: Phil Yang [mailto:phil.yang@arm.com]
> > Sent: Tuesday, May 12, 2020 10:03 AM
> >
> > Wraps up compiler c11 atomic built-ins with explicit memory ordering
> > parameter.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > ---
> >  lib/librte_eal/include/generic/rte_atomic_c11.h | 139
> > ++++++++++++++++++++++++
> >  lib/librte_eal/include/meson.build              |   1 +
> >  2 files changed, 140 insertions(+)
> >  create mode 100644 lib/librte_eal/include/generic/rte_atomic_c11.h
> >
> > diff --git a/lib/librte_eal/include/generic/rte_atomic_c11.h
> > b/lib/librte_eal/include/generic/rte_atomic_c11.h
> > new file mode 100644
> > index 0000000..20490f4
> > --- /dev/null
> > +++ b/lib/librte_eal/include/generic/rte_atomic_c11.h
> > @@ -0,0 +1,139 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2020 Arm Limited
> > + */
> > +
> > +#ifndef _RTE_ATOMIC_C11_H_
> > +#define _RTE_ATOMIC_C11_H_
> > +
> > +#include <rte_common.h>
> > +
> > +/**
> > + * @file
> > + * c11 atomic operations
> > + *
> > + * This file wraps up compiler (GCC) c11 atomic built-ins.
> > + * https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> > + */
> > +
> > +#define memory_order_relaxed __ATOMIC_RELAXED
> > +#define memory_order_consume __ATOMIC_CONSUME
> > +#define memory_order_acquire __ATOMIC_ACQUIRE
> > +#define memory_order_release __ATOMIC_RELEASE
> > +#define memory_order_acq_rel __ATOMIC_ACQ_REL
> > +#define memory_order_seq_cst __ATOMIC_SEQ_CST
> 
> Why redefine these instead of using the original names?
> 
> If we need to redefine them, they should be upper case and RTE_ prefixed.

Agreed, we don't need to redefine them. I was trying to align with the stdatomic library. 
I will remove them in the next version.

> 
> > +
> > +/* Generic atomic load.
> > + * It returns the contents of *PTR.
> > + *
> > + * The valid memory order variants are:
> > + * memory_order_relaxed
> > + * memory_order_consume
> > + * memory_order_acquire
> > + * memory_order_seq_cst
> > + */
> > +#define rte_atomic_load(PTR, MO)			\
> > +	(__extension__ ({				\
> > +		typeof(PTR) _ptr = (PTR);		\
> > +		typeof(*_ptr) _ret;			\
> > +		__atomic_load(_ptr, &_ret, (MO));	\
> > +		_ret;					\
> > +	}))
> > +
> > +/* Generic atomic store.
> > + * It stores the value of VAL into *PTR.
> > + *
> > + * The valid memory order variants are:
> > + * memory_order_relaxed
> > + * memory_order_release
> > + * memory_order_seq_cst
> > + */
> > +#define rte_atomic_store(PTR, VAL, MO)			\
> > +	(__extension__ ({				\
> > +		typeof(PTR) _ptr = (PTR);		\
> > +		typeof(*_ptr) _val = (VAL);		\
> > +		__atomic_store(_ptr, &_val, (MO));	\
> > +	}))
> > +
> > +/* Generic atomic exchange.
> > + * It stores the value of VAL into *PTR.
> > + * It returns the original value of *PTR.
> > + *
> > + * The valid memory order variants are:
> > + * memory_order_relaxed
> > + * memory_order_acquire
> > + * memory_order_release
> > + * memory_order_acq_rel
> > + * memory_order_seq_cst
> > + */
> > +#define rte_atomic_exchange(PTR, VAL, MO)			\
> > +	(__extension__ ({					\
> > +		typeof(PTR) _ptr = (PTR);			\
> > +		typeof(*_ptr) _val = (VAL);			\
> > +		typeof(*_ptr) _ret;				\
> > +		__atomic_exchange(_ptr, &_val, &_ret, (MO));	\
> > +		_ret;						\
> > +	}))
> > +
> > +/* Generic atomic compare and exchange.
> > + * It compares the contents of *PTR with the contents of *EXP.
> > + * If equal, the operation is a read-modify-write operation that
> > + * writes DES into *PTR.
> > + * If they are not equal, the operation is a read and the current
> > + * contents of *PTR are written into *EXP.
> > + *
> > + * The weak compare_exchange may fail spuriously and the strong
> > + * variation will never fails spuriously.
> 
> "will never fails spuriously" -> "will never fail" / "never fails".

Thanks, I will fix it in the next version.

> 
> And I suggest that you elaborate what "fail" means here,
> i.e. what exactly can happen when it fails.

Yes. That would be better. I will update it in the new version.
A spurious failure means that the compare-exchange operation behaves as if *PTR != *EXP and returns false even though the two are equal.
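
To make the spurious-failure case concrete, here is a minimal sketch of the usual retry loop around a weak compare-exchange. It is an illustration only, not code from the patch: it assumes the GCC/Clang __atomic built-ins, and example_atomic_max is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of the patch): atomically raise *max
 * to at least val. Returns true if this call updated *max.
 */
bool
example_atomic_max(uint32_t *max, uint32_t val)
{
	uint32_t cur = __atomic_load_n(max, __ATOMIC_RELAXED);

	while (cur < val) {
		/* Weak CAS (4th argument is 1): it may return false even
		 * when *max == cur. On any failure, cur is refreshed with
		 * the current contents of *max, so the loop just retries.
		 */
		if (__atomic_compare_exchange_n(max, &cur, val, 1,
				__ATOMIC_RELAXED, __ATOMIC_RELAXED))
			return true;
	}
	return false;
}
```

Because the loop re-checks the condition after every failure, a spurious failure only costs one extra iteration. The strong variant would be the natural choice where no such loop exists.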

> 
> > + *
> > + * If DES is written into *PTR then true is returned and memory is
> > + * affected according to the memory order specified by SUC_MO.
> > + * There are no restrictions on what memory order can be used here.
> > + *
> > + * Otherwise, false is returned and memory is affected according to
> > + * FAIL_MO. This memory order cannot be memory_order_release nor
> > + * memory_order_acq_rel. It also cannot be a stronger order than that
> > + * specified by SUC_MO.
> > + */
> > +#define rte_atomic_compare_exchange_weak(PTR, EXP, DES, SUC_MO, FAIL_MO)    \
> > +	(__extension__ ({						    \
> > +		typeof(PTR) _ptr = (PTR);				    \
> > +		typeof(*_ptr) _des = (DES);				    \
> > +		__atomic_compare_exchange(_ptr, (EXP), &_des, 1,	    \
> > +				 (SUC_MO), (FAIL_MO));			    \
> > +	}))
> > +
> > +#define rte_atomic_compare_exchange_strong(PTR, EXP, DES, SUC_MO, FAIL_MO)  \
> > +	(__extension__ ({						    \
> > +		typeof(PTR) _ptr = (PTR);				    \
> > +		typeof(*_ptr) _des = (DES);				    \
> > +		__atomic_compare_exchange(_ptr, (EXP), &_des, 0,	    \
> > +				 (SUC_MO), (FAIL_MO));			    \
> > +	}))
> > +
> > +#define rte_atomic_fetch_add(PTR, VAL, MO)		\
> > +	__atomic_fetch_add((PTR), (VAL), (MO))
> > +#define rte_atomic_fetch_sub(PTR, VAL, MO)		\
> > +	__atomic_fetch_sub((PTR), (VAL), (MO))
> > +#define rte_atomic_fetch_or(PTR, VAL, MO)		\
> > +	__atomic_fetch_or((PTR), (VAL), (MO))
> > +#define rte_atomic_fetch_xor(PTR, VAL, MO)		\
> > +	__atomic_fetch_xor((PTR), (VAL), (MO))
> > +#define rte_atomic_fetch_and(PTR, VAL, MO)		\
> > +	__atomic_fetch_and((PTR), (VAL), (MO))
> > +
> > +#define rte_atomic_add_fetch(PTR, VAL, MO)		\
> > +	__atomic_add_fetch((PTR), (VAL), (MO))
> > +#define rte_atomic_sub_fetch(PTR, VAL, MO)		\
> > +	__atomic_sub_fetch((PTR), (VAL), (MO))
> > +#define rte_atomic_or_fetch(PTR, VAL, MO)		\
> > +	__atomic_or_fetch((PTR), (VAL), (MO))
> > +#define rte_atomic_xor_fetch(PTR, VAL, MO)		\
> > +	__atomic_xor_fetch((PTR), (VAL), (MO))
> > +#define rte_atomic_and_fetch(PTR, VAL, MO)		\
> > +	__atomic_and_fetch((PTR), (VAL), (MO))
> > +
> > +/* Synchronization fence between threads based on
> > + * the specified memory order.
> > + */
> > +#define rte_atomic_thread_fence(MO) __atomic_thread_fence((MO))
> > +
> > +#endif /* _RTE_ATOMIC_C11_H_ */
> > diff --git a/lib/librte_eal/include/meson.build
> > b/lib/librte_eal/include/meson.build
> > index bc73ec2..dac1aac 100644
> > --- a/lib/librte_eal/include/meson.build
> > +++ b/lib/librte_eal/include/meson.build
> > @@ -51,6 +51,7 @@ headers += files(
> >  # special case install the generic headers, since they go in a subdir
> >  generic_headers = files(
> >  	'generic/rte_atomic.h',
> > +	'generic/rte_atomic_c11.h',
> >  	'generic/rte_byteorder.h',
> >  	'generic/rte_cpuflags.h',
> >  	'generic/rte_cycles.h',
> > --
> > 2.7.4
> >
> 
> Thumbs up for the good function documentation. :-)

Thank you for your comments.

Thanks,
Phil

> 
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 
> 


^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-12 19:23           ` Honnappa Nagarahalli
  2020-05-13  8:57             ` Morten Brørup
@ 2020-05-13 11:53             ` Ananyev, Konstantin
  2020-05-13 15:06               ` Honnappa Nagarahalli
  1 sibling, 1 reply; 219+ messages in thread
From: Ananyev, Konstantin @ 2020-05-13 11:53 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Stephen Hemminger, Phil Yang
  Cc: thomas, dev, Richardson, Bruce, Yigit, Ferruh, hemant.agrawal,
	jerinj, ktraynor, maxime.coquelin, olivier.matz, mb,
	mattias.ronnblom, Van Haaren, Harry, Carrillo, Erik G, nd, nd

> <snip>
> 
> Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
> 
> On Tue, May 12, 2020 at 4:03 pm, Phil Yang <mailto:phil.yang@arm.com> wrote:
> 
> parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
> 
> 
> What is the purpose of having rte_atomic at all?
> Is this level of indirection really helping?
> [HONNAPPA] (not sure why this email has html format, converted to text format)
> I believe you meant, why not use the __atomic_xxx built-ins directly? The only reason for now is handling of
> __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is equivalent to rte_smp_mb which has an optimized implementation for x86.
> According to Konstantin, the compiler does not generate optimal code. Wrapping that built-in alone is going to be confusing.
> 
> The wrappers also allow us to have our own implementation using inline assembly for compilers versions that do not support C11 atomic
> built-ins. But, I do not know if there is a need to support those versions.

A few thoughts from my side about that patch:
Yes, on x86 __atomic_thread_fence(__ATOMIC_SEQ_CST) generates a full 'mfence', which is quite expensive
and can be avoided for the SMP case.
Though I don't see why we need to create our own wrappers for *all* __atomic built-ins.
From my perspective it would be sufficient to introduce just a few of them:
rte_thread_fence_XXX (where XXX is a supported memory order: RELEASE, ACQUIRE, SEQ_CST, etc.).
For all other __atomic built-ins I don't see any problem with using them directly,
without introducing any wrappers around them.

As a side note, this patch implements rte_atomic_thread_fence() as a simple wrapper around
__atomic_thread_fence(), so the concern mentioned above is not addressed.
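
For illustration, using the built-ins directly (with no wrapper) could look like the sketch below. It is a minimal example, assuming the GCC/Clang __atomic built-ins; the example_mailbox structure and both function names are hypothetical and do not come from the patch.

```c
#include <stdint.h>

/* Hypothetical shared state, for illustration only. */
struct example_mailbox {
	uint32_t payload;
	uint32_t ready;
};

void
example_publish(struct example_mailbox *box, uint32_t value)
{
	box->payload = value;
	/* Release store: the payload write above becomes visible
	 * before any reader can observe ready == 1.
	 */
	__atomic_store_n(&box->ready, 1, __ATOMIC_RELEASE);
}

/* Returns 1 and fills *value if a message was published, else 0. */
int
example_try_consume(struct example_mailbox *box, uint32_t *value)
{
	/* Acquire load: pairs with the release store in example_publish(). */
	if (__atomic_load_n(&box->ready, __ATOMIC_ACQUIRE) == 0)
		return 0;
	*value = box->payload;
	return 1;
}
```

The memory order is spelled out at each call site, so no full barrier is needed on either side.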

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-13 11:53             ` Ananyev, Konstantin
@ 2020-05-13 15:06               ` Honnappa Nagarahalli
  0 siblings, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-13 15:06 UTC (permalink / raw)
  To: Ananyev, Konstantin, Stephen Hemminger, Phil Yang
  Cc: thomas, dev, Richardson, Bruce, Yigit, Ferruh, hemant.agrawal,
	jerinj, ktraynor, maxime.coquelin, olivier.matz, mb,
	mattias.ronnblom, Van Haaren, Harry, Carrillo, Erik G, nd,
	Honnappa Nagarahalli, nd

> > <snip>
> >
> > Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
> >
> > On Tue, May 12, 2020 at 4:03 pm, Phil Yang <mailto:phil.yang@arm.com>
> wrote:
> >
> > parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
> >
> >
> > What is the purpose of having rte_atomic at all?
> > Is this level of indirection really helping?
> > [HONNAPPA] (not sure why this email has html format, converted to text
> > format) I believe you meant, why not use the __atomic_xxx built-ins
> > directly? The only reason for now is handling of
> > __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is equivalent to
> rte_smp_mb which has an optimized implementation for x86.
> > According to Konstantin, the compiler does not generate optimal code.
> Wrapping that built-in alone is going to be confusing.
> >
> > The wrappers also allow us to have our own implementation using inline
> > assembly for compilers versions that do not support C11 atomic built-ins.
> But, I do not know if there is a need to support those versions.
> 
> Few thoughts from my side about that patch:
> Yes, for __atomic_thread_fence(__ATOMIC_SEQ_CST) generates full 'mfence'
> which is quite expensive, and can be a avoided for SMP case.
> Though I don't see why we need to create our own wrappers  for *all*
> __atomic buitins.
> From my perspective it would be sufficient to just introduce few of them:
> rte_thread_fence_XXX (where XXX - supported memory-orders: RELEASE,
> ACUIQRE, SEQ_CST, etc.).
> For all other __atomic built-ins I don't see any problem to use them directly,
> without introducing any wrappers around.
I am all for not adding wrappers just for the sake of it. Here, we were concerned about the uniformity of the code, hence the wrappers for all built-ins. Does anyone have any concerns with doing the wrappers only for __atomic_thread_fence?

Is there any possibility that the compiler will change in the future to generate the optimized code for x86?

For the API, we already have 'rte_atomic128_cmp_exchange' implemented with C11 semantics; I suggest we keep this one along the same lines. This would require the memory order to be a parameter.

> 
> As a side note, this patch implements rte_atomic_thread_fence() as a simple
> wrapper around __atomic_thread_fence(), so concern mentioned above is not
> addressed.
Agreed. So, we will just pick the implementation of rte_smp_mb for x86 for this.
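
A rough sketch of what such a fence wrapper could look like, with the memory order as a parameter and the x86 special case for SEQ_CST. This is an assumption-laden illustration, not the final API: example_thread_fence is a hypothetical name, and the inline assembly mirrors the locked-add approach of the x86 rte_smp_mb rather than calling it.

```c
/* Hypothetical wrapper name; not the API proposed in the patch. */
static inline void
example_thread_fence(int mo)
{
#if defined(__x86_64__)
	if (mo == __ATOMIC_SEQ_CST) {
		/* A locked no-op add on the stack acts as a full barrier
		 * and avoids the 'mfence' that the compiler would emit.
		 * Assumption: same approach as the x86 rte_smp_mb.
		 */
		__asm__ volatile("lock addl $0, -128(%%rsp); " ::: "memory");
		return;
	}
#endif
	/* All other orders (and other architectures): use the built-in. */
	__atomic_thread_fence(mo);
}
```

Callers pass the usual __ATOMIC_xxx constants, so only the SEQ_CST fence on x86 differs from calling the built-in directly.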

^ permalink raw reply	[flat|nested] 219+ messages in thread

* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-13  8:57             ` Morten Brørup
@ 2020-05-13 15:30               ` Honnappa Nagarahalli
  2020-05-13 19:04               ` Mattias Rönnblom
  1 sibling, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-13 15:30 UTC (permalink / raw)
  To: Morten Brørup, Stephen Hemminger, Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	jerinj, ktraynor, konstantin.ananyev, maxime.coquelin,
	olivier.matz, mattias.ronnblom, harry.van.haaren,
	erik.g.carrillo, nd, David Christensen, Honnappa Nagarahalli, nd

<snip>
> > Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
> >
> > On Tue, May 12, 2020 at 4:03 pm, Phil Yang <mailto:phil.yang@arm.com>
> > wrote:
> >
> > parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
> >
> >
> > What is the purpose of having rte_atomic at all?
> > Is this level of indirection really helping?
> > [HONNAPPA] (not sure why this email has html format, converted to text
> > format)
> > I believe you meant, why not use the __atomic_xxx built-ins directly?
> > The only reason for now is handling of
> > __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is equivalent to
> > rte_smp_mb which has an optimized implementation for x86. According to
> > Konstantin, the compiler does not generate optimal code. Wrapping that
> > built-in alone is going to be confusing.
> >
> > The wrappers also allow us to have our own implementation using inline
> > assembly for compilers versions that do not support C11 atomic built-
> > ins. But, I do not know if there is a need to support those versions.
> 
> If I recall correctly, someone mentioned that one (or more) of the aging
> enterprise Linux distributions don't include a compiler with C11 atomics.
I searched through the mailing list yesterday and could not find anyone mentioning compilers that do not support the C11 built-ins. However, the C11 atomic APIs (as defined in stdatomic.h) are supported only in later versions of the compilers. So, using the C11 built-ins gives us better coverage of older compilers (including the ones used in Intel CI, which were the oldest versions mentioned on the mailing list).
IMO, we should not be worried about compilers that do not support C11.
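
To show the difference between the two flavours, here is a small side-by-side sketch. It is only an illustration under stated assumptions: it needs a compiler that ships stdatomic.h, and the counter and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

uint32_t example_builtin_counter;	/* plain object, used with built-ins */
_Atomic uint32_t example_c11_counter;	/* stdatomic.h needs an _Atomic type */

/* Compiler built-in: works on an ordinary uint32_t. */
uint32_t
example_bump_builtin(void)
{
	return __atomic_fetch_add(&example_builtin_counter, 1,
			__ATOMIC_RELAXED);
}

/* C11 stdatomic.h API: same operation on an _Atomic object. */
uint32_t
example_bump_c11(void)
{
	return atomic_fetch_add_explicit(&example_c11_counter, 1,
			memory_order_relaxed);
}
```

Both functions return the value held before the increment. The built-in form does not require changing the type of existing variables.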

> 
> I think Stephen is onto something here...
> 
> It is silly to add wrappers like this, if the only purpose is to support compilers
> and distributions that don't properly support an official C standard which is
> nearly a decade old. The quality and quantity of the DPDK documentation for
> these functions (including examples, discussions on Stack Overflow, etc.) will
> be inferior to the documentation of the standard C11 atomics, which
> increases the probability of incorrect use.
I agree. I do not want to add them for the sake of adding them. But I do think that we need to solve the issues in DPDK (if they affect performance) that could be due to tools. As Konstantin suggested, we could do the wrappers only for the __atomic_thread_fence built-in. This will make life a lot easier.

> 
> And if some compiler generates code that is suboptimal for a user, then it
> should be the choice of the user to either accept it or use a better compiler.
> Using a suboptimal compiler will not only affect the user's DPDK applications,
> but all applications developed by the user. And if he accepts it for his other
> applications, he will also accept it for his DPDK applications.
> 
> We could introduce some sort of marker or standardized comment to indicate
> when functions only exist for backwards compatibility with ancient compilers
> and similar, with a reference to documentation describing why. And when the
> documented preconditions are no longer relevant, e.g. when those particular
> enterprise Linux distributions become obsolete, these functions become
> obsolete too, and should be removed. However, getting rid of obsolete cruft
> will break the ABI. In other words: Added cruft will never be removed again,
> so think twice before adding.
> 
> 
> Med venlig hilsen / kind regards
> - Morten Brørup
> 
> 



* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-13  9:40           ` Phil Yang
@ 2020-05-13 15:32             ` Honnappa Nagarahalli
  0 siblings, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-13 15:32 UTC (permalink / raw)
  To: Phil Yang, Morten Brørup, thomas, dev
  Cc: bruce.richardson, ferruh.yigit, hemant.agrawal, jerinj, ktraynor,
	konstantin.ananyev, maxime.coquelin, olivier.matz, stephen,
	mattias.ronnblom, harry.van.haaren, erik.g.carrillo, nd,
	David Christensen, nd, Honnappa Nagarahalli, nd

<snip>

> > Subject: RE: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
> >
> > > From: Phil Yang [mailto:phil.yang@arm.com]
> > > Sent: Tuesday, May 12, 2020 10:03 AM
> > >
> > > Wraps up compiler c11 atomic built-ins with explicit memory ordering
> > > parameter.
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > ---
> > >  lib/librte_eal/include/generic/rte_atomic_c11.h | 139
> > > ++++++++++++++++++++++++
> > >  lib/librte_eal/include/meson.build              |   1 +
> > >  2 files changed, 140 insertions(+)
> > >  create mode 100644 lib/librte_eal/include/generic/rte_atomic_c11.h
> > >
> > > diff --git a/lib/librte_eal/include/generic/rte_atomic_c11.h b/lib/librte_eal/include/generic/rte_atomic_c11.h
> > > new file mode 100644
> > > index 0000000..20490f4
> > > --- /dev/null
> > > +++ b/lib/librte_eal/include/generic/rte_atomic_c11.h
> > > @@ -0,0 +1,139 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2020 Arm Limited
> > > + */
> > > +
> > > +#ifndef _RTE_ATOMIC_C11_H_
> > > +#define _RTE_ATOMIC_C11_H_
> > > +
> > > +#include <rte_common.h>
> > > +
> > > +/**
> > > + * @file
> > > + * c11 atomic operations
> > > + *
> > > + * This file wraps up compiler (GCC) c11 atomic built-ins.
> > > + *
> > > + * https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> > > + */
> > > +
> > > +#define memory_order_relaxed __ATOMIC_RELAXED
> > > +#define memory_order_consume __ATOMIC_CONSUME
> > > +#define memory_order_acquire __ATOMIC_ACQUIRE
> > > +#define memory_order_release __ATOMIC_RELEASE
> > > +#define memory_order_acq_rel __ATOMIC_ACQ_REL
> > > +#define memory_order_seq_cst __ATOMIC_SEQ_CST
> >
> > Why redefine these instead of using the original names?
> >
> > If we need to redefine them, they should be upper case and RTE_ prefixed.
> 
> Agreed, we don't need to redefine them. I was trying to align with the
> stdatomic library.
> I will remove them in the next version.
Agreed, this will keep it in line with the rte_atomic128_cmp_exchange API.

> 
> >
> > > +
> > > +/* Generic atomic load.
> > > + * It returns the contents of *PTR.
> > > + *
> > > + * The valid memory order variants are:
> > > + * memory_order_relaxed
> > > + * memory_order_consume
> > > + * memory_order_acquire
> > > + * memory_order_seq_cst
> > > + */
> > > +#define rte_atomic_load(PTR, MO)\
> > > +(__extension__ ({\
> > > +typeof(PTR) _ptr = (PTR);\
> > > +typeof(*_ptr) _ret;\
> > > +__atomic_load(_ptr, &_ret, (MO));\
> > > +_ret;\
> > > +}))
> > > +
> > > +/* Generic atomic store.
> > > + * It stores the value of VAL into *PTR.
> > > + *
> > > + * The valid memory order variants are:
> > > + * memory_order_relaxed
> > > + * memory_order_release
> > > + * memory_order_seq_cst
> > > + */
> > > +#define rte_atomic_store(PTR, VAL, MO)\
> > > +(__extension__ ({\
> > > +typeof(PTR) _ptr = (PTR);\
> > > +typeof(*_ptr) _val = (VAL);\
> > > +__atomic_store(_ptr, &_val, (MO));\
> > > +}))
> > > +
> > > +/* Generic atomic exchange.
> > > + * It stores the value of VAL into *PTR.
> > > + * It returns the original value of *PTR.
> > > + *
> > > + * The valid memory order variants are:
> > > + * memory_order_relaxed
> > > + * memory_order_acquire
> > > + * memory_order_release
> > > + * memory_order_acq_rel
> > > + * memory_order_seq_cst
> > > + */
> > > +#define rte_atomic_exchange(PTR, VAL, MO)\
> > > +(__extension__ ({\
> > > +typeof(PTR) _ptr = (PTR);\
> > > +typeof(*_ptr) _val = (VAL);\
> > > +typeof(*_ptr) _ret;\
> > > +__atomic_exchange(_ptr, &_val, &_ret, (MO));\
> > > +_ret;\
> > > +}))
> > > +
> > > +/* Generic atomic compare and exchange.
> > > + * It compares the contents of *PTR with the contents of *EXP.
> > > + * If equal, the operation is a read-modify-write operation that
> > > + * writes DES into *PTR.
> > > + * If they are not equal, the operation is a read and the current
> > > + * contents of *PTR are written into *EXP.
> > > + *
> > > + * The weak compare_exchange may fail spuriously and the strong
> > > + * variation will never fails spuriously.
> >
> > "will never fails spuriously" -> "will never fail" / "never fails".
> 
> Thanks, I will fix it in the next version.
> 
> >
> > And I suggest that you elaborate what "fail" means here, i.e. what
> > exactly can happen when it fails.
> 
> Yes. That would be better. I will update it in the new version.
> Fail spuriously means the compare exchange operation acts as *PTR != *EXP
> and return false even if they are equal.
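
To illustrate the spurious-failure point, here is a minimal sketch of the usual retry loop around the weak variant. It calls the GCC/Clang __atomic_compare_exchange_n built-in directly rather than the proposed rte_atomic_compare_exchange_weak macro, and example_atomic_max is a hypothetical helper that is not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in the patch): atomically raise *max to val if
 * val is larger. Returns true if *max was updated by this call.
 */
static inline bool
example_atomic_max(uint32_t *max, uint32_t val)
{
	uint32_t cur;

	assert(max != NULL);
	cur = __atomic_load_n(max, __ATOMIC_RELAXED);
	while (cur < val) {
		/* Weak variant (4th argument is 1): it may return false even
		 * though *max == cur. On any failure, cur is refreshed with
		 * the current contents of *max, so the loop just retries.
		 */
		if (__atomic_compare_exchange_n(max, &cur, val, 1,
				__ATOMIC_RELEASE, __ATOMIC_RELAXED))
			return true;
	}
	return false;
}
```

Because a failed exchange (spurious or not) refreshes cur with the current value, the caller cannot tell the two cases apart and does not need to. The strong variant is mainly worthwhile when there is no loop to retry in.
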
> 
> >
> > > + *
> > > + * If DES is written into *PTR then true is returned and memory is
> > > + * affected according to the memory order specified by SUC_MO.
> > > + * There are no restrictions on what memory order can be used here.
> > > + *
> > > + * Otherwise, false is returned and memory is affected according to
> > > + * FAIL_MO. This memory order cannot be memory_order_release nor
> > > + * memory_order_acq_rel. It also cannot be a stronger order than that
> > > + * specified by SUC_MO.
> > > + */
> > > +#define rte_atomic_compare_exchange_weak(PTR, EXP, DES, SUC_MO, FAIL_MO)    \
> > > +(__extension__ ({    \
> > > +typeof(PTR) _ptr = (PTR);    \
> > > +typeof(*_ptr) _des = (DES);    \
> > > +__atomic_compare_exchange(_ptr, (EXP), &_des, 1,    \
> > > + (SUC_MO), (FAIL_MO));    \
> > > +}))
> > > +
> > > +#define rte_atomic_compare_exchange_strong(PTR, EXP, DES, SUC_MO, FAIL_MO)  \
> > > +(__extension__ ({    \
> > > +typeof(PTR) _ptr = (PTR);    \
> > > +typeof(*_ptr) _des = (DES);    \
> > > +__atomic_compare_exchange(_ptr, (EXP), &_des, 0,    \
> > > + (SUC_MO), (FAIL_MO));    \
> > > +}))
> > > +
> > > +#define rte_atomic_fetch_add(PTR, VAL, MO)\
> > > +__atomic_fetch_add((PTR), (VAL), (MO))
> > > +#define rte_atomic_fetch_sub(PTR, VAL, MO)\
> > > +__atomic_fetch_sub((PTR), (VAL), (MO))
> > > +#define rte_atomic_fetch_or(PTR, VAL, MO)\
> > > +__atomic_fetch_or((PTR), (VAL), (MO))
> > > +#define rte_atomic_fetch_xor(PTR, VAL, MO)\
> > > +__atomic_fetch_xor((PTR), (VAL), (MO))
> > > +#define rte_atomic_fetch_and(PTR, VAL, MO)\
> > > +__atomic_fetch_and((PTR), (VAL), (MO))
> > > +
> > > +#define rte_atomic_add_fetch(PTR, VAL, MO)\
> > > +__atomic_add_fetch((PTR), (VAL), (MO))
> > > +#define rte_atomic_sub_fetch(PTR, VAL, MO)\
> > > +__atomic_sub_fetch((PTR), (VAL), (MO))
> > > +#define rte_atomic_or_fetch(PTR, VAL, MO)\
> > > +__atomic_or_fetch((PTR), (VAL), (MO))
> > > +#define rte_atomic_xor_fetch(PTR, VAL, MO)\
> > > +__atomic_xor_fetch((PTR), (VAL), (MO))
> > > +#define rte_atomic_and_fetch(PTR, VAL, MO)\
> > > +__atomic_and_fetch((PTR), (VAL), (MO))
> > > +
> > > +/* Synchronization fence between threads based on
> > > + * the specified memory order.
> > > + */
> > > +#define rte_atomic_thread_fence(MO) __atomic_thread_fence((MO))
> > > +
> > > +#endif /* _RTE_ATOMIC_C11_H_ */
> > > diff --git a/lib/librte_eal/include/meson.build b/lib/librte_eal/include/meson.build
> > > index bc73ec2..dac1aac 100644
> > > --- a/lib/librte_eal/include/meson.build
> > > +++ b/lib/librte_eal/include/meson.build
> > > @@ -51,6 +51,7 @@ headers += files(
> > >  # special case install the generic headers, since they go in a subdir
> > >  generic_headers = files(
> > >  'generic/rte_atomic.h',
> > > +'generic/rte_atomic_c11.h',
> > >  'generic/rte_byteorder.h',
> > >  'generic/rte_cpuflags.h',
> > >  'generic/rte_cycles.h',
> > > --
> > > 2.7.4
> > >
> >
> > Thumbs up for the good function documentation. :-)
> 
> Thank you for your comments.
> 
> Thanks,
> Phil
> 
> >
> >
> > Med venlig hilsen / kind regards
> > - Morten Brørup
> >
> >
> 



* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-13  8:57             ` Morten Brørup
  2020-05-13 15:30               ` Honnappa Nagarahalli
@ 2020-05-13 19:04               ` Mattias Rönnblom
  2020-05-13 19:40                 ` Honnappa Nagarahalli
  1 sibling, 1 reply; 219+ messages in thread
From: Mattias Rönnblom @ 2020-05-13 19:04 UTC (permalink / raw)
  To: Morten Brørup, Honnappa Nagarahalli, Stephen Hemminger, Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	jerinj, ktraynor, konstantin.ananyev, maxime.coquelin,
	olivier.matz, harry.van.haaren, erik.g.carrillo, nd,
	David Christensen

On 2020-05-13 10:57, Morten Brørup wrote:
>> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
>> Sent: Tuesday, May 12, 2020 9:24 PM
>>
>> <snip>
>>
>> Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
>>
>> On Tue, May 12, 2020 at 4:03 pm, Phil Yang <mailto:phil.yang@arm.com>
>> wrote:
>>
>> parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
>>
>>
>> What is the purpose of having rte_atomic at all?
>> Is this level of indirection really helping?
>> [HONNAPPA] (not sure why this email has html format, converted to text
>> format)
>> I believe you meant, why not use the __atomic_xxx built-ins directly?
>> The only reason for now is handling of
>> __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is equivalent to
>> rte_smp_mb which has an optimized implementation for x86. According to
>> Konstantin, the compiler does not generate optimal code. Wrapping that
>> built-in alone is going to be confusing.
>>
>> The wrappers also allow us to have our own implementation using inline
>> assembly for compilers versions that do not support C11 atomic built-
>> ins. But, I do not know if there is a need to support those versions.
> If I recall correctly, someone mentioned that one (or more) of the aging enterprise Linux distributions don't include a compiler with C11 atomics.
>
> I think Stephen is onto something here...
>
> It is silly to add wrappers like this, if the only purpose is to support compilers and distributions that don't properly support an official C standard which is nearly a decade old. The quality and quantity of the DPDK documentation for these functions (including examples, discussions on Stack Overflow, etc.) will be inferior to the documentation of the standard C11 atomics, which increases the probability of incorrect use.


What's being used in DPDK today, and what's being wrapped here, is not 
standard C11 atomics - it's a bunch of GCC built-ins. Nothing in the __ 
namespace is in the standard. It's reserved for the implementation (e.g. 
compiler).
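
To make the distinction concrete, here is a small sketch of the same relaxed fetch-and-add written both ways. The function names are hypothetical and only for illustration. The first uses the standard <stdatomic.h> interface and an _Atomic-qualified object; the second uses the GCC/Clang __atomic built-in on a plain integer, which is the style DPDK code uses today:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Standard C11: the object has an _Atomic type and the memory order is a
 * memory_order enumerator declared in <stdatomic.h>.
 */
static inline uint32_t
example_c11_fetch_add(_Atomic uint32_t *cnt, uint32_t n)
{
	assert(cnt != NULL);
	return atomic_fetch_add_explicit(cnt, n, memory_order_relaxed);
}

/* GCC/Clang built-in: works on a plain (non-_Atomic) object and takes an
 * __ATOMIC_* constant. Names starting with __ are reserved for the
 * implementation, so this is a compiler extension, not ISO C.
 */
static inline uint32_t
example_builtin_fetch_add(uint32_t *cnt, uint32_t n)
{
	assert(cnt != NULL);
	return __atomic_fetch_add(cnt, n, __ATOMIC_RELAXED);
}
```

Both return the value held before the addition; the practical difference is the type of the object and which header, if any, is required.
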


> And if some compiler generates code that is suboptimal for a user, then it should be the choice of the user to either accept it or use a better compiler. Using a suboptimal compiler will not only affect the user's DPDK applications, but all applications developed by the user. And if he accepts it for his other applications, he will also accept it for his DPDK applications.
>
> We could introduce some sort of marker or standardized comment to indicate when functions only exist for backwards compatibility with ancient compilers and similar, with a reference to documentation describing why. And when the documented preconditions are no longer relevant, e.g. when those particular enterprise Linux distributions become obsolete, these functions become obsolete too, and should be removed. However, getting rid of obsolete cruft will break the ABI. In other words: Added cruft will never be removed again, so think twice before adding.
>
>
> Med venlig hilsen / kind regards
> - Morten Brørup
>
>
>



* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-12 18:20         ` Stephen Hemminger
  2020-05-12 19:23           ` Honnappa Nagarahalli
@ 2020-05-13 19:25           ` Mattias Rönnblom
  1 sibling, 0 replies; 219+ messages in thread
From: Mattias Rönnblom @ 2020-05-13 19:25 UTC (permalink / raw)
  To: Stephen Hemminger, Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	honnappa.nagarahalli, jerinj, ktraynor, konstantin.ananyev,
	maxime.coquelin, olivier.matz, mb, harry.van.haaren,
	erik.g.carrillo, nd

On 2020-05-12 20:20, Stephen Hemminger wrote:
> On Tue, May 12, 2020 at 4:03 pm, Phil Yang <phil.yang@arm.com> wrote:
>> parameter. Signed-off-by: Phil Yang <phil.yang@arm.com 
>> <mailto:phil.yang@arm.com>>
>
>
> What is the purpose of having rte_atomic at all?
> Is this level of indirection really helping?
>

To allow a different implementation than the GCC built-ins, for certain
functions and architectures. To allow extensions (i.e. atomic functions
that aren't GCC built-ins) in a clean way. To avoid using GCC built-ins
directly, both for cosmetic reasons and because it might cause problems
with future compilers.
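
As a rough sketch of the first point, a wrapper could special-case the sequentially consistent fence on x86-64 and defer to the built-in everywhere else. example_thread_fence is a hypothetical name, not the patch's rte_atomic_thread_fence, and the locked-add idiom below is, as far as I know, the one rte_smp_mb uses on x86; treat the details as assumptions:

```c
#include <assert.h>

/* Hypothetical wrapper: per-architecture override for one memory order,
 * generic built-in for everything else.
 */
static inline void
example_thread_fence(int memorder)
{
	assert(memorder >= __ATOMIC_RELAXED && memorder <= __ATOMIC_SEQ_CST);
#if defined(__x86_64__)
	if (memorder == __ATOMIC_SEQ_CST) {
		/* A locked add of zero to a stack location acts as a full
		 * barrier on x86-64; the "memory" clobber also stops the
		 * compiler from reordering accesses across it.
		 */
		__asm__ volatile("lock addl $0, -128(%%rsp)" ::: "memory");
		return;
	}
#endif
	/* All other orders and architectures: use the compiler built-in. */
	__atomic_thread_fence(memorder);
}
```

Callers would write example_thread_fence(__ATOMIC_SEQ_CST) or example_thread_fence(__ATOMIC_ACQUIRE) exactly as they would with the built-in, and only the wrapper knows about the x86 special case.
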





* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-13 19:04               ` Mattias Rönnblom
@ 2020-05-13 19:40                 ` Honnappa Nagarahalli
  2020-05-13 20:17                   ` Mattias Rönnblom
  0 siblings, 1 reply; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-13 19:40 UTC (permalink / raw)
  To: Mattias Rönnblom, Morten Brørup, Stephen Hemminger, Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	jerinj, ktraynor, konstantin.ananyev, maxime.coquelin,
	olivier.matz, harry.van.haaren, erik.g.carrillo, nd,
	David Christensen, Honnappa Nagarahalli, nd

<snip>

> >>
> >> Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
> >>
> >> On Tue, May 12, 2020 at 4:03 pm, Phil Yang <mailto:phil.yang@arm.com>
> >> wrote:
> >>
> >> parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
> >>
> >>
> >> What is the purpose of having rte_atomic at all?
> >> Is this level of indirection really helping?
> >> [HONNAPPA] (not sure why this email has html format, converted to
> >> text
> >> format)
> >> I believe you meant, why not use the __atomic_xxx built-ins directly?
> >> The only reason for now is handling of
> >> __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is equivalent
> >> to rte_smp_mb which has an optimized implementation for x86.
> >> According to Konstantin, the compiler does not generate optimal code.
> >> Wrapping that built-in alone is going to be confusing.
> >>
> >> The wrappers also allow us to have our own implementation using
> >> inline assembly for compilers versions that do not support C11 atomic
> >> built- ins. But, I do not know if there is a need to support those versions.
> > If I recall correctly, someone mentioned that one (or more) of the aging
> enterprise Linux distributions don't include a compiler with C11 atomics.
> >
> > I think Stephen is onto something here...
> >
> > It is silly to add wrappers like this, if the only purpose is to support
> compilers and distributions that don't properly support an official C standard
> which is nearly a decade old. The quality and quantity of the DPDK
> documentation for these functions (including examples, discussions on Stack
> Overflow, etc.) will be inferior to the documentation of the standard C11
> atomics, which increases the probability of incorrect use.
> 
> 
> What's being used in DPDK today, and what's being wrapped here, is not
> standard C11 atomics - it's a bunch of GCC built-ins. Nothing in the __
> namespace is in the standard. It's reserved for the implementation (e.g.
> compiler).
I have tried to understand what is meant by 'built-ins', but I have not found a good answer. So, does it mean that the built-in function (same symbol and API interface) may not be available in another C compiler? IMO, this is what matters for DPDK.
Currently, the same built-in functions are available in GCC and Clang.

> 
> 
> > And if some compiler generates code that is suboptimal for a user, then it
> should be the choice of the user to either accept it or use a better compiler.
> Using a suboptimal compiler will not only affect the user's DPDK applications,
> but all applications developed by the user. And if he accepts it for his other
> applications, he will also accept it for his DPDK applications.
> >
> > We could introduce some sort of marker or standardized comment to
> indicate when functions only exist for backwards compatibility with ancient
> compilers and similar, with a reference to documentation describing why. And
> when the documented preconditions are no longer relevant, e.g. when those
> particular enterprise Linux distributions become obsolete, these functions
> become obsolete too, and should be removed. However, getting rid of
> obsolete cruft will break the ABI. In other words: Added cruft will never be
> removed again, so think twice before adding.
> >
> >
> > Med venlig hilsen / kind regards
> > - Morten Brørup
> >
> >
> >



* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-13 19:40                 ` Honnappa Nagarahalli
@ 2020-05-13 20:17                   ` Mattias Rönnblom
  2020-05-14  8:34                     ` Morten Brørup
  0 siblings, 1 reply; 219+ messages in thread
From: Mattias Rönnblom @ 2020-05-13 20:17 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Morten Brørup, Stephen Hemminger, Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	jerinj, ktraynor, konstantin.ananyev, maxime.coquelin,
	olivier.matz, harry.van.haaren, erik.g.carrillo, nd,
	David Christensen

On 2020-05-13 21:40, Honnappa Nagarahalli wrote:
> <snip>
>
>>>> Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
>>>>
>>>> On Tue, May 12, 2020 at 4:03 pm, Phil Yang <mailto:phil.yang@arm.com>
>>>> wrote:
>>>>
>>>> parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
>>>>
>>>>
>>>> What is the purpose of having rte_atomic at all?
>>>> Is this level of indirection really helping?
>>>> [HONNAPPA] (not sure why this email has html format, converted to
>>>> text
>>>> format)
>>>> I believe you meant, why not use the __atomic_xxx built-ins directly?
>>>> The only reason for now is handling of
>>>> __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is equivalent
>>>> to rte_smp_mb which has an optimized implementation for x86.
>>>> According to Konstantin, the compiler does not generate optimal code.
>>>> Wrapping that built-in alone is going to be confusing.
>>>>
>>>> The wrappers also allow us to have our own implementation using
>>>> inline assembly for compilers versions that do not support C11 atomic
>>>> built- ins. But, I do not know if there is a need to support those versions.
>>> If I recall correctly, someone mentioned that one (or more) of the aging
>> enterprise Linux distributions don't include a compiler with C11 atomics.
>>> I think Stephen is onto something here...
>>>
>>> It is silly to add wrappers like this, if the only purpose is to support
>> compilers and distributions that don't properly support an official C standard
>> which is nearly a decade old. The quality and quantity of the DPDK
>> documentation for these functions (including examples, discussions on Stack
>> Overflow, etc.) will be inferior to the documentation of the standard C11
>> atomics, which increases the probability of incorrect use.
>>
>>
>> What's being used in DPDK today, and what's being wrapped here, is not
>> standard C11 atomics - it's a bunch of GCC built-ins. Nothing in the __
>> namespace is in the standard. It's reserved for the implementation (e.g.
>> compiler).
> I have tried to understand what it mean by 'built-ins', but I have not got a good answer. So, does it mean that the built-in function (same symbol and API interface) may not be available in another C compiler? IMO, this is what matters for DPDK.
> Currently, the same built-in functions are available in GCC and Clang.


From what I understand, "built-ins" is GCC terminology for
non-standard, implementation-specific intrinsic functions, built into
the compiler. They all reside in the __* namespace.


Since GCC is the industry standard, other compilers are likely to 
follow, including built-in functions.

>>
>>> And if some compiler generates code that is suboptimal for a user, then it
>> should be the choice of the user to either accept it or use a better compiler.
>> Using a suboptimal compiler will not only affect the user's DPDK applications,
>> but all applications developed by the user. And if he accepts it for his other
>> applications, he will also accept it for his DPDK applications.
>>> We could introduce some sort of marker or standardized comment to
>> indicate when functions only exist for backwards compatibility with ancient
>> compilers and similar, with a reference to documentation describing why. And
>> when the documented preconditions are no longer relevant, e.g. when those
>> particular enterprise Linux distributions become obsolete, these functions
>> become obsolete too, and should be removed. However, getting rid of
>> obsolete cruft will break the ABI. In other words: Added cruft will never be
>> removed again, so think twice before adding.
>>>
>>> Med venlig hilsen / kind regards
>>> - Morten Brørup
>>>
>>>
>>>



* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-13 20:17                   ` Mattias Rönnblom
@ 2020-05-14  8:34                     ` Morten Brørup
  2020-05-14 20:16                       ` Mattias Rönnblom
  0 siblings, 1 reply; 219+ messages in thread
From: Morten Brørup @ 2020-05-14  8:34 UTC (permalink / raw)
  To: Mattias Rönnblom, Honnappa Nagarahalli, Stephen Hemminger,
	Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	jerinj, ktraynor, konstantin.ananyev, maxime.coquelin,
	olivier.matz, harry.van.haaren, erik.g.carrillo, nd,
	David Christensen, david.marchand, Song Zhu, Gavin Hu,
	Jeff Brownlee, Philippe Robin, Pravin Kantak, Chen, Zhaoyan

+ Added people from the related discussion regarding the ARM roadmap [https://mails.dpdk.org/archives/dev/2020-April/162580.html].

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, May 13, 2020 10:17 PM
> 
> On 2020-05-13 21:40, Honnappa Nagarahalli wrote:
> > <snip>
> >
> >>>> Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11
> atomics
> >>>>
> >>>> On Tue, May 12, 2020 at 4:03 pm, Phil Yang
> <mailto:phil.yang@arm.com>
> >>>> wrote:
> >>>>
> >>>> parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
> >>>>
> >>>>
> >>>> What is the purpose of having rte_atomic at all?
> >>>> Is this level of indirection really helping?
> >>>> [HONNAPPA] (not sure why this email has html format, converted to
> >>>> text
> >>>> format)
> >>>> I believe you meant, why not use the __atomic_xxx built-ins
> directly?
> >>>> The only reason for now is handling of
> >>>> __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is
> equivalent
> >>>> to rte_smp_mb which has an optimized implementation for x86.
> >>>> According to Konstantin, the compiler does not generate optimal
> code.
> >>>> Wrapping that built-in alone is going to be confusing.
> >>>>
> >>>> The wrappers also allow us to have our own implementation using
> >>>> inline assembly for compilers versions that do not support C11
> atomic
> >>>> built- ins. But, I do not know if there is a need to support those
> versions.
> >>> If I recall correctly, someone mentioned that one (or more) of the
> aging
> >> enterprise Linux distributions don't include a compiler with C11
> atomics.
> >>> I think Stephen is onto something here...
> >>>
> >>> It is silly to add wrappers like this, if the only purpose is to
> support
> >> compilers and distributions that don't properly support an official
> C standard
> >> which is nearly a decade old. The quality and quantity of the DPDK
> >> documentation for these functions (including examples, discussions
> on Stack
> >> Overflow, etc.) will be inferior to the documentation of the
> standard C11
> >> atomics, which increases the probability of incorrect use.
> >>
> >>
> >> What's being used in DPDK today, and what's being wrapped here, is
> not
> >> standard C11 atomics - it's a bunch of GCC built-ins. Nothing in the
> __
> >> namespace is in the standard. It's reserved for the implementation
> (e.g.
> >> compiler).
> > I have tried to understand what it mean by 'built-ins', but I have
> not got a good answer. So, does it mean that the built-in function
> (same symbol and API interface) may not be available in another C
> compiler? IMO, this is what matters for DPDK.
> > Currently, the same built-in functions are available in GCC and
> Clang.
> 
> 
>  From what I understand, "built-ins" is GCC terminology for
> non-standard, implementation-specific intrinsic functions, built into
> the compiler. They all reside in the __* namespace.
> 
> 
> Since GCC is the industry standard, other compilers are likely to
> follow, including built-in functions.
> 

Timeline:

December 2011: The C11 standard was published [http://www.open-std.org/jtc1/sc22/wg14/www/standards.html].

March 2012: GCC 4.7 was released, introducing the __atomic built-ins [https://gcc.gnu.org/gcc-4.7/changes.html, https://www.gnu.org/software/gcc/gcc-4.7/].

March 2013: GCC 4.8 was released [https://www.gnu.org/software/gcc/gcc-4.8/].

April 2014: GCC 4.9 was released, introducing C11 atomics (incl. <stdatomic.h>) [https://gcc.gnu.org/gcc-4.9/changes.html, https://www.gnu.org/software/gcc/gcc-4.9/].

June 2014: RHEL7 was released [https://access.redhat.com/articles/3078]. (RHEL7 Beta was released in December 2013, which probably explains why the GA release doesn’t include GCC 4.9.)

May 2019 (i.e. one year ago): RHEL8 was released [https://access.redhat.com/articles/3078].


RHEL7 includes GCC 4.8 only [https://access.redhat.com/solutions/19458], and apparently RHEL7 has not been updated to GCC 4.9 with any of its minor releases.

Should the DPDK project be stuck on "industry standard" GCC atomics, unable to use the decade old "official standard" C11 atomics, only because we want to support a six year old enterprise Linux distribution? Red Hat released a new enterprise version a year ago... perhaps it's time for their customers to upgrade, if they want to use the latest and greatest version of DPDK.

Are all the other tools required for building DPDK (in the required versions) included in RHEL7, or do we require developers to install/upgrade any other tools anyway? If so, why not also GCC? DPDK can be used in a cross compilation environment, so we are not requiring RHEL7 users to replace their GCC 4.8 default compiler.


Furthermore, the DPDK Documentation specifies GCC 4.9+ as a system requirement [https://doc.dpdk.org/guides/linux_gsg/sys_reqs.html#compilation-of-the-dpdk]. If we are stuck on GCC 4.8, the documentation should be updated.
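
For illustration, the compiler gap in the timeline above could in principle be handled with a compile-time check instead of a full wrapper layer. The following is only a sketch: the example_* names are hypothetical, and the explicit GCC version test is my assumption, added because GCC 4.8 lacks <stdatomic.h> and I am not sure its -std=c11 mode can be relied on to define __STDC_NO_ATOMICS__.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Use standard C11 atomics when the compiler claims C11 with atomics and,
 * for real GCC (not Clang), is at least version 4.9. Otherwise fall back
 * to the __atomic built-ins that GCC has provided since 4.7.
 */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
	!defined(__STDC_NO_ATOMICS__) && \
	(!defined(__GNUC__) || defined(__clang__) || \
	 __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9))
#include <stdatomic.h>
typedef _Atomic uint64_t example_counter_t;
#define EXAMPLE_FETCH_INC(c) \
	atomic_fetch_add_explicit((c), 1, memory_order_relaxed)
#define EXAMPLE_LOAD(c) atomic_load_explicit((c), memory_order_relaxed)
#else
typedef uint64_t example_counter_t;
#define EXAMPLE_FETCH_INC(c) __atomic_fetch_add((c), 1, __ATOMIC_RELAXED)
#define EXAMPLE_LOAD(c) __atomic_load_n((c), __ATOMIC_RELAXED)
#endif

/* Increment the counter and return its previous value. */
static inline uint64_t
example_counter_inc(example_counter_t *c)
{
	assert(c != NULL);
	return EXAMPLE_FETCH_INC(c);
}

/* Read the current counter value. */
static inline uint64_t
example_counter_read(example_counter_t *c)
{
	assert(c != NULL);
	return EXAMPLE_LOAD(c);
}
```

The cost of this approach is that the counter type differs between the two branches, which is exactly the kind of cruft that would have to stay for as long as the older compiler remains supported.
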


> >>
> >>> And if some compiler generates code that is suboptimal for a user,
> then it
> >> should be the choice of the user to either accept it or use a better
> compiler.
> >> Using a suboptimal compiler will not only affect the user's DPDK
> applications,
> >> but all applications developed by the user. And if he accepts it for
> his other
> >> applications, he will also accept it for his DPDK applications.
> >>> We could introduce some sort of marker or standardized comment to
> >> indicate when functions only exist for backwards compatibility with
> ancient
> >> compilers and similar, with a reference to documentation describing
> why. And
> >> when the documented preconditions are no longer relevant, e.g. when
> those
> >> particular enterprise Linux distributions become obsolete, these
> functions
> >> become obsolete too, and should be removed. However, getting rid of
> >> obsolete cruft will break the ABI. In other words: Added cruft will
> never be
> >> removed again, so think twice before adding.


* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-14  8:34                     ` Morten Brørup
@ 2020-05-14 20:16                       ` Mattias Rönnblom
  2020-05-14 21:00                         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 219+ messages in thread
From: Mattias Rönnblom @ 2020-05-14 20:16 UTC (permalink / raw)
  To: Morten Brørup, Honnappa Nagarahalli, Stephen Hemminger, Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	jerinj, ktraynor, konstantin.ananyev, maxime.coquelin,
	olivier.matz, harry.van.haaren, erik.g.carrillo, nd,
	David Christensen, david.marchand, Song Zhu, Gavin Hu,
	Jeff Brownlee, Philippe Robin, Pravin Kantak, Chen, Zhaoyan

On 2020-05-14 10:34, Morten Brørup wrote:
> + Added people from the related discussion regarding the ARM roadmap [https://mails.dpdk.org/archives/dev/2020-April/162580.html].
>
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, May 13, 2020 10:17 PM
>>
>> On 2020-05-13 21:40, Honnappa Nagarahalli wrote:
>>> <snip>
>>>
>>>>>> Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11
>> atomics
>>>>>> On Tue, May 12, 2020 at 4:03 pm, Phil Yang
>> <mailto:phil.yang@arm.com>
>>>>>> wrote:
>>>>>>
>>>>>> parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
>>>>>>
>>>>>>
>>>>>> What is the purpose of having rte_atomic at all?
>>>>>> Is this level of indirection really helping?
>>>>>> [HONNAPPA] (not sure why this email has html format, converted to
>>>>>> text
>>>>>> format)
>>>>>> I believe you meant, why not use the __atomic_xxx built-ins
>> directly?
>>>>>> The only reason for now is handling of
>>>>>> __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is
>> equivalent
>>>>>> to rte_smp_mb which has an optimized implementation for x86.
>>>>>> According to Konstantin, the compiler does not generate optimal
>> code.
>>>>>> Wrapping that built-in alone is going to be confusing.
>>>>>>
>>>>>> The wrappers also allow us to have our own implementation using
>>>>>> inline assembly for compilers versions that do not support C11
>> atomic
>>>>>> built- ins. But, I do not know if there is a need to support those
>> versions.
>>>>> If I recall correctly, someone mentioned that one (or more) of the
>> aging
>>>> enterprise Linux distributions don't include a compiler with C11
>> atomics.
>>>>> I think Stephen is onto something here...
>>>>>
>>>>> It is silly to add wrappers like this, if the only purpose is to
>> support
>>>> compilers and distributions that don't properly support an official
>> C standard
>>>> which is nearly a decade old. The quality and quantity of the DPDK
>>>> documentation for these functions (including examples, discussions
>> on Stack
>>>> Overflow, etc.) will be inferior to the documentation of the
>> standard C11
>>>> atomics, which increases the probability of incorrect use.
>>>>
>>>>
>>>> What's being used in DPDK today, and what's being wrapped here, is
>> not
>>>> standard C11 atomics - it's a bunch of GCC built-ins. Nothing in the
>> __
>>>> namespace is in the standard. It's reserved for the implementation
>> (e.g.
>>>> compiler).
>>> I have tried to understand what it mean by 'built-ins', but I have
>> not got a good answer. So, does it mean that the built-in function
>> (same symbol and API interface) may not be available in another C
>> compiler? IMO, this is what matters for DPDK.
>>> Currently, the same built-in functions are available in GCC and
>> Clang.
>>
>>
>>   From what I understand, "built-ins" is GCC terminology for
>> non-standard, implementation-specific intrinsic functions, built into
>> the compiler. They all reside in the __* namespace.
>>
>>
>> Since GCC is the industry standard, other compilers are likely to
>> follow, including built-in functions.
>>
> Timeline:
>
> December 2011: The C11 standard was published [http://www.open-std.org/jtc1/sc22/wg14/www/standards.html].
>
> March 2012: GCC 4.7 was released, introducing the __atomic built-ins [https://gcc.gnu.org/gcc-4.7/changes.html, https://www.gnu.org/software/gcc/gcc-4.7/].
>
> March 2013: GCC 4.8 was released [https://www.gnu.org/software/gcc/gcc-4.8/].
>
> April 2014: GCC 4.9 was released, introducing C11 atomics (incl. <stdatomic.h>) [https://gcc.gnu.org/gcc-4.9/changes.html, https://www.gnu.org/software/gcc/gcc-4.9/].
>
> June 2014: RHEL7 was released [https://access.redhat.com/articles/3078]. (RHEL7 Beta was released in December 2013, which probably explains why the GA release doesn’t include GCC 4.9.)
>
> May 2019 (i.e. one year ago): RHEL8 was released [https://access.redhat.com/articles/3078].
>
>
> RHEL7 includes GCC 4.8 only [https://access.redhat.com/solutions/19458], and apparently RHEL7 has not been updated to GCC 4.9 with any of its minor releases.
>
> Should the DPDK project be stuck on "industry standard" GCC atomics, unable to use the decade old "official standard" C11 atomics, only because we want to support a six year old enterprise Linux distribution? Red Hat released a new enterprise version a year ago... perhaps it's time for their customers to upgrade, if they want to use the latest and greatest version of DPDK.


Just to be clear - I wasn't arguing for the direct use of GCC built-ins.


The GCC __atomic built-ins (called directly, or via a DPDK wrapper) do 
have some advantages over C11 atomics. One is that GCC supports 128-bit 
atomic operations, on certain architectures. <rte_atomic.h> already has 
a 128-bit compare-exchange. Also, since the GCC built-ins seem not to 
bother with architectures where atomics would be implemented by means of 
a lock, they are a little easier to use than <stdatomic.h>.
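
As a minimal sketch of that difference (64-bit only; the names owner_builtin, owner_c11, claim_builtin and claim_c11 are made up for illustration and are not DPDK APIs): the built-in works on a plain integer, while <stdatomic.h> needs an _Atomic-qualified object and leaves it to the implementation whether that object is lock-free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical owner fields; 0 means "unowned". */
static uint64_t owner_builtin;      /* plain type, used with __atomic built-ins */
static _Atomic uint64_t owner_c11;  /* _Atomic type, used with <stdatomic.h> */

/* Claim ownership with the GCC/Clang built-in; memory orders are mandatory. */
static bool claim_builtin(uint64_t id)
{
	uint64_t expected = 0;

	assert(id != 0); /* 0 is reserved for "unowned" */
	return __atomic_compare_exchange_n(&owner_builtin, &expected, id,
			false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Same operation with the C11 API on an _Atomic object. */
static bool claim_c11(uint64_t id)
{
	uint64_t expected = 0;

	assert(id != 0);
	return atomic_compare_exchange_strong_explicit(&owner_c11, &expected,
			id, memory_order_acquire, memory_order_relaxed);
}

/* C11 lock-free level for long long: 0 = never, 1 = sometimes, 2 = always. */
static int c11_llong_lock_free(void)
{
	return ATOMIC_LLONG_LOCK_FREE;
}
```

With <stdatomic.h>, a caller that cares about performance has to check the lock-free level (or atomic_is_lock_free()) for each type it uses. The 128-bit case is deliberately left out of this sketch.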


> Are all the other tools required for building DPDK (in the required versions) included in RHEL7, or do we require developers to install/upgrade any other tools anyway? If so, why not also GCC? DPDK can be used in a cross compilation environment, so we are not requiring RHEL7 users to replace their GCC 4.8 default compiler.
>
>
> Furthermore, the DPDK Documentation specifies GCC 4.9+ as a system requirement [https://doc.dpdk.org/guides/linux_gsg/sys_reqs.html#compilation-of-the-dpdk]. If we are stuck on GCC 4.8, the documentation should be updated.
>
>
>>>>> And if some compiler generates code that is suboptimal for a user,
>> then it
>>>> should be the choice of the user to either accept it or use a better
>> compiler.
>>>> Using a suboptimal compiler will not only affect the user's DPDK
>> applications,
>>>> but all applications developed by the user. And if he accepts it for
>> his other
>>>> applications, he will also accept it for his DPDK applications.
>>>>> We could introduce some sort of marker or standardized comment to
>>>> indicate when functions only exist for backwards compatibility with
>> ancient
>>>> compilers and similar, with a reference to documentation describing
>> why. And
>>>> when the documented preconditions are no longer relevant, e.g. when
>> those
>>>> particular enterprise Linux distributions become obsolete, these
>> functions
>>>> become obsolete too, and should be removed. However, getting rid of
>>>> obsolete cruft will break the ABI. In other words: Added cruft will
>> never be
>>>> removed again, so think twice before adding.




* Re: [dpdk-dev] [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
  2020-05-14 20:16                       ` Mattias Rönnblom
@ 2020-05-14 21:00                         ` Honnappa Nagarahalli
  0 siblings, 0 replies; 219+ messages in thread
From: Honnappa Nagarahalli @ 2020-05-14 21:00 UTC (permalink / raw)
  To: Mattias Rönnblom, Morten Brørup, Stephen Hemminger, Phil Yang
  Cc: thomas, dev, bruce.richardson, ferruh.yigit, hemant.agrawal,
	jerinj, ktraynor, konstantin.ananyev, maxime.coquelin,
	olivier.matz, harry.van.haaren, erik.g.carrillo, nd,
	David Christensen, david.marchand, Song Zhu, Gavin Hu,
	Jeff Brownlee, Philippe Robin, Pravin Kantak, Chen, Zhaoyan,
	Honnappa Nagarahalli, nd

<snip>

> Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11 atomics
> 
> On 2020-05-14 10:34, Morten Brørup wrote:
> > + Added people from the related discussion regarding the ARM roadmap
> [https://mails.dpdk.org/archives/dev/2020-April/162580.html].
> >
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Wednesday, May 13, 2020 10:17 PM
> >>
> >> On 2020-05-13 21:40, Honnappa Nagarahalli wrote:
> >>> <snip>
> >>>
> >>>>>> Subject: Re: [PATCH v4 4/4] eal/atomic: add wrapper for c11
> >> atomics
> >>>>>> On Tue, May 12, 2020 at 4:03 pm, Phil Yang
> >> <mailto:phil.yang@arm.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> parameter. Signed-off-by: Phil Yang <mailto:phil.yang@arm.com>
> >>>>>>
> >>>>>>
> >>>>>> What is the purpose of having rte_atomic at all?
> >>>>>> Is this level of indirection really helping?
> >>>>>> [HONNAPPA] (not sure why this email has html format, converted to
> >>>>>> text
> >>>>>> format)
> >>>>>> I believe you meant, why not use the __atomic_xxx built-ins
> >> directly?
> >>>>>> The only reason for now is handling of
> >>>>>> __atomic_thread_fence(__ATOMIC_SEQ_CST) for x86. This is
> >> equivalent
> >>>>>> to rte_smp_mb which has an optimized implementation for x86.
> >>>>>> According to Konstantin, the compiler does not generate optimal
> >> code.
> >>>>>> Wrapping that built-in alone is going to be confusing.
> >>>>>>
> >>>>>> The wrappers also allow us to have our own implementation using
> >>>>>> inline assembly for compiler versions that do not support C11
> >> atomic
> >>>>>> built-ins. But, I do not know if there is a need to support
> >>>>>> those
> >> versions.
> >>>>> If I recall correctly, someone mentioned that one (or more) of the
> >> aging
> >>>> enterprise Linux distributions don't include a compiler with C11
> >> atomics.
> >>>>> I think Stephen is onto something here...
> >>>>>
> >>>>> It is silly to add wrappers like this, if the only purpose is to
> >> support
> >>>> compilers and distributions that don't properly support an official
> >> C standard
> >>>> which is nearly a decade old. The quality and quantity of the DPDK
> >>>> documentation for these functions (including examples, discussions
> >> on Stack
> >>>> Overflow, etc.) will be inferior to the documentation of the
> >> standard C11
> >>>> atomics, which increases the probability of incorrect use.
> >>>>
> >>>>
> >>>> What's being used in DPDK today, and what's being wrapped here, is
> >> not
> >>>> standard C11 atomics - it's a bunch of GCC built-ins. Nothing in
> >>>> the
> >> __
> >>>> namespace is in the standard. It's reserved for the implementation
> >> (e.g.
> >>>> compiler).
> >>> I have tried to understand what is meant by 'built-ins', but I have
> >> not got a good answer. So, does it mean that the built-in function
> >> (same symbol and API interface) may not be available in another C
> >> compiler? IMO, this is what matters for DPDK.
> >>> Currently, the same built-in functions are available in GCC and
> >> Clang.
> >>
> >>
> >>   From what I understand, "built-ins" is GCC terminology for
> >> non-standard, implementation-specific intrinsic functions, built into
> >> the compiler. They all reside in the __* namespace.
> >>
> >>
> >> Since GCC is the industry standard, other compilers are likely to
> >> follow, including built-in functions.
> >>
> > Timeline:
> >
> > December 2011: The C11 standard was published
> [http://www.open-std.org/jtc1/sc22/wg14/www/standards.html].
> >
> > March 2012: GCC 4.7 was released, introducing the __atomic built-ins
> [https://gcc.gnu.org/gcc-4.7/changes.html,
> https://www.gnu.org/software/gcc/gcc-4.7/].
> >
> > March 2013: GCC 4.8 was released [https://www.gnu.org/software/gcc/gcc-
> 4.8/].
> >
> > April 2014: GCC 4.9 was released, introducing C11 atomics (incl.
> <stdatomic.h>) [https://gcc.gnu.org/gcc-4.9/changes.html,
> https://www.gnu.org/software/gcc/gcc-4.9/].
> >
> > June 2014: RHEL7 was released
> > [https://access.redhat.com/articles/3078]. (RHEL7 Beta was released in
> > December 2013, which probably explains why the GA release doesn’t
> > include GCC 4.9.)
> >
> > May 2019 (i.e. one year ago): RHEL8 was released
> [https://access.redhat.com/articles/3078].
> >
> >
> > RHEL7 includes GCC 4.8 only [https://access.redhat.com/solutions/19458],
> and apparently RHEL7 has not been updated to GCC 4.9 with any of its minor
> releases.
> >
> > Should the DPDK project be stuck on "industry standard" GCC atomics,
> unable to use the decade old "official standard" C11 atomics, only because
> we want to support a six year old enterprise Linux distribution? Red Hat
> released a new enterprise version a year ago... perhaps it's time for their
> customers to upgrade, if they want to use the latest and greatest version of
> DPDK.
> 
> 
> Just to be clear - I wasn't arguing for the direct use of GCC built-ins.
> 
> 
> The GCC __atomic built-ins (called directly, or via a DPDK wrapper) do have
> some advantages over C11 atomics. One is that GCC supports 128-bit atomic
> operations, on certain architectures. <rte_atomic.h> already has a 128-bit
> compare-exchange. Also, since the GCC built-ins seem not to bother with
> architectures where atomics would be implemented by means of a lock, they
> are a little easier to use than <stdatomic.h>.
I do not think we should focus on built-ins vs. APIs.

1) Built-ins are supported by both GCC and Clang today. If a new compiler appears in the future, it will most likely support these built-ins as well.
2) I like the fact that the built-ins always require the memory order parameter. stdatomic.h provides some APIs that do not take a memory order (just like the rte_atomicNN_xxx APIs). We would need to add checks to the checkpatch script to prevent use of such APIs.
3) If we need to replace the built-ins with APIs in the future, it is a simple search and replace.

If the decision to go with built-ins turns out to be a bad one, it can be corrected easily.
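
To illustrate point 2, here is a rough sketch (the reference counter and the ref_* helper names are hypothetical, not taken from any patch in this series). The generic stdatomic.h call silently picks memory_order_seq_cst, whereas the _explicit variant and the built-in both force the author to choose an ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t refcnt_c11; /* hypothetical counter for the C11 calls */
static uint32_t refcnt_builtin;     /* hypothetical counter for the built-in */

/* Generic C11 call: no order argument, so it is implicitly seq_cst. */
static uint32_t ref_get_implicit(void)
{
	return atomic_fetch_add(&refcnt_c11, 1);
}

/* Explicit C11 call: the ordering is visible at the call site. */
static uint32_t ref_get_explicit(void)
{
	return atomic_fetch_add_explicit(&refcnt_c11, 1, memory_order_relaxed);
}

/* Built-in: the memory order argument cannot be omitted. */
static uint32_t ref_get_builtin(void)
{
	return __atomic_fetch_add(&refcnt_builtin, 1, __ATOMIC_RELAXED);
}

/* Drop one reference; the counter must not already be zero. */
static void ref_put(void)
{
	uint32_t old = atomic_fetch_sub_explicit(&refcnt_c11, 1,
			memory_order_release);

	assert(old > 0);
	(void)old; /* silence unused warning when NDEBUG is defined */
}
```

All three fetch-add variants return the value held before the increment. A checkpatch rule would only need to flag the first form.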

I think we should focus on the compiler not generating optimal code for __atomic_thread_fence(__ATOMIC_SEQ_CST) on x86. This is the main reason for these wrappers. From what I have seen, DPDK has tried to provide solutions internally for performance issues caused by compilers.
Given that we have provided 'rte_atomic128_cmp_exchange' (because neither compiler was generating the 128-bit compare-exchange), I would say we should just provide a wrapper for the '__atomic_thread_fence' built-in.
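
Such a wrapper could look roughly like the sketch below. The name sketch_thread_fence and the flag variables are placeholders, and the locked add to the stack is an assumption about what a cheaper x86 full barrier looks like (similar in spirit to rte_smp_mb), not the final patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical fence wrapper: special-case SEQ_CST on x86-64, defer to the
 * compiler built-in for every other memory order and architecture.
 */
static inline void sketch_thread_fence(int memorder)
{
	assert(memorder >= __ATOMIC_RELAXED && memorder <= __ATOMIC_SEQ_CST);
#if defined(__x86_64__)
	if (memorder == __ATOMIC_SEQ_CST) {
		/* Locked add of 0: memory content is unchanged, but the
		 * instruction acts as a full barrier. */
		__asm__ volatile("lock addl $0, -128(%%rsp)" ::: "memory", "cc");
		return;
	}
#endif
	__atomic_thread_fence(memorder);
}

static uint32_t flag_a, flag_b; /* hypothetical per-thread flags */

/* Publish our flag, then read the peer flag; needs store-load ordering. */
static uint32_t announce_and_check(void)
{
	__atomic_store_n(&flag_a, 1, __ATOMIC_RELAXED);
	sketch_thread_fence(__ATOMIC_SEQ_CST);
	return __atomic_load_n(&flag_b, __ATOMIC_RELAXED);
}
```

Callers would keep using the __atomic_xxx built-ins for everything else and only go through the wrapper for fences.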

> 
> 
> > Are all the other tools required for building DPDK (in the required versions)
> included in RHEL7, or do we require developers to install/upgrade any other
> tools anyway? If so, why not also GCC? DPDK can be used in a cross
> compilation environment, so we are not requiring RHEL7 users to replace
> their GCC 4.8 default compiler.
I have not used RHEL7. Intel CI uses RHEL7, so maybe they can answer.

> >
> >
> > Furthermore, the DPDK Documentation specifies GCC 4.9+ as a system
> requirement [https://doc.dpdk.org/guides/linux_gsg/sys_reqs.html#compilation-of-the-dpdk].
> If we are stuck on GCC 4.8, the documentation should be updated.
This is interesting. Then the CI systems should be upgraded to use GCC 4.9+.

> >
> >
> >>>>> And if some compiler generates code that is suboptimal for a user,
> >> then it
> >>>> should be the choice of the user to either accept it or use a
> >>>> better
> >> compiler.
> >>>>