DPDK patches and discussions
 help / color / Atom feed
* [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal
@ 2020-03-10 17:49 Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 01/10] doc: add generic atomic deprecation section Phil Yang
                   ` (10 more replies)
  0 siblings, 11 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

DPDK provides generic rte_atomic APIs to do several atomic operations.
These APIs are using the deprecated __sync built-ins and enforce full
memory barriers on aarch64. However, full barriers are not necessary
in many use cases. In order to address such use cases, C language offers
C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
by making use of the memory ordering parameter provided by the user.
Various patches submitted in the past [1] and the patches in this series
indicate significant performance gains on multiple aarch64 CPUs and no
performance loss on x86.

But the existing rte_atomic API implementations cannot be changed as the
APIs do not take the memory ordering parameter. The only choice available
is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
order to make this change, the following steps are proposed:

[1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
APIs (a script is added to flag the usages).
[2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

This patchset contains:
1) the checkpatch script changes to flag rte_atomic API usage in patches.
2) changes to programmer guide describing writing efficient code for aarch64.
3) changes to various libraries to make use of c11 atomic APIs.

We are planning to replicate this idea across all the other libraries,
drivers, examples, test applications. In the next phase, we will add
changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

Honnappa Nagarahalli (2):
  service: avoid race condition for MT unsafe service
  service: identify service running on another core correctly

Phil Yang (8):
  doc: add generic atomic deprecation section
  devtools: prevent use of rte atomic APIs in future patches
  vhost: optimize broadcast rarp sync with c11 atomic
  ipsec: optimize with c11 atomic for sa outbound sqn update
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 one-way barrier
  service: relax barriers with C11 atomic operations

 devtools/checkpatches.sh                         |   9 ++
 doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
 lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
 lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
 lib/librte_ipsec/sa.h                            |   2 +-
 lib/librte_vhost/vhost.h                         |   2 +-
 lib/librte_vhost/vhost_user.c                    |   7 +-
 lib/librte_vhost/virtio_net.c                    |  16 ++-
 8 files changed, 175 insertions(+), 99 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 01/10] doc: add generic atomic deprecation section
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Add deprecating the generic rte_atomic_xx APIs to c11 atomic built-ins
guide and examples.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 60 +++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..b278bc6 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,58 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone, the ordering of the memory operations
+does not matter. For example the packets statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+It just updates the number of transmitted packets, no subsequent logic depends
+on these counters. So the RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+        ...
+        ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
+            ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
+critical section, but the memory operations in the critical section cannot move
+above the lock. In this case, the full memory barrier in the CAS operation can
+be replaced to ACQUIRE. On the other hand, the memory operations after the
+`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
+critical section cannot move below the unlock. So the full barrier in the STORE
+operation can be replaced with RELEASE.
+
 Coding Considerations
 ---------------------
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 02/10] devtools: prevent use of rte atomic APIs in future patches
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 01/10] doc: add generic atomic deprecation section Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

In order to deprecate the rte_atomic APIs, prevent the patches
from using rte_atomic APIs.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 devtools/checkpatches.sh | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/devtools/checkpatches.sh b/devtools/checkpatches.sh
index 1794468..493f48e 100755
--- a/devtools/checkpatches.sh
+++ b/devtools/checkpatches.sh
@@ -61,6 +61,15 @@ check_forbidden_additions() { # <patch>
 		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
 		"$1" || res=1
 
+	# refrain from new additions of 16/32/64 bits rte_atomic_xxx()
+	# multiple folders and expressions are separated by spaces
+	awk -v FOLDERS="lib drivers app examples" \
+		-v EXPRESSIONS="rte_atomic[0-9][0-9]_.*\\\(" \
+		-v RET_ON_FAIL=1 \
+		-v MESSAGE='Using c11 atomic built-ins instead of rte_atomic' \
+		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
+		"$1" || res=1
+
 	# svg figures must be included with wildcard extension
 	# because of png conversion for pdf docs
 	awk -v FOLDERS='doc' \
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 03/10] vhost: optimize broadcast rarp sync with c11 atomic
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 01/10] doc: add generic atomic deprecation section Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The rarp packet broadcast flag is synchronized with rte_atomic_XX APIs
which is a full barrier, DMB, on aarch64. This patch optimized it with
c11 atomic one-way barrier.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
---
 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/vhost_user.c |  7 +++----
 lib/librte_vhost/virtio_net.c | 16 +++++++++-------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2087d14..0e22125 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,7 +350,7 @@ struct virtio_net {
 	uint32_t		flags;
 	uint16_t		vhost_hlen;
 	/* to tell if we need broadcast rarp packet */
-	rte_atomic16_t		broadcast_rarp;
+	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
 	int			extbuf;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index bd1be01..857187d 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -2145,11 +2145,10 @@ vhost_user_send_rarp(struct virtio_net **pdev, struct VhostUserMsg *msg,
 	 * Set the flag to inject a RARP broadcast packet at
 	 * rte_vhost_dequeue_burst().
 	 *
-	 * rte_smp_wmb() is for making sure the mac is copied
-	 * before the flag is set.
+	 * __ATOMIC_RELEASE ordering is for making sure the mac is
+	 * copied before the flag is set.
 	 */
-	rte_smp_wmb();
-	rte_atomic16_set(&dev->broadcast_rarp, 1);
+	__atomic_store_n(&dev->broadcast_rarp, 1, __ATOMIC_RELEASE);
 	did = dev->vdpa_dev_id;
 	vdpa_dev = rte_vdpa_get_device(did);
 	if (vdpa_dev && vdpa_dev->ops->migration_done)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 37c47c7..d20f60c 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2203,6 +2203,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct virtio_net *dev;
 	struct rte_mbuf *rarp_mbuf = NULL;
 	struct vhost_virtqueue *vq;
+	int success = 1;
 
 	dev = get_device(vid);
 	if (!dev)
@@ -2249,16 +2250,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	 *
 	 * broadcast_rarp shares a cacheline in the virtio_net structure
 	 * with some fields that are accessed during enqueue and
-	 * rte_atomic16_cmpset() causes a write if using cmpxchg. This could
-	 * result in false sharing between enqueue and dequeue.
+	 * __atomic_compare_exchange_n causes a write if performed compare
+	 * and exchange. This could result in false sharing between enqueue
+	 * and dequeue.
 	 *
 	 * Prevent unnecessary false sharing by reading broadcast_rarp first
-	 * and only performing cmpset if the read indicates it is likely to
-	 * be set.
+	 * and only performing compare and exchange if the read indicates it
+	 * is likely to be set.
 	 */
-	if (unlikely(rte_atomic16_read(&dev->broadcast_rarp) &&
-			rte_atomic16_cmpset((volatile uint16_t *)
-				&dev->broadcast_rarp.cnt, 1, 0))) {
+	if (unlikely(__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_ACQUIRE) &&
+			__atomic_compare_exchange_n(&dev->broadcast_rarp,
+			&success, 0, 0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))) {
 
 		rarp_mbuf = rte_net_make_rarp_packet(mbuf_pool, &dev->mac);
 		if (rarp_mbuf == NULL) {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (2 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 05/10] service: remove rte prefix from static functions Phil Yang
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

For SA outbound packets, rte_atomic64_add_return is used to generate
SQN atomically. This introduced an unnecessary full barrier by calling
the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
patch optimized it with c11 atomic and eliminated the expensive barrier
for aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_ipsec/ipsec_sqn.h | 3 ++-
 lib/librte_ipsec/sa.h        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
index 0c2f76a..e884af7 100644
--- a/lib/librte_ipsec/ipsec_sqn.h
+++ b/lib/librte_ipsec/ipsec_sqn.h
@@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
 
 	n = *num;
 	if (SQN_ATOMIC(sa))
-		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
+		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
+			__ATOMIC_RELAXED);
 	else {
 		sqn = sa->sqn.outb.raw + n;
 		sa->sqn.outb.raw = sqn;
diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
index d22451b..cab9a2e 100644
--- a/lib/librte_ipsec/sa.h
+++ b/lib/librte_ipsec/sa.h
@@ -120,7 +120,7 @@ struct rte_ipsec_sa {
 	 */
 	union {
 		union {
-			rte_atomic64_t atom;
+			uint64_t atom;
 			uint64_t raw;
 		} outb;
 		struct {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 05/10] service: remove rte prefix from static functions
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (3 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 06/10] service: remove redundant code Phil Yang
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, stable

Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 7e537b8..a691f5d 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -333,7 +333,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -376,10 +376,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -433,7 +433,7 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
@@ -703,7 +703,7 @@ rte_service_lcore_start(uint32_t lcore)
 	 */
 	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -782,7 +782,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
+service_dump_one(FILE *f, struct rte_service_spec_impl *s,
 		     uint64_t all_cycles, uint32_t reset)
 {
 	/* avoid divide by zero */
@@ -815,7 +815,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, 0, reset);
 	return 0;
 }
 
@@ -873,7 +873,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, total_cycles, reset);
 		return 0;
 	}
 
@@ -883,7 +883,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], total_cycles, reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 06/10] service: remove redundant code
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (4 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 05/10] service: remove rte prefix from static functions Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 07/10] service: avoid race condition for MT unsafe service Phil Yang
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Stable

The service id validation is verified in the calling function, remove
the redundant code inside the service_update function.

Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: Stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index a691f5d..6990dc2 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -549,21 +549,10 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
+	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
 	if (!lcore_states[lcore].is_service_core)
@@ -595,19 +584,23 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 07/10] service: avoid race condition for MT unsafe service
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (5 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 06/10] service: remove redundant code Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 08/10] service: identify service running on another core correctly Phil Yang
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

There has possible that a MT unsafe service might get configured to
run on another core while the service is running currently. This
might result in the MT unsafe service running on multiple cores
simultaneously. Use 'execute_lock' always when the service is
MT unsafe.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/common/rte_service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6990dc2..b37fc56 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -367,12 +371,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 08/10] service: identify service running on another core correctly
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (6 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 07/10] service: avoid race condition for MT unsafe service Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 09/10] service: optimize with c11 one-way barrier Phil Yang
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes more costlier than using atomic operations. Assume that the
application passes the right parameters and reduce the number of
instructions for all cases.

Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b37fc56..0186024 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -357,7 +357,7 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -371,7 +371,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -409,24 +409,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to indicate that the service is
+	 * is running on a core.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -446,7 +436,7 @@ service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 09/10] service: optimize with c11 one-way barrier
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (7 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 08/10] service: identify service running on another core correctly Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 10/10] service: relax barriers with C11 atomic operations Phil Yang
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The num_mapped_cores and execute_lock are synchronized with rte_atomic_XX
APIs which is a full barrier, DMB, on aarch64. This patch optimized it with
c11 atomic one-way barrier.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 0186024..efb3c9f 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -42,7 +42,7 @@ struct rte_service_spec_impl {
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	uint32_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +54,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	int32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -329,7 +329,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -372,11 +373,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		uint32_t expected = 0;
+		/* ACQUIRE ordering here is to prevent the callback
+		 * function from hoisting up before the execute_lock
+		 * setting.
+		 */
+		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
+			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		/* RELEASE ordering here is used to pair with ACQUIRE
+		 * above to achieve lock semantic.
+		 */
+		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -412,11 +422,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to indicate that the service is
 	 * is running on a core.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
 
 	return ret;
 }
@@ -549,24 +559,32 @@ service_update(uint32_t sid, uint32_t lcore,
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		/* When multiple threads try to update the same lcore
+		 * service concurrently, e.g. set lcore map followed
+		 * by clear lcore map, the unsynchronized service_mask
+		 * values have issues on the num_mapped_cores value
+		 * consistency. So we use ACQUIRE ordering to pair with
+		 * the RELEASE ordering to synchronize the service_mask.
+		 */
+		uint64_t lcore_mapped = __atomic_load_n(
+					&lcore_states[lcore].service_mask,
+					__ATOMIC_ACQUIRE) & sid_mask;
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -622,7 +640,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -705,7 +724,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH 10/10] service: relax barriers with C11 atomic operations
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (8 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 09/10] service: optimize with c11 one-way barrier Phil Yang
@ 2020-03-10 17:49 ` Phil Yang
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-10 17:49 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

To guarantee the inter-threads visibility of the shareable domain, it
uses a lot of rte_smp_r/wmb in the service library. This patch relaxed
these barriers for service by using c11 atomic one-way barrier operations.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 45 ++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index efb3c9f..68542b0 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -176,9 +176,11 @@ rte_service_set_stats_enable(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_STATS_ENABLED;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_STATS_ENABLED,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_STATS_ENABLED);
+		__atomic_and_fetch(&s->internal_flags,
+			~(SERVICE_F_STATS_ENABLED), __ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -190,9 +192,11 @@ rte_service_set_runstate_mapped_check(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_START_CHECK;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_START_CHECK,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_START_CHECK);
+		__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_START_CHECK),
+			__ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -261,8 +265,8 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
-	rte_service_count++;
+	/* make sure the counter update after the state change. */
+	__atomic_add_fetch(&rte_service_count, 1, __ATOMIC_RELEASE);
 
 	if (id_ptr)
 		*id_ptr = free_slot;
@@ -278,9 +282,10 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
-	s->internal_flags &= ~(SERVICE_F_REGISTERED);
+	/* make sure the counter update before the state change. */
+	__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_REGISTERED),
+			   __ATOMIC_RELEASE);
 
 	/* clear the run-bit in all cores */
 	for (i = 0; i < RTE_MAX_LCORE; i++)
@@ -298,11 +303,12 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -313,11 +319,12 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -439,7 +446,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (__atomic_load_n(&cs->runstate,
+		    __ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -450,8 +458,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -660,9 +666,8 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+			__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal
  2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                   ` (9 preceding siblings ...)
  2020-03-10 17:49 ` [dpdk-dev] [PATCH 10/10] service: relax barriers with C11 atomic operations Phil Yang
@ 2020-03-12  7:44 ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 01/10] doc: add generic atomic deprecation section Phil Yang
                     ` (10 more replies)
  10 siblings, 11 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

DPDK provides generic rte_atomic APIs to do several atomic operations.
These APIs are using the deprecated __sync built-ins and enforce full
memory barriers on aarch64. However, full barriers are not necessary
in many use cases. In order to address such use cases, C language offers
C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
by making use of the memory ordering parameter provided by the user.
Various patches submitted in the past [1] and the patches in this series
indicate significant performance gains on multiple aarch64 CPUs and no
performance loss on x86.

But the existing rte_atomic API implementations cannot be changed as the
APIs do not take the memory ordering parameter. The only choice available
is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
order to make this change, the following steps are proposed:

[1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
APIs (a script is added to flag the usages).
[2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

This patchset contains:
1) the checkpatch script changes to flag rte_atomic API usage in patches.
2) changes to programmer guide describing writing efficient code for aarch64.
3) changes to various libraries to make use of c11 atomic APIs.

We are planning to replicate this idea across all the other libraries,
drivers, examples, test applications. In the next phase, we will add
changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

v2:
1. fix Clang '-Wincompatible-pointer-types' WARNING.
2. fix typos.

Honnappa Nagarahalli (2):
  service: avoid race condition for MT unsafe service
  service: identify service running on another core correctly

Phil Yang (8):
  doc: add generic atomic deprecation section
  devtools: prevent use of rte atomic APIs in future patches
  vhost: optimize broadcast rarp sync with c11 atomic
  ipsec: optimize with c11 atomic for sa outbound sqn update
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 one-way barrier
  service: relax barriers with C11 atomic operations

 devtools/checkpatches.sh                         |   9 ++
 doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
 lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
 lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
 lib/librte_ipsec/sa.h                            |   2 +-
 lib/librte_vhost/vhost.h                         |   2 +-
 lib/librte_vhost/vhost_user.c                    |   7 +-
 lib/librte_vhost/virtio_net.c                    |  16 ++-
 8 files changed, 175 insertions(+), 99 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 01/10] doc: add generic atomic deprecation section
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Add deprecating the generic rte_atomic_xx APIs to c11 atomic built-ins
guide and examples.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 60 +++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..b278bc6 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,58 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone, the ordering of the memory operations
+does not matter. For example the packets statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+It just updates the number of transmitted packets, no subsequent logic depends
+on these counters. So the RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+        ...
+        ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
+            ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
+critical section, but the memory operations in the critical section cannot move
+above the lock. In this case, the full memory barrier in the CAS operation can
+be replaced to ACQUIRE. On the other hand, the memory operations after the
+`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
+critical section cannot move below the unlock. So the full barrier in the STORE
+operation can be replaced with RELEASE.
+
 Coding Considerations
 ---------------------
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 02/10] devtools: prevent use of rte atomic APIs in future patches
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 01/10] doc: add generic atomic deprecation section Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

In order to deprecate the rte_atomic APIs, prevent the patches
from using rte_atomic APIs.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 devtools/checkpatches.sh | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/devtools/checkpatches.sh b/devtools/checkpatches.sh
index 1794468..493f48e 100755
--- a/devtools/checkpatches.sh
+++ b/devtools/checkpatches.sh
@@ -61,6 +61,15 @@ check_forbidden_additions() { # <patch>
 		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
 		"$1" || res=1
 
+	# refrain from new additions of 16/32/64 bits rte_atomic_xxx()
+	# multiple folders and expressions are separated by spaces
+	awk -v FOLDERS="lib drivers app examples" \
+		-v EXPRESSIONS="rte_atomic[0-9][0-9]_.*\\\(" \
+		-v RET_ON_FAIL=1 \
+		-v MESSAGE='Using c11 atomic built-ins instead of rte_atomic' \
+		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
+		"$1" || res=1
+
 	# svg figures must be included with wildcard extension
 	# because of png conversion for pdf docs
 	awk -v FOLDERS='doc' \
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 03/10] vhost: optimize broadcast rarp sync with c11 atomic
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 01/10] doc: add generic atomic deprecation section Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The rarp packet broadcast flag is synchronized with rte_atomic_XX APIs
which is a full barrier, DMB, on aarch64. This patch optimized it with
c11 atomic one-way barrier.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
---
 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/vhost_user.c |  7 +++----
 lib/librte_vhost/virtio_net.c | 16 +++++++++-------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2087d14..0e22125 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,7 +350,7 @@ struct virtio_net {
 	uint32_t		flags;
 	uint16_t		vhost_hlen;
 	/* to tell if we need broadcast rarp packet */
-	rte_atomic16_t		broadcast_rarp;
+	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
 	int			extbuf;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index bd1be01..857187d 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -2145,11 +2145,10 @@ vhost_user_send_rarp(struct virtio_net **pdev, struct VhostUserMsg *msg,
 	 * Set the flag to inject a RARP broadcast packet at
 	 * rte_vhost_dequeue_burst().
 	 *
-	 * rte_smp_wmb() is for making sure the mac is copied
-	 * before the flag is set.
+	 * __ATOMIC_RELEASE ordering is for making sure the mac is
+	 * copied before the flag is set.
 	 */
-	rte_smp_wmb();
-	rte_atomic16_set(&dev->broadcast_rarp, 1);
+	__atomic_store_n(&dev->broadcast_rarp, 1, __ATOMIC_RELEASE);
 	did = dev->vdpa_dev_id;
 	vdpa_dev = rte_vdpa_get_device(did);
 	if (vdpa_dev && vdpa_dev->ops->migration_done)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 37c47c7..fa10deb 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2203,6 +2203,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct virtio_net *dev;
 	struct rte_mbuf *rarp_mbuf = NULL;
 	struct vhost_virtqueue *vq;
+	int16_t success = 1;
 
 	dev = get_device(vid);
 	if (!dev)
@@ -2249,16 +2250,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	 *
 	 * broadcast_rarp shares a cacheline in the virtio_net structure
 	 * with some fields that are accessed during enqueue and
-	 * rte_atomic16_cmpset() causes a write if using cmpxchg. This could
-	 * result in false sharing between enqueue and dequeue.
+	 * __atomic_compare_exchange_n causes a write if performed compare
+	 * and exchange. This could result in false sharing between enqueue
+	 * and dequeue.
 	 *
 	 * Prevent unnecessary false sharing by reading broadcast_rarp first
-	 * and only performing cmpset if the read indicates it is likely to
-	 * be set.
+	 * and only performing compare and exchange if the read indicates it
+	 * is likely to be set.
 	 */
-	if (unlikely(rte_atomic16_read(&dev->broadcast_rarp) &&
-			rte_atomic16_cmpset((volatile uint16_t *)
-				&dev->broadcast_rarp.cnt, 1, 0))) {
+	if (unlikely(__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_ACQUIRE) &&
+			__atomic_compare_exchange_n(&dev->broadcast_rarp,
+			&success, 0, 0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))) {
 
 		rarp_mbuf = rte_net_make_rarp_packet(mbuf_pool, &dev->mac);
 		if (rarp_mbuf == NULL) {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (2 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 05/10] service: remove rte prefix from static functions Phil Yang
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

For SA outbound packets, rte_atomic64_add_return is used to generate
SQN atomically. This introduced an unnecessary full barrier by calling
the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
patch optimized it with c11 atomic and eliminated the expensive barrier
for aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_ipsec/ipsec_sqn.h | 3 ++-
 lib/librte_ipsec/sa.h        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
index 0c2f76a..e884af7 100644
--- a/lib/librte_ipsec/ipsec_sqn.h
+++ b/lib/librte_ipsec/ipsec_sqn.h
@@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
 
 	n = *num;
 	if (SQN_ATOMIC(sa))
-		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
+		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
+			__ATOMIC_RELAXED);
 	else {
 		sqn = sa->sqn.outb.raw + n;
 		sa->sqn.outb.raw = sqn;
diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
index d22451b..cab9a2e 100644
--- a/lib/librte_ipsec/sa.h
+++ b/lib/librte_ipsec/sa.h
@@ -120,7 +120,7 @@ struct rte_ipsec_sa {
 	 */
 	union {
 		union {
-			rte_atomic64_t atom;
+			uint64_t atom;
 			uint64_t raw;
 		} outb;
 		struct {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 05/10] service: remove rte prefix from static functions
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (3 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 06/10] service: remove redundant code Phil Yang
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, stable

Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 7e537b8..a691f5d 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -333,7 +333,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -376,10 +376,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -433,7 +433,7 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
@@ -703,7 +703,7 @@ rte_service_lcore_start(uint32_t lcore)
 	 */
 	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -782,7 +782,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
+service_dump_one(FILE *f, struct rte_service_spec_impl *s,
 		     uint64_t all_cycles, uint32_t reset)
 {
 	/* avoid divide by zero */
@@ -815,7 +815,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, 0, reset);
 	return 0;
 }
 
@@ -873,7 +873,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, total_cycles, reset);
 		return 0;
 	}
 
@@ -883,7 +883,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], total_cycles, reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 06/10] service: remove redundant code
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (4 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 05/10] service: remove rte prefix from static functions Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 07/10] service: avoid race condition for MT unsafe service Phil Yang
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Stable

The service id validation is verified in the calling function, remove
the redundant code inside the service_update function.

Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: Stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index a691f5d..6990dc2 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -549,21 +549,10 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
+	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
 	if (!lcore_states[lcore].is_service_core)
@@ -595,19 +584,23 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 07/10] service: avoid race condition for MT unsafe service
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (5 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 06/10] service: remove redundant code Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 08/10] service: identify service running on another core correctly Phil Yang
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

There has possible that a MT unsafe service might get configured to
run on another core while the service is running currently. This
might result in the MT unsafe service running on multiple cores
simultaneously. Use 'execute_lock' always when the service is
MT unsafe.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 6990dc2..b37fc56 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -367,12 +371,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 08/10] service: identify service running on another core correctly
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (6 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 07/10] service: avoid race condition for MT unsafe service Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 09/10] service: optimize with c11 one-way barrier Phil Yang
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters and reduces the number of
instructions for all cases.

Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b37fc56..670f5a9 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -357,7 +357,7 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -371,7 +371,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -409,24 +409,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to indicate that the service
+	 * is running on a core.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -446,7 +436,7 @@ service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 09/10] service: optimize with c11 one-way barrier
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (7 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 08/10] service: identify service running on another core correctly Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 10/10] service: relax barriers with C11 atomic operations Phil Yang
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The num_mapped_cores and execute_lock are synchronized with rte_atomic_XX
APIs which is a full barrier, DMB, on aarch64. This patch optimized it with
c11 atomic one-way barrier.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 670f5a9..96a59b6 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -42,7 +42,7 @@ struct rte_service_spec_impl {
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	uint32_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +54,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	int32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -329,7 +329,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -372,11 +373,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		uint32_t expected = 0;
+		/* ACQUIRE ordering here is to prevent the callback
+		 * function from hoisting up before the execute_lock
+		 * setting.
+		 */
+		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
+			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		/* RELEASE ordering here is used to pair with ACQUIRE
+		 * above to achieve lock semantic.
+		 */
+		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -412,11 +422,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to indicate that the service
 	 * is running on a core.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
 
 	return ret;
 }
@@ -549,24 +559,32 @@ service_update(uint32_t sid, uint32_t lcore,
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		/* When multiple threads try to update the same lcore
+		 * service concurrently, e.g. set lcore map followed
+		 * by clear lcore map, the unsynchronized service_mask
+		 * values have issues on the num_mapped_cores value
+		 * consistency. So we use ACQUIRE ordering to pair with
+		 * the RELEASE ordering to synchronize the service_mask.
+		 */
+		uint64_t lcore_mapped = __atomic_load_n(
+					&lcore_states[lcore].service_mask,
+					__ATOMIC_ACQUIRE) & sid_mask;
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -622,7 +640,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -705,7 +724,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v2 10/10] service: relax barriers with C11 atomic operations
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (8 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 09/10] service: optimize with c11 one-way barrier Phil Yang
@ 2020-03-12  7:44   ` Phil Yang
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
  10 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-12  7:44 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

To guarantee the inter-threads visibility of the shareable domain, it
uses a lot of rte_smp_r/wmb in the service library. This patch relaxed
these barriers for service by using c11 atomic one-way barrier operations.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 45 ++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 96a59b6..9c02c24 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -176,9 +176,11 @@ rte_service_set_stats_enable(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_STATS_ENABLED;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_STATS_ENABLED,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_STATS_ENABLED);
+		__atomic_and_fetch(&s->internal_flags,
+			~(SERVICE_F_STATS_ENABLED), __ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -190,9 +192,11 @@ rte_service_set_runstate_mapped_check(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_START_CHECK;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_START_CHECK,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_START_CHECK);
+		__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_START_CHECK),
+			__ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -261,8 +265,8 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
-	rte_service_count++;
+	/* make sure the counter update after the state change. */
+	__atomic_add_fetch(&rte_service_count, 1, __ATOMIC_RELEASE);
 
 	if (id_ptr)
 		*id_ptr = free_slot;
@@ -278,9 +282,10 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
-	s->internal_flags &= ~(SERVICE_F_REGISTERED);
+	/* make sure the counter update before the state change. */
+	__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_REGISTERED),
+			   __ATOMIC_RELEASE);
 
 	/* clear the run-bit in all cores */
 	for (i = 0; i < RTE_MAX_LCORE; i++)
@@ -298,11 +303,12 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -313,11 +319,12 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -439,7 +446,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (__atomic_load_n(&cs->runstate,
+		    __ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -450,8 +458,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -660,9 +666,8 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+			__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
                     ` (9 preceding siblings ...)
  2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 10/10] service: relax barriers with C11 atomic operations Phil Yang
@ 2020-03-17  1:17   ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 01/12] doc: add generic atomic deprecation section Phil Yang
                       ` (13 more replies)
  10 siblings, 14 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

DPDK provides generic rte_atomic APIs to do several atomic operations.
These APIs are using the deprecated __sync built-ins and enforce full
memory barriers on aarch64. However, full barriers are not necessary
in many use cases. In order to address such use cases, C language offers
C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
by making use of the memory ordering parameter provided by the user.
Various patches submitted in the past [2] and the patches in this series
indicate significant performance gains on multiple aarch64 CPUs and no
performance loss on x86.

But the existing rte_atomic API implementations cannot be changed as the
APIs do not take the memory ordering parameter. The only choice available
is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
order to make this change, the following steps are proposed:

[1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
APIs (a script is added to flag the usages).
[2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

This patchset contains:
1) the checkpatch script changes to flag rte_atomic API usage in patches.
2) changes to programmer guide describing writing efficient code for aarch64.
3) changes to various libraries to make use of c11 atomic APIs.

We are planning to replicate this idea across all the other libraries,
drivers, examples, test applications. In the next phase, we will add
changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

v3:
add libatomic dependency for 32-bit clang

v2:
1. fix Clang '-Wincompatible-pointer-types' WARNING.
2. fix typos.

Honnappa Nagarahalli (2):
  service: avoid race condition for MT unsafe service
  service: identify service running on another core correctly

Phil Yang (10):
  doc: add generic atomic deprecation section
  devtools: prevent use of rte atomic APIs in future patches
  eal/build: add libatomic dependency for 32-bit clang
  build: remove redundant code
  vhost: optimize broadcast rarp sync with c11 atomic
  ipsec: optimize with c11 atomic for sa outbound sqn update
  service: remove rte prefix from static functions
  service: remove redundant code
  service: optimize with c11 one-way barrier
  service: relax barriers with C11 atomic operations

 devtools/checkpatches.sh                         |   9 ++
 doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
 drivers/event/octeontx/meson.build               |   5 -
 drivers/event/octeontx2/meson.build              |   5 -
 drivers/event/opdl/meson.build                   |   5 -
 lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
 lib/librte_eal/meson.build                       |   6 +
 lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
 lib/librte_ipsec/sa.h                            |   2 +-
 lib/librte_rcu/meson.build                       |   5 -
 lib/librte_vhost/vhost.h                         |   2 +-
 lib/librte_vhost/vhost_user.c                    |   7 +-
 lib/librte_vhost/virtio_net.c                    |  16 ++-
 13 files changed, 181 insertions(+), 119 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 01/12] doc: add generic atomic deprecation section
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 02/12] devtools: prevent use of rte atomic APIs in future patches Phil Yang
                       ` (12 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Add deprecating the generic rte_atomic_xx APIs to c11 atomic built-ins
guide and examples.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/prog_guide/writing_efficient_code.rst | 60 +++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/writing_efficient_code.rst b/doc/guides/prog_guide/writing_efficient_code.rst
index 849f63e..b278bc6 100644
--- a/doc/guides/prog_guide/writing_efficient_code.rst
+++ b/doc/guides/prog_guide/writing_efficient_code.rst
@@ -167,7 +167,13 @@ but with the added cost of lower throughput.
 Locks and Atomic Operations
 ---------------------------
 
-Atomic operations imply a lock prefix before the instruction,
+This section describes some key considerations when using locks and atomic
+operations in the DPDK environment.
+
+Locks
+~~~~~
+
+On x86, atomic operations imply a lock prefix before the instruction,
 causing the processor's LOCK# signal to be asserted during execution of the following instruction.
 This has a big impact on performance in a multicore environment.
 
@@ -176,6 +182,58 @@ It can often be replaced by other solutions like per-lcore variables.
 Also, some locking techniques are more efficient than others.
 For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
 
+Atomic Operations: Use C11 Atomic Built-ins
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+DPDK `generic rte_atomic <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_atomic.h>`_ operations are
+implemented by `__sync built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
+These __sync built-ins result in full barriers on aarch64, which are unnecessary
+in many use cases. They can be replaced by `__atomic built-ins <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_ that
+conform to the C11 memory model and provide finer memory order control.
+
+So replacing the rte_atomic operations with __atomic built-ins might improve
+performance for aarch64 machines. `More details <https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/StateofC11Code.pdf>`_.
+
+Some typical optimization cases are listed below:
+
+Atomicity
+^^^^^^^^^
+
+Some use cases require atomicity alone, the ordering of the memory operations
+does not matter. For example the packets statistics in the `vhost <https://github.com/DPDK/dpdk/blob/v20.02/examples/vhost/main.c#L796>`_ example application.
+
+It just updates the number of transmitted packets, no subsequent logic depends
+on these counters. So the RELAXED memory ordering is sufficient:
+
+.. code-block:: c
+
+    static __rte_always_inline void
+    virtio_xmit(struct vhost_dev *dst_vdev, struct vhost_dev *src_vdev,
+            struct rte_mbuf *m)
+    {
+        ...
+        ...
+        if (enable_stats) {
+            __atomic_add_fetch(&dst_vdev->stats.rx_total_atomic, 1, __ATOMIC_RELAXED);
+            __atomic_add_fetch(&dst_vdev->stats.rx_atomic, ret, __ATOMIC_RELAXED);
+            ...
+        }
+    }
+
+One-way Barrier
+^^^^^^^^^^^^^^^
+
+Some use cases allow for memory reordering in one way while requiring memory
+ordering in the other direction.
+
+For example, the memory operations before the `lock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L66>`_ can move to the
+critical section, but the memory operations in the critical section cannot move
+above the lock. In this case, the full memory barrier in the CAS operation can
+be replaced to ACQUIRE. On the other hand, the memory operations after the
+`unlock <https://github.com/DPDK/dpdk/blob/v20.02/lib/librte_eal/common/include/generic/rte_spinlock.h#L88>`_ can move to the critical section, but the memory operations in the
+critical section cannot move below the unlock. So the full barrier in the STORE
+operation can be replaced with RELEASE.
+
 Coding Considerations
 ---------------------
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 02/12] devtools: prevent use of rte atomic APIs in future patches
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 01/12] doc: add generic atomic deprecation section Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency for 32-bit clang Phil Yang
                       ` (11 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

In order to deprecate the rte_atomic APIs, prevent the patches
from using rte_atomic APIs.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 devtools/checkpatches.sh | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/devtools/checkpatches.sh b/devtools/checkpatches.sh
index 1794468..493f48e 100755
--- a/devtools/checkpatches.sh
+++ b/devtools/checkpatches.sh
@@ -61,6 +61,15 @@ check_forbidden_additions() { # <patch>
 		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
 		"$1" || res=1
 
+	# refrain from new additions of 16/32/64 bits rte_atomic_xxx()
+	# multiple folders and expressions are separated by spaces
+	awk -v FOLDERS="lib drivers app examples" \
+		-v EXPRESSIONS="rte_atomic[0-9][0-9]_.*\\\(" \
+		-v RET_ON_FAIL=1 \
+		-v MESSAGE='Using c11 atomic built-ins instead of rte_atomic' \
+		-f $(dirname $(readlink -f $0))/check-forbidden-tokens.awk \
+		"$1" || res=1
+
 	# svg figures must be included with wildcard extension
 	# because of png conversion for pdf docs
 	awk -v FOLDERS='doc' \
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency for 32-bit clang
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 01/12] doc: add generic atomic deprecation section Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 02/12] devtools: prevent use of rte atomic APIs in future patches Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 04/12] build: remove redundant code Phil Yang
                       ` (10 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

When compiling with clang on 32-bit platforms, we are missing copies
of 64-bit atomic functions. We can solve this by linking against
libatomic for the drivers and libs which need those atomic ops.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/meson.build | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/lib/librte_eal/meson.build b/lib/librte_eal/meson.build
index 4be5118..3b10eae 100644
--- a/lib/librte_eal/meson.build
+++ b/lib/librte_eal/meson.build
@@ -20,6 +20,12 @@ endif
 if cc.has_function('getentropy', prefix : '#include <unistd.h>')
 	cflags += '-DRTE_LIBEAL_USE_GETENTROPY'
 endif
+
+# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
+if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
+    ext_deps += cc.find_library('atomic')
+endif
+
 sources = common_sources + env_sources
 objs = common_objs + env_objs
 headers = common_headers + env_headers
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 04/12] build: remove redundant code
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (2 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency for 32-bit clang Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 05/12] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
                       ` (9 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

All these libs and drivers are built upon the eal lib. So when
compiling with clang on 32-bit platforms linking against libatomic
for the eal lib is sufficient. Remove the redundant code.

Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/event/octeontx/meson.build  | 5 -----
 drivers/event/octeontx2/meson.build | 5 -----
 drivers/event/opdl/meson.build      | 5 -----
 lib/librte_rcu/meson.build          | 5 -----
 4 files changed, 20 deletions(-)

diff --git a/drivers/event/octeontx/meson.build b/drivers/event/octeontx/meson.build
index 73118a4..2b74bb6 100644
--- a/drivers/event/octeontx/meson.build
+++ b/drivers/event/octeontx/meson.build
@@ -11,8 +11,3 @@ sources = files('ssovf_worker.c',
 )
 
 deps += ['common_octeontx', 'mempool_octeontx', 'bus_vdev', 'pmd_octeontx']
-
-# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
-if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
-	ext_deps += cc.find_library('atomic')
-endif
diff --git a/drivers/event/octeontx2/meson.build b/drivers/event/octeontx2/meson.build
index 56febb8..dfe8fc4 100644
--- a/drivers/event/octeontx2/meson.build
+++ b/drivers/event/octeontx2/meson.build
@@ -20,11 +20,6 @@ if not dpdk_conf.get('RTE_ARCH_64')
 	extra_flags += ['-Wno-int-to-pointer-cast', '-Wno-pointer-to-int-cast']
 endif
 
-# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
-if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
-	ext_deps += cc.find_library('atomic')
-endif
-
 foreach flag: extra_flags
 	if cc.has_argument(flag)
 		cflags += flag
diff --git a/drivers/event/opdl/meson.build b/drivers/event/opdl/meson.build
index e67b164..566462f 100644
--- a/drivers/event/opdl/meson.build
+++ b/drivers/event/opdl/meson.build
@@ -10,8 +10,3 @@ sources = files(
 	'opdl_test.c',
 )
 deps += ['bus_vdev']
-
-# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
-if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
-	ext_deps += cc.find_library('atomic')
-endif
diff --git a/lib/librte_rcu/meson.build b/lib/librte_rcu/meson.build
index 62920ba..0c2d5a2 100644
--- a/lib/librte_rcu/meson.build
+++ b/lib/librte_rcu/meson.build
@@ -5,8 +5,3 @@ allow_experimental_apis = true
 
 sources = files('rte_rcu_qsbr.c')
 headers = files('rte_rcu_qsbr.h')
-
-# for clang 32-bit compiles we need libatomic for 64-bit atomic ops
-if cc.get_id() == 'clang' and dpdk_conf.get('RTE_ARCH_64') == false
-	ext_deps += cc.find_library('atomic')
-endif
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 05/12] vhost: optimize broadcast rarp sync with c11 atomic
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (3 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 04/12] build: remove redundant code Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
                       ` (8 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The rarp packet broadcast flag is synchronized with rte_atomic_XX APIs
which is a full barrier, DMB, on aarch64. This patch optimized it with
c11 atomic one-way barrier.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
---
 lib/librte_vhost/vhost.h      |  2 +-
 lib/librte_vhost/vhost_user.c |  7 +++----
 lib/librte_vhost/virtio_net.c | 16 +++++++++-------
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 2087d14..0e22125 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -350,7 +350,7 @@ struct virtio_net {
 	uint32_t		flags;
 	uint16_t		vhost_hlen;
 	/* to tell if we need broadcast rarp packet */
-	rte_atomic16_t		broadcast_rarp;
+	int16_t			broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
 	int			extbuf;
diff --git a/lib/librte_vhost/vhost_user.c b/lib/librte_vhost/vhost_user.c
index bd1be01..857187d 100644
--- a/lib/librte_vhost/vhost_user.c
+++ b/lib/librte_vhost/vhost_user.c
@@ -2145,11 +2145,10 @@ vhost_user_send_rarp(struct virtio_net **pdev, struct VhostUserMsg *msg,
 	 * Set the flag to inject a RARP broadcast packet at
 	 * rte_vhost_dequeue_burst().
 	 *
-	 * rte_smp_wmb() is for making sure the mac is copied
-	 * before the flag is set.
+	 * __ATOMIC_RELEASE ordering is for making sure the mac is
+	 * copied before the flag is set.
 	 */
-	rte_smp_wmb();
-	rte_atomic16_set(&dev->broadcast_rarp, 1);
+	__atomic_store_n(&dev->broadcast_rarp, 1, __ATOMIC_RELEASE);
 	did = dev->vdpa_dev_id;
 	vdpa_dev = rte_vdpa_get_device(did);
 	if (vdpa_dev && vdpa_dev->ops->migration_done)
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 37c47c7..fa10deb 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -2203,6 +2203,7 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	struct virtio_net *dev;
 	struct rte_mbuf *rarp_mbuf = NULL;
 	struct vhost_virtqueue *vq;
+	int16_t success = 1;
 
 	dev = get_device(vid);
 	if (!dev)
@@ -2249,16 +2250,17 @@ rte_vhost_dequeue_burst(int vid, uint16_t queue_id,
 	 *
 	 * broadcast_rarp shares a cacheline in the virtio_net structure
 	 * with some fields that are accessed during enqueue and
-	 * rte_atomic16_cmpset() causes a write if using cmpxchg. This could
-	 * result in false sharing between enqueue and dequeue.
+	 * __atomic_compare_exchange_n causes a write if performed compare
+	 * and exchange. This could result in false sharing between enqueue
+	 * and dequeue.
 	 *
 	 * Prevent unnecessary false sharing by reading broadcast_rarp first
-	 * and only performing cmpset if the read indicates it is likely to
-	 * be set.
+	 * and only performing compare and exchange if the read indicates it
+	 * is likely to be set.
 	 */
-	if (unlikely(rte_atomic16_read(&dev->broadcast_rarp) &&
-			rte_atomic16_cmpset((volatile uint16_t *)
-				&dev->broadcast_rarp.cnt, 1, 0))) {
+	if (unlikely(__atomic_load_n(&dev->broadcast_rarp, __ATOMIC_ACQUIRE) &&
+			__atomic_compare_exchange_n(&dev->broadcast_rarp,
+			&success, 0, 0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))) {
 
 		rarp_mbuf = rte_net_make_rarp_packet(mbuf_pool, &dev->mac);
 		if (rarp_mbuf == NULL) {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (4 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 05/12] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-23 18:48       ` Ananyev, Konstantin
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions Phil Yang
                       ` (7 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

For SA outbound packets, rte_atomic64_add_return is used to generate
SQN atomically. This introduced an unnecessary full barrier by calling
the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
patch optimized it with c11 atomic and eliminated the expensive barrier
for aarch64.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_ipsec/ipsec_sqn.h | 3 ++-
 lib/librte_ipsec/sa.h        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
index 0c2f76a..e884af7 100644
--- a/lib/librte_ipsec/ipsec_sqn.h
+++ b/lib/librte_ipsec/ipsec_sqn.h
@@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
 
 	n = *num;
 	if (SQN_ATOMIC(sa))
-		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
+		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
+			__ATOMIC_RELAXED);
 	else {
 		sqn = sa->sqn.outb.raw + n;
 		sa->sqn.outb.raw = sqn;
diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
index d22451b..cab9a2e 100644
--- a/lib/librte_ipsec/sa.h
+++ b/lib/librte_ipsec/sa.h
@@ -120,7 +120,7 @@ struct rte_ipsec_sa {
 	 */
 	union {
 		union {
-			rte_atomic64_t atom;
+			uint64_t atom;
 			uint64_t raw;
 		} outb;
 		struct {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (5 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 08/12] service: remove redundant code Phil Yang
                       ` (6 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, stable

Fixes: 3cf5eb1546ed ("service: fix and refactor atomic service accesses")
Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index b0b78ba..2117726 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -336,7 +336,7 @@ rte_service_runstate_get(uint32_t id)
 }
 
 static inline void
-rte_service_runner_do_callback(struct rte_service_spec_impl *s,
+service_runner_do_callback(struct rte_service_spec_impl *s,
 			       struct core_state *cs, uint32_t service_idx)
 {
 	void *userdata = s->spec.callback_userdata;
@@ -379,10 +379,10 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 		rte_atomic32_clear(&s->execute_lock);
 	} else
-		rte_service_runner_do_callback(s, cs, i);
+		service_runner_do_callback(s, cs, i);
 
 	return 0;
 }
@@ -436,7 +436,7 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 }
 
 static int32_t
-rte_service_runner_func(void *arg)
+service_runner_func(void *arg)
 {
 	RTE_SET_USED(arg);
 	uint32_t i;
@@ -706,7 +706,7 @@ rte_service_lcore_start(uint32_t lcore)
 	 */
 	lcore_states[lcore].runstate = RUNSTATE_RUNNING;
 
-	int ret = rte_eal_remote_launch(rte_service_runner_func, 0, lcore);
+	int ret = rte_eal_remote_launch(service_runner_func, 0, lcore);
 	/* returns -EBUSY if the core is already launched, 0 on success */
 	return ret;
 }
@@ -785,7 +785,7 @@ rte_service_lcore_attr_get(uint32_t lcore, uint32_t attr_id,
 }
 
 static void
-rte_service_dump_one(FILE *f, struct rte_service_spec_impl *s,
+service_dump_one(FILE *f, struct rte_service_spec_impl *s,
 		     uint64_t all_cycles, uint32_t reset)
 {
 	/* avoid divide by zero */
@@ -818,7 +818,7 @@ rte_service_attr_reset_all(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	int reset = 1;
-	rte_service_dump_one(NULL, s, 0, reset);
+	service_dump_one(NULL, s, 0, reset);
 	return 0;
 }
 
@@ -876,7 +876,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 		fprintf(f, "Service %s Summary\n", s->spec.name);
 		uint32_t reset = 0;
-		rte_service_dump_one(f, s, total_cycles, reset);
+		service_dump_one(f, s, total_cycles, reset);
 		return 0;
 	}
 
@@ -886,7 +886,7 @@ rte_service_dump(FILE *f, uint32_t id)
 		if (!service_valid(i))
 			continue;
 		uint32_t reset = 0;
-		rte_service_dump_one(f, &rte_services[i], total_cycles, reset);
+		service_dump_one(f, &rte_services[i], total_cycles, reset);
 	}
 
 	fprintf(f, "Service Cores Summary\n");
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 08/12] service: remove redundant code
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (6 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service Phil Yang
                       ` (5 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Stable

The service id validation is verified in the calling function, remove
the redundant code inside the service_update function.

Fixes: 21698354c832 ("service: introduce service cores concept")
Cc: Stable@dpdk.org

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 2117726..557b5a9 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -552,21 +552,10 @@ rte_service_start_with_defaults(void)
 }
 
 static int32_t
-service_update(struct rte_service_spec *service, uint32_t lcore,
+service_update(uint32_t sid, uint32_t lcore,
 		uint32_t *set, uint32_t *enabled)
 {
-	uint32_t i;
-	int32_t sid = -1;
-
-	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
-		if ((struct rte_service_spec *)&rte_services[i] == service &&
-				service_valid(i)) {
-			sid = i;
-			break;
-		}
-	}
-
-	if (sid == -1 || lcore >= RTE_MAX_LCORE)
+	if (lcore >= RTE_MAX_LCORE)
 		return -EINVAL;
 
 	if (!lcore_states[lcore].is_service_core)
@@ -598,19 +587,23 @@ service_update(struct rte_service_spec *service, uint32_t lcore,
 int32_t
 rte_service_map_lcore_set(uint32_t id, uint32_t lcore, uint32_t enabled)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t on = enabled > 0;
-	return service_update(&s->spec, lcore, &on, 0);
+	return service_update(id, lcore, &on, 0);
 }
 
 int32_t
 rte_service_map_lcore_get(uint32_t id, uint32_t lcore)
 {
-	struct rte_service_spec_impl *s;
-	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
+	/* validate ID, or return error value */
+	if (id >= RTE_SERVICE_NUM_MAX || !service_valid(id))
+		return -EINVAL;
+
 	uint32_t enabled;
-	int ret = service_update(&s->spec, lcore, 0, &enabled);
+	int ret = service_update(id, lcore, 0, &enabled);
 	if (ret == 0)
 		return enabled;
 	return ret;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (7 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 08/12] service: remove redundant code Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly Phil Yang
                       ` (4 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

There has possible that a MT unsafe service might get configured to
run on another core while the service is running currently. This
might result in the MT unsafe service running on multiple cores
simultaneously. Use 'execute_lock' always when the service is
MT unsafe.

Fixes: e9139a32f6e8 ("service: add function to run on app lcore")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 557b5a9..32a2f8a 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -50,6 +50,10 @@ struct rte_service_spec_impl {
 	uint8_t internal_flags;
 
 	/* per service statistics */
+	/* Indicates how many cores the service is mapped to run on.
+	 * It does not indicate the number of cores the service is running
+	 * on currently.
+	 */
 	rte_atomic32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
@@ -370,12 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	/* check do we need cmpset, if MT safe or <= 1 core
-	 * mapped, atomic ops are not required.
-	 */
-	const int use_atomics = (service_mt_safe(s) == 0) &&
-				(rte_atomic32_read(&s->num_mapped_cores) > 1);
-	if (use_atomics) {
+	if (service_mt_safe(s) == 0) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (8 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier Phil Yang
                       ` (3 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd, Honnappa Nagarahalli,
	stable

From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

The logic to identify if the MT unsafe service is running on another
core can return -EBUSY spuriously. In such cases, running the service
becomes costlier than using atomic operations. Assume that the
application passes the right parameters and reduces the number of
instructions for all cases.

Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 32a2f8a..0843c3c 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -360,7 +360,7 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
 /* Expects the service 's' is valid. */
 static int32_t
 service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
-	    struct rte_service_spec_impl *s)
+	    struct rte_service_spec_impl *s, uint32_t serialize_mt_unsafe)
 {
 	if (!s)
 		return -EINVAL;
@@ -374,7 +374,7 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 
 	cs->service_active_on_lcore[i] = 1;
 
-	if (service_mt_safe(s) == 0) {
+	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
 		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
 			return -EBUSY;
 
@@ -412,24 +412,14 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
-	/* Atomically add this core to the mapped cores first, then examine if
-	 * we can run the service. This avoids a race condition between
-	 * checking the value, and atomically adding to the mapped count.
+	/* Increment num_mapped_cores to indicate that the service
+	 * is running on a core.
 	 */
-	if (serialize_mt_unsafe)
-		rte_atomic32_inc(&s->num_mapped_cores);
+	rte_atomic32_inc(&s->num_mapped_cores);
 
-	if (service_mt_safe(s) == 0 &&
-			rte_atomic32_read(&s->num_mapped_cores) > 1) {
-		if (serialize_mt_unsafe)
-			rte_atomic32_dec(&s->num_mapped_cores);
-		return -EBUSY;
-	}
-
-	int ret = service_run(id, cs, UINT64_MAX, s);
+	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	if (serialize_mt_unsafe)
-		rte_atomic32_dec(&s->num_mapped_cores);
+	rte_atomic32_dec(&s->num_mapped_cores);
 
 	return ret;
 }
@@ -449,7 +439,7 @@ service_runner_func(void *arg)
 			if (!service_valid(i))
 				continue;
 			/* return value ignored as no change to code flow */
-			service_run(i, cs, service_mask, service_get(i));
+			service_run(i, cs, service_mask, service_get(i), 1);
 		}
 
 		cs->loops++;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (9 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations Phil Yang
                       ` (2 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

The num_mapped_cores and execute_lock are synchronized with rte_atomic_XX
APIs which is a full barrier, DMB, on aarch64. This patch optimized it with
c11 atomic one-way barrier.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_eal/common/rte_service.c | 50 ++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index 0843c3c..c033224 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -42,7 +42,7 @@ struct rte_service_spec_impl {
 	 * running this service callback. When not set, a core may take the
 	 * lock and then run the service callback.
 	 */
-	rte_atomic32_t execute_lock;
+	uint32_t execute_lock;
 
 	/* API set/get-able variables */
 	int8_t app_runstate;
@@ -54,7 +54,7 @@ struct rte_service_spec_impl {
 	 * It does not indicate the number of cores the service is running
 	 * on currently.
 	 */
-	rte_atomic32_t num_mapped_cores;
+	int32_t num_mapped_cores;
 	uint64_t calls;
 	uint64_t cycles_spent;
 } __rte_cache_aligned;
@@ -332,7 +332,8 @@ rte_service_runstate_get(uint32_t id)
 	rte_smp_rmb();
 
 	int check_disabled = !(s->internal_flags & SERVICE_F_START_CHECK);
-	int lcore_mapped = (rte_atomic32_read(&s->num_mapped_cores) > 0);
+	int lcore_mapped = (__atomic_load_n(&s->num_mapped_cores,
+					    __ATOMIC_RELAXED) > 0);
 
 	return (s->app_runstate == RUNSTATE_RUNNING) &&
 		(s->comp_runstate == RUNSTATE_RUNNING) &&
@@ -375,11 +376,20 @@ service_run(uint32_t i, struct core_state *cs, uint64_t service_mask,
 	cs->service_active_on_lcore[i] = 1;
 
 	if ((service_mt_safe(s) == 0) && (serialize_mt_unsafe == 1)) {
-		if (!rte_atomic32_cmpset((uint32_t *)&s->execute_lock, 0, 1))
+		uint32_t expected = 0;
+		/* ACQUIRE ordering here is to prevent the callback
+		 * function from hoisting up before the execute_lock
+		 * setting.
+		 */
+		if (!__atomic_compare_exchange_n(&s->execute_lock, &expected, 1,
+			    0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
 			return -EBUSY;
 
 		service_runner_do_callback(s, cs, i);
-		rte_atomic32_clear(&s->execute_lock);
+		/* RELEASE ordering here is used to pair with ACQUIRE
+		 * above to achieve lock semantic.
+		 */
+		__atomic_store_n(&s->execute_lock, 0, __ATOMIC_RELEASE);
 	} else
 		service_runner_do_callback(s, cs, i);
 
@@ -415,11 +425,11 @@ rte_service_run_iter_on_app_lcore(uint32_t id, uint32_t serialize_mt_unsafe)
 	/* Increment num_mapped_cores to indicate that the service
 	 * is running on a core.
 	 */
-	rte_atomic32_inc(&s->num_mapped_cores);
+	__atomic_add_fetch(&s->num_mapped_cores, 1, __ATOMIC_ACQUIRE);
 
 	int ret = service_run(id, cs, UINT64_MAX, s, serialize_mt_unsafe);
 
-	rte_atomic32_dec(&s->num_mapped_cores);
+	__atomic_sub_fetch(&s->num_mapped_cores, 1, __ATOMIC_RELEASE);
 
 	return ret;
 }
@@ -552,24 +562,32 @@ service_update(uint32_t sid, uint32_t lcore,
 
 	uint64_t sid_mask = UINT64_C(1) << sid;
 	if (set) {
-		uint64_t lcore_mapped = lcore_states[lcore].service_mask &
-			sid_mask;
+		/* When multiple threads try to update the same lcore
+		 * service concurrently, e.g. set lcore map followed
+		 * by clear lcore map, the unsynchronized service_mask
+		 * values have issues on the num_mapped_cores value
+		 * consistency. So we use ACQUIRE ordering to pair with
+		 * the RELEASE ordering to synchronize the service_mask.
+		 */
+		uint64_t lcore_mapped = __atomic_load_n(
+					&lcore_states[lcore].service_mask,
+					__ATOMIC_ACQUIRE) & sid_mask;
 
 		if (*set && !lcore_mapped) {
 			lcore_states[lcore].service_mask |= sid_mask;
-			rte_atomic32_inc(&rte_services[sid].num_mapped_cores);
+			__atomic_add_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 		if (!*set && lcore_mapped) {
 			lcore_states[lcore].service_mask &= ~(sid_mask);
-			rte_atomic32_dec(&rte_services[sid].num_mapped_cores);
+			__atomic_sub_fetch(&rte_services[sid].num_mapped_cores,
+					    1, __ATOMIC_RELEASE);
 		}
 	}
 
 	if (enabled)
 		*enabled = !!(lcore_states[lcore].service_mask & (sid_mask));
 
-	rte_smp_wmb();
-
 	return 0;
 }
 
@@ -625,7 +643,8 @@ rte_service_lcore_reset_all(void)
 		}
 	}
 	for (i = 0; i < RTE_SERVICE_NUM_MAX; i++)
-		rte_atomic32_set(&rte_services[i].num_mapped_cores, 0);
+		__atomic_store_n(&rte_services[i].num_mapped_cores, 0,
+				    __ATOMIC_RELAXED);
 
 	rte_smp_wmb();
 
@@ -708,7 +727,8 @@ rte_service_lcore_stop(uint32_t lcore)
 		int32_t enabled = service_mask & (UINT64_C(1) << i);
 		int32_t service_running = rte_service_runstate_get(i);
 		int32_t only_core = (1 ==
-			rte_atomic32_read(&rte_services[i].num_mapped_cores));
+			__atomic_load_n(&rte_services[i].num_mapped_cores,
+					__ATOMIC_RELAXED));
 
 		/* if the core is mapped, and the service is running, and this
 		 * is the only core that is mapped, the service would cease to
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (10 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier Phil Yang
@ 2020-03-17  1:17     ` Phil Yang
  2020-03-18 14:01     ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Van Haaren, Harry
  2020-04-03  7:23     ` Mattias Rönnblom
  13 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-17  1:17 UTC (permalink / raw)
  To: thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

To guarantee the inter-threads visibility of the shareable domain, it
uses a lot of rte_smp_r/wmb in the service library. This patch relaxed
these barriers for service by using c11 atomic one-way barrier operations.

Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eal/common/rte_service.c | 45 ++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/lib/librte_eal/common/rte_service.c b/lib/librte_eal/common/rte_service.c
index c033224..d31663e 100644
--- a/lib/librte_eal/common/rte_service.c
+++ b/lib/librte_eal/common/rte_service.c
@@ -179,9 +179,11 @@ rte_service_set_stats_enable(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_STATS_ENABLED;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_STATS_ENABLED,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_STATS_ENABLED);
+		__atomic_and_fetch(&s->internal_flags,
+			~(SERVICE_F_STATS_ENABLED), __ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -193,9 +195,11 @@ rte_service_set_runstate_mapped_check(uint32_t id, int32_t enabled)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, 0);
 
 	if (enabled)
-		s->internal_flags |= SERVICE_F_START_CHECK;
+		__atomic_or_fetch(&s->internal_flags, SERVICE_F_START_CHECK,
+			__ATOMIC_RELEASE);
 	else
-		s->internal_flags &= ~(SERVICE_F_START_CHECK);
+		__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_START_CHECK),
+			__ATOMIC_RELEASE);
 
 	return 0;
 }
@@ -264,8 +268,8 @@ rte_service_component_register(const struct rte_service_spec *spec,
 	s->spec = *spec;
 	s->internal_flags |= SERVICE_F_REGISTERED | SERVICE_F_START_CHECK;
 
-	rte_smp_wmb();
-	rte_service_count++;
+	/* make sure the counter update after the state change. */
+	__atomic_add_fetch(&rte_service_count, 1, __ATOMIC_RELEASE);
 
 	if (id_ptr)
 		*id_ptr = free_slot;
@@ -281,9 +285,10 @@ rte_service_component_unregister(uint32_t id)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	rte_service_count--;
-	rte_smp_wmb();
 
-	s->internal_flags &= ~(SERVICE_F_REGISTERED);
+	/* make sure the counter update before the state change. */
+	__atomic_and_fetch(&s->internal_flags, ~(SERVICE_F_REGISTERED),
+			   __ATOMIC_RELEASE);
 
 	/* clear the run-bit in all cores */
 	for (i = 0; i < RTE_MAX_LCORE; i++)
@@ -301,11 +306,12 @@ rte_service_component_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->comp_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->comp_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->comp_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -316,11 +322,12 @@ rte_service_runstate_set(uint32_t id, uint32_t runstate)
 	SERVICE_VALID_GET_OR_ERR_RET(id, s, -EINVAL);
 
 	if (runstate)
-		s->app_runstate = RUNSTATE_RUNNING;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_RUNNING,
+				__ATOMIC_RELEASE);
 	else
-		s->app_runstate = RUNSTATE_STOPPED;
+		__atomic_store_n(&s->app_runstate, RUNSTATE_STOPPED,
+				__ATOMIC_RELEASE);
 
-	rte_smp_wmb();
 	return 0;
 }
 
@@ -442,7 +449,8 @@ service_runner_func(void *arg)
 	const int lcore = rte_lcore_id();
 	struct core_state *cs = &lcore_states[lcore];
 
-	while (lcore_states[lcore].runstate == RUNSTATE_RUNNING) {
+	while (__atomic_load_n(&cs->runstate,
+		    __ATOMIC_ACQUIRE) == RUNSTATE_RUNNING) {
 		const uint64_t service_mask = cs->service_mask;
 
 		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
@@ -453,8 +461,6 @@ service_runner_func(void *arg)
 		}
 
 		cs->loops++;
-
-		rte_smp_rmb();
 	}
 
 	lcore_config[lcore].state = WAIT;
@@ -663,9 +669,8 @@ rte_service_lcore_add(uint32_t lcore)
 
 	/* ensure that after adding a core the mask and state are defaults */
 	lcore_states[lcore].service_mask = 0;
-	lcore_states[lcore].runstate = RUNSTATE_STOPPED;
-
-	rte_smp_wmb();
+	__atomic_store_n(&lcore_states[lcore].runstate, RUNSTATE_STOPPED,
+			__ATOMIC_RELEASE);
 
 	return rte_eal_wait_lcore(lcore);
 }
-- 
2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (11 preceding siblings ...)
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations Phil Yang
@ 2020-03-18 14:01     ` Van Haaren, Harry
  2020-03-18 15:13       ` Thomas Monjalon
  2020-03-20  4:51       ` Honnappa Nagarahalli
  2020-04-03  7:23     ` Mattias Rönnblom
  13 siblings, 2 replies; 52+ messages in thread
From: Van Haaren, Harry @ 2020-03-18 14:01 UTC (permalink / raw)
  To: Phil Yang, thomas, Ananyev, Konstantin, stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Hi Phil & Honnappa,

> -----Original Message-----
> From: Phil Yang <phil.yang@arm.com>
> Sent: Tuesday, March 17, 2020 1:18 AM
> To: thomas@monjalon.net; Van Haaren, Harry <harry.van.haaren@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; ruifeng.wang@arm.com;
> joyce.kong@arm.com; nd@arm.com
> Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> 
> DPDK provides generic rte_atomic APIs to do several atomic operations.
> These APIs are using the deprecated __sync built-ins and enforce full
> memory barriers on aarch64. However, full barriers are not necessary
> in many use cases. In order to address such use cases, C language offers
> C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> by making use of the memory ordering parameter provided by the user.
> Various patches submitted in the past [2] and the patches in this series
> indicate significant performance gains on multiple aarch64 CPUs and no
> performance loss on x86.
> 
> But the existing rte_atomic API implementations cannot be changed as the
> APIs do not take the memory ordering parameter. The only choice available
> is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> order to make this change, the following steps are proposed:
> 
> [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> APIs (a script is added to flag the usages).
> [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.

On [1] above, I feel deprecating DPDKs atomic functions and failing checkpatch is
a bit sudden. Perhaps noting that in a future release (20.11?) DPDK will move to a
C11 based atomics model is a more gradual step to achieving the goal, and at that
point add a checkpatch warning for additions of rte_atomic*?

More on [2] in context below.

The above is my point-of-view, of course I'd like more people from the DPDK community
to provide their input too.


> This patchset contains:
> 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> 2) changes to programmer guide describing writing efficient code for aarch64.
> 3) changes to various libraries to make use of c11 atomic APIs.
> 
> We are planning to replicate this idea across all the other libraries,
> drivers, examples, test applications. In the next phase, we will add
> changes to the mbuf, the EAL interrupts and the event timer adapter libraries.

About ~6/12 patches of this C11 set are targeting the Service Cores area of DPDK. I have some concerns
over increased complexity of C11 implementation vs the (already complex) rte_atomic implementation today.
I see other patchsets enabling C11 across other DPDK components, so maybe we should also discuss C11
enabling in a wider context that just service cores?

I don't think it fair to expect all developers to be well versed in C11 atomic semantics like
understanding the complex interactions between the various C11 RELEASE, AQUIRE barriers requires.

As maintainer of Service Cores I'm hesitant to accept the large-scale refactor of atomic-implementation,
as it could lead to racey bugs that are likely extremely difficult to track down. (The recent race-on-exit
has proven the difficulty in reproducing, and that's with an atomics model I'm quite familiar with).

Let me be very clear: I don't wish to block a C11 atomic implementation, but I'd like to discuss how we
(DPDK community) can best port-to and maintain a complex multi-threaded service library with best-in-class
performance for the workload.

To put some discussions/solutions on the table:
- Shared Maintainership of a component?
     Split in functionality and C11 atomics implementation
     Obviously there would be collaboration required in such a case.
- Maybe shared maintainership is too much?
     A gentlemans/womans agreement of "helping out" with C11 atomics debug is enough?


Hope my concerns are understandable, and of course input/feedback welcomed! -Harry


PS: Apologies for the delay in reply - was OOO on Irish national holiday.


> v3:
> add libatomic dependency for 32-bit clang
> 
> v2:
> 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> 2. fix typos.
> 
> Honnappa Nagarahalli (2):
>   service: avoid race condition for MT unsafe service
>   service: identify service running on another core correctly
> 
> Phil Yang (10):
>   doc: add generic atomic deprecation section
>   devtools: prevent use of rte atomic APIs in future patches
>   eal/build: add libatomic dependency for 32-bit clang
>   build: remove redundant code
>   vhost: optimize broadcast rarp sync with c11 atomic
>   ipsec: optimize with c11 atomic for sa outbound sqn update
>   service: remove rte prefix from static functions
>   service: remove redundant code
>   service: optimize with c11 one-way barrier
>   service: relax barriers with C11 atomic operations
> 
>  devtools/checkpatches.sh                         |   9 ++
>  doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
>  drivers/event/octeontx/meson.build               |   5 -
>  drivers/event/octeontx2/meson.build              |   5 -
>  drivers/event/opdl/meson.build                   |   5 -
>  lib/librte_eal/common/rte_service.c              | 175 ++++++++++++----------
> -
>  lib/librte_eal/meson.build                       |   6 +
>  lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
>  lib/librte_ipsec/sa.h                            |   2 +-
>  lib/librte_rcu/meson.build                       |   5 -
>  lib/librte_vhost/vhost.h                         |   2 +-
>  lib/librte_vhost/vhost_user.c                    |   7 +-
>  lib/librte_vhost/virtio_net.c                    |  16 ++-
>  13 files changed, 181 insertions(+), 119 deletions(-)
> 
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-18 14:01     ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Van Haaren, Harry
@ 2020-03-18 15:13       ` Thomas Monjalon
  2020-03-20  5:01         ` Honnappa Nagarahalli
  2020-03-20  4:51       ` Honnappa Nagarahalli
  1 sibling, 1 reply; 52+ messages in thread
From: Thomas Monjalon @ 2020-03-18 15:13 UTC (permalink / raw)
  To: Phil Yang, Van Haaren, Harry, Honnappa.Nagarahalli
  Cc: Ananyev, Konstantin, stephen, maxime.coquelin, dev,
	david.marchand, jerinj, hemant.agrawal, gavin.hu, ruifeng.wang,
	joyce.kong, nd

18/03/2020 15:01, Van Haaren, Harry:
> Hi Phil & Honnappa,
> 
> From: Phil Yang <phil.yang@arm.com>
> > 
> > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > These APIs are using the deprecated __sync built-ins and enforce full
> > memory barriers on aarch64. However, full barriers are not necessary
> > in many use cases. In order to address such use cases, C language offers
> > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> > by making use of the memory ordering parameter provided by the user.
> > Various patches submitted in the past [2] and the patches in this series
> > indicate significant performance gains on multiple aarch64 CPUs and no
> > performance loss on x86.
> > 
> > But the existing rte_atomic API implementations cannot be changed as the
> > APIs do not take the memory ordering parameter. The only choice available
> > is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> > order to make this change, the following steps are proposed:
> > 
> > [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> > APIs (a script is added to flag the usages).
> > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> 
> On [1] above, I feel deprecating DPDKs atomic functions and failing checkpatch is
> a bit sudden. Perhaps noting that in a future release (20.11?) DPDK will move to a
> C11 based atomics model is a more gradual step to achieving the goal, and at that
> point add a checkpatch warning for additions of rte_atomic*?
> 
> More on [2] in context below.
> 
> The above is my point-of-view, of course I'd like more people from the DPDK community
> to provide their input too.
> 
> 
> > This patchset contains:
> > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > 2) changes to programmer guide describing writing efficient code for aarch64.
> > 3) changes to various libraries to make use of c11 atomic APIs.
> > 
> > We are planning to replicate this idea across all the other libraries,
> > drivers, examples, test applications. In the next phase, we will add
> > changes to the mbuf, the EAL interrupts and the event timer adapter libraries.
> 
> About ~6/12 patches of this C11 set are targeting the Service Cores area of DPDK. I have some concerns
> over increased complexity of C11 implementation vs the (already complex) rte_atomic implementation today.
> I see other patchsets enabling C11 across other DPDK components, so maybe we should also discuss C11
> enabling in a wider context that just service cores?
> 
> I don't think it fair to expect all developers to be well versed in C11 atomic semantics like
> understanding the complex interactions between the various C11 RELEASE, AQUIRE barriers requires.
> 
> As maintainer of Service Cores I'm hesitant to accept the large-scale refactor of atomic-implementation,
> as it could lead to racey bugs that are likely extremely difficult to track down. (The recent race-on-exit
> has proven the difficulty in reproducing, and that's with an atomics model I'm quite familiar with).
> 
> Let me be very clear: I don't wish to block a C11 atomic implementation, but I'd like to discuss how we
> (DPDK community) can best port-to and maintain a complex multi-threaded service library with best-in-class
> performance for the workload.
> 
> To put some discussions/solutions on the table:
> - Shared Maintainership of a component?
>      Split in functionality and C11 atomics implementation
>      Obviously there would be collaboration required in such a case.
> - Maybe shared maintainership is too much?
>      A gentlemans/womans agreement of "helping out" with C11 atomics debug is enough?
> 
> 
> Hope my concerns are understandable, and of course input/feedback welcomed! -Harry

Thanks for raising the issue Harry.

I think we should have at least two official maintainers for C11 atomics in general.
C11 conversion is a progressive effort being done, and should be merged step by step.
If C11 maintainers fail to fix some issues on time, then we can hold the effort.
Does it make sense?



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-18 14:01     ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Van Haaren, Harry
  2020-03-18 15:13       ` Thomas Monjalon
@ 2020-03-20  4:51       ` Honnappa Nagarahalli
  2020-03-20 18:32         ` Honnappa Nagarahalli
  1 sibling, 1 reply; 52+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-20  4:51 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>

> > Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> >
> > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > These APIs are using the deprecated __sync built-ins and enforce full
> > memory barriers on aarch64. However, full barriers are not necessary
> > in many use cases. In order to address such use cases, C language
> > offers
> > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > control by making use of the memory ordering parameter provided by the
> user.
> > Various patches submitted in the past [2] and the patches in this
> > series indicate significant performance gains on multiple aarch64 CPUs
> > and no performance loss on x86.
> >
> > But the existing rte_atomic API implementations cannot be changed as
> > the APIs do not take the memory ordering parameter. The only choice
> > available is replacing the usage of the rte_atomic APIs with C11
> > atomic APIs. In order to make this change, the following steps are proposed:
> >
> > [1] deprecate rte_atomic APIs so that future patches do not use
> > rte_atomic APIs (a script is added to flag the usages).
> > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> 
> On [1] above, I feel deprecating DPDKs atomic functions and failing checkpatch
> is a bit sudden. Perhaps noting that in a future release (20.11?) DPDK will
> move to a
> C11 based atomics model is a more gradual step to achieving the goal, and at
> that point add a checkpatch warning for additions of rte_atomic*?
We have been working on changing existing usages of rte_atomic APIs in DPDK to use C11 atomics. Usually, the x.11 releases have significant amount of changes (not sure how many would use rte_atomic APIs). I would prefer that in 20.11 no additional code is added using rte_atomics APIs. However, I am open to suggestions on the exact time frame.
Once we decide on the release, I think it makes sense to add a 'warning' in the checkpatch to indicate the deprecation timeline and add an 'error' after the release.

> 
> More on [2] in context below.
> 
> The above is my point-of-view, of course I'd like more people from the DPDK
> community to provide their input too.
> 
> 
> > This patchset contains:
> > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > 2) changes to programmer guide describing writing efficient code for
> aarch64.
> > 3) changes to various libraries to make use of c11 atomic APIs.
> >
> > We are planning to replicate this idea across all the other libraries,
> > drivers, examples, test applications. In the next phase, we will add
> > changes to the mbuf, the EAL interrupts and the event timer adapter
> libraries.
> 
> About ~6/12 patches of this C11 set are targeting the Service Cores area of
> DPDK. I have some concerns over increased complexity of C11 implementation
> vs the (already complex) rte_atomic implementation today.
I agree that it C11 changes are complex, especially if one is starting out to understand what these APIs provide. From my experience, once few underlying concepts are understood, reviewing or making changes do not take too much time.

> I see other patchsets enabling C11 across other DPDK components, so maybe
> we should also discuss C11 enabling in a wider context that just service cores?
Yes, agree. We are in the process of making changes to other areas as well.

> 
> I don't think it fair to expect all developers to be well versed in C11 atomic
> semantics like understanding the complex interactions between the various
> C11 RELEASE, AQUIRE barriers requires.
C11 has been around from sometime now. To my surprise, OVS already uses C11 APIs extensively. VPP has been accepting C11 related changes from past couple of years. Having said that, I agree in general that not everyone is well versed.

> 
> As maintainer of Service Cores I'm hesitant to accept the large-scale refactor
Right now, the patches are split into multiple commits. If required I can host a call to go over simple C11 API usages (sufficient to cover the usage in service core) and the changes in this patch. If you find that particular areas need more understanding I can work on providing additional information such as memory order ladder diagrams. Please let me know what you think.

> of atomic-implementation, as it could lead to racey bugs that are likely
> extremely difficult to track down. (The recent race-on-exit has proven the
> difficulty in reproducing, and that's with an atomics model I'm quite familiar
> with).
> 
> Let me be very clear: I don't wish to block a C11 atomic implementation, but
> I'd like to discuss how we (DPDK community) can best port-to and maintain a
> complex multi-threaded service library with best-in-class performance for the
> workload.
> 
> To put some discussions/solutions on the table:
> - Shared Maintainership of a component?
>      Split in functionality and C11 atomics implementation
>      Obviously there would be collaboration required in such a case.
> - Maybe shared maintainership is too much?
>      A gentlemans/womans agreement of "helping out" with C11 atomics debug
> is enough?
I think shared maintainer ship could be too much as there are many changes. But, I and other engineers from Arm (I would include Arm ecosystem as well) can definitely help out on debug and reviews involving C11 APIs. (I see Thomas's reply on this topic).

> 
> 
> Hope my concerns are understandable, and of course input/feedback
> welcomed! -Harry
No worries at all. We are here to help and make this as easy as possible.

> 
> 
> PS: Apologies for the delay in reply - was OOO on Irish national holiday.
Same here, was on vacation for 3 days.

> 
> 
> > v3:
> > add libatomic dependency for 32-bit clang
> >
> > v2:
> > 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> > 2. fix typos.
> >
> > Honnappa Nagarahalli (2):
> >   service: avoid race condition for MT unsafe service
> >   service: identify service running on another core correctly
> >
> > Phil Yang (10):
> >   doc: add generic atomic deprecation section
> >   devtools: prevent use of rte atomic APIs in future patches
> >   eal/build: add libatomic dependency for 32-bit clang
> >   build: remove redundant code
> >   vhost: optimize broadcast rarp sync with c11 atomic
> >   ipsec: optimize with c11 atomic for sa outbound sqn update
> >   service: remove rte prefix from static functions
> >   service: remove redundant code
> >   service: optimize with c11 one-way barrier
> >   service: relax barriers with C11 atomic operations
> >
> >  devtools/checkpatches.sh                         |   9 ++
> >  doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
> >  drivers/event/octeontx/meson.build               |   5 -
> >  drivers/event/octeontx2/meson.build              |   5 -
> >  drivers/event/opdl/meson.build                   |   5 -
> >  lib/librte_eal/common/rte_service.c              | 175 ++++++++++++----------
> > -
> >  lib/librte_eal/meson.build                       |   6 +
> >  lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
> >  lib/librte_ipsec/sa.h                            |   2 +-
> >  lib/librte_rcu/meson.build                       |   5 -
> >  lib/librte_vhost/vhost.h                         |   2 +-
> >  lib/librte_vhost/vhost_user.c                    |   7 +-
> >  lib/librte_vhost/virtio_net.c                    |  16 ++-
> >  13 files changed, 181 insertions(+), 119 deletions(-)
> >
> > --
> > 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-18 15:13       ` Thomas Monjalon
@ 2020-03-20  5:01         ` Honnappa Nagarahalli
  2020-03-20 12:20           ` Jerin Jacob
  0 siblings, 1 reply; 52+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-20  5:01 UTC (permalink / raw)
  To: thomas, Phil Yang, Van Haaren, Harry
  Cc: Ananyev, Konstantin, stephen, maxime.coquelin, dev,
	david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>

> Subject: Re: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> 
> 18/03/2020 15:01, Van Haaren, Harry:
> > Hi Phil & Honnappa,
> >
> > From: Phil Yang <phil.yang@arm.com>
> > >
> > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > These APIs are using the deprecated __sync built-ins and enforce
> > > full memory barriers on aarch64. However, full barriers are not
> > > necessary in many use cases. In order to address such use cases, C
> > > language offers
> > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > control by making use of the memory ordering parameter provided by the
> user.
> > > Various patches submitted in the past [2] and the patches in this
> > > series indicate significant performance gains on multiple aarch64
> > > CPUs and no performance loss on x86.
> > >
> > > But the existing rte_atomic API implementations cannot be changed as
> > > the APIs do not take the memory ordering parameter. The only choice
> > > available is replacing the usage of the rte_atomic APIs with C11
> > > atomic APIs. In order to make this change, the following steps are
> proposed:
> > >
> > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > rte_atomic APIs (a script is added to flag the usages).
> > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> >
> > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > checkpatch is a bit sudden. Perhaps noting that in a future release
> > (20.11?) DPDK will move to a
> > C11 based atomics model is a more gradual step to achieving the goal,
> > and at that point add a checkpatch warning for additions of rte_atomic*?
> >
> > More on [2] in context below.
> >
> > The above is my point-of-view, of course I'd like more people from the
> > DPDK community to provide their input too.
> >
> >
> > > This patchset contains:
> > > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > > 2) changes to programmer guide describing writing efficient code for
> aarch64.
> > > 3) changes to various libraries to make use of c11 atomic APIs.
> > >
> > > We are planning to replicate this idea across all the other
> > > libraries, drivers, examples, test applications. In the next phase,
> > > we will add changes to the mbuf, the EAL interrupts and the event timer
> adapter libraries.
> >
> > About ~6/12 patches of this C11 set are targeting the Service Cores
> > area of DPDK. I have some concerns over increased complexity of C11
> implementation vs the (already complex) rte_atomic implementation today.
> > I see other patchsets enabling C11 across other DPDK components, so
> > maybe we should also discuss C11 enabling in a wider context that just
> service cores?
> >
> > I don't think it fair to expect all developers to be well versed in
> > C11 atomic semantics like understanding the complex interactions between
> the various C11 RELEASE, AQUIRE barriers requires.
> >
> > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > refactor of atomic-implementation, as it could lead to racey bugs that
> > are likely extremely difficult to track down. (The recent race-on-exit has
> proven the difficulty in reproducing, and that's with an atomics model I'm
> quite familiar with).
> >
> > Let me be very clear: I don't wish to block a C11 atomic
> > implementation, but I'd like to discuss how we (DPDK community) can
> > best port-to and maintain a complex multi-threaded service library with
> best-in-class performance for the workload.
> >
> > To put some discussions/solutions on the table:
> > - Shared Maintainership of a component?
> >      Split in functionality and C11 atomics implementation
> >      Obviously there would be collaboration required in such a case.
> > - Maybe shared maintainership is too much?
> >      A gentlemans/womans agreement of "helping out" with C11 atomics
> debug is enough?
> >
> >
> > Hope my concerns are understandable, and of course input/feedback
> > welcomed! -Harry
> 
> Thanks for raising the issue Harry.
> 
> I think we should have at least two official maintainers for C11 atomics in
> general.
Sure, I can volunteer.

> C11 conversion is a progressive effort being done, and should be merged step
> by step.
Agree, the changes need to be understood, it is not a search and replace effort. The changes will come-in in stages unless others join the effort.
The concern I have is about the new patches that get added. I think we need to stop the new patches from using rte_atomic APIs, otherwise we might be making these changes forever.

> If C11 maintainers fail to fix some issues on time, then we can hold the effort.
> Does it make sense?
I am fine with this approach. But, I think we need to have a deadline in mind to complete the work.

> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-20  5:01         ` Honnappa Nagarahalli
@ 2020-03-20 12:20           ` Jerin Jacob
  0 siblings, 0 replies; 52+ messages in thread
From: Jerin Jacob @ 2020-03-20 12:20 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: thomas, Phil Yang, Van Haaren, Harry, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev, david.marchand, jerinj,
	hemant.agrawal, Gavin Hu, Ruifeng Wang, Joyce Kong, nd

On Fri, Mar 20, 2020 at 10:31 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> > Subject: Re: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> >
> > 18/03/2020 15:01, Van Haaren, Harry:
> > > Hi Phil & Honnappa,
> > >
> > > From: Phil Yang <phil.yang@arm.com>
> > > >
> > > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > > These APIs are using the deprecated __sync built-ins and enforce
> > > > full memory barriers on aarch64. However, full barriers are not
> > > > necessary in many use cases. In order to address such use cases, C
> > > > language offers
> > > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > > control by making use of the memory ordering parameter provided by the
> > user.
> > > > Various patches submitted in the past [2] and the patches in this
> > > > series indicate significant performance gains on multiple aarch64
> > > > CPUs and no performance loss on x86.
> > > >
> > > > But the existing rte_atomic API implementations cannot be changed as
> > > > the APIs do not take the memory ordering parameter. The only choice
> > > > available is replacing the usage of the rte_atomic APIs with C11
> > > > atomic APIs. In order to make this change, the following steps are
> > proposed:
> > > >
> > > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > > rte_atomic APIs (a script is added to flag the usages).
> > > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> > >
> > > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > > checkpatch is a bit sudden. Perhaps noting that in a future release
> > > (20.11?) DPDK will move to a
> > > C11 based atomics model is a more gradual step to achieving the goal,
> > > and at that point add a checkpatch warning for additions of rte_atomic*?
> > >
> > > More on [2] in context below.
> > >
> > > The above is my point-of-view, of course I'd like more people from the
> > > DPDK community to provide their input too.
> > >
> > >
> > > > This patchset contains:
> > > > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > > > 2) changes to programmer guide describing writing efficient code for
> > aarch64.
> > > > 3) changes to various libraries to make use of c11 atomic APIs.
> > > >
> > > > We are planning to replicate this idea across all the other
> > > > libraries, drivers, examples, test applications. In the next phase,
> > > > we will add changes to the mbuf, the EAL interrupts and the event timer
> > adapter libraries.
> > >
> > > About ~6/12 patches of this C11 set are targeting the Service Cores
> > > area of DPDK. I have some concerns over increased complexity of C11
> > implementation vs the (already complex) rte_atomic implementation today.
> > > I see other patchsets enabling C11 across other DPDK components, so
> > > maybe we should also discuss C11 enabling in a wider context that just
> > service cores?
> > >
> > > I don't think it fair to expect all developers to be well versed in
> > > C11 atomic semantics like understanding the complex interactions between
> > the various C11 RELEASE, AQUIRE barriers requires.
> > >
> > > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > > refactor of atomic-implementation, as it could lead to racey bugs that
> > > are likely extremely difficult to track down. (The recent race-on-exit has
> > proven the difficulty in reproducing, and that's with an atomics model I'm
> > quite familiar with).
> > >
> > > Let me be very clear: I don't wish to block a C11 atomic
> > > implementation, but I'd like to discuss how we (DPDK community) can
> > > best port-to and maintain a complex multi-threaded service library with
> > best-in-class performance for the workload.
> > >
> > > To put some discussions/solutions on the table:
> > > - Shared Maintainership of a component?
> > >      Split in functionality and C11 atomics implementation
> > >      Obviously there would be collaboration required in such a case.
> > > - Maybe shared maintainership is too much?
> > >      A gentlemans/womans agreement of "helping out" with C11 atomics
> > debug is enough?
> > >
> > >
> > > Hope my concerns are understandable, and of course input/feedback
> > > welcomed! -Harry
> >
> > Thanks for raising the issue Harry.
> >
> > I think we should have at least two official maintainers for C11 atomics in
> > general.
> Sure, I can volunteer.
>
> > C11 conversion is a progressive effort being done, and should be merged step
> > by step.
> Agree, the changes need to be understood, it is not a search and replace effort. The changes will come-in in stages unless others join the effort.
> The concern I have is about the new patches that get added. I think we need to stop the new patches from using rte_atomic APIs, otherwise we might be making these changes forever.

+1. We must define a time frame otherwise we might be making these
changes forever.

>
> > If C11 maintainers fail to fix some issues on time, then we can hold the effort.
> > Does it make sense?
> I am fine with this approach. But, I think we need to have a deadline in mind to complete the work.
>
> >
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-20  4:51       ` Honnappa Nagarahalli
@ 2020-03-20 18:32         ` Honnappa Nagarahalli
  2020-03-27 14:47           ` Van Haaren, Harry
  0 siblings, 1 reply; 52+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-20 18:32 UTC (permalink / raw)
  To: Van Haaren, Harry, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, Carrillo, Erik G, nd, Honnappa Nagarahalli, nd

+ Erik as there are similar changes to timer library

<snip>

> 
> > > Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> > >
> > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > These APIs are using the deprecated __sync built-ins and enforce
> > > full memory barriers on aarch64. However, full barriers are not
> > > necessary in many use cases. In order to address such use cases, C
> > > language offers
> > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > control by making use of the memory ordering parameter provided by
> > > the
> > user.
> > > Various patches submitted in the past [2] and the patches in this
> > > series indicate significant performance gains on multiple aarch64
> > > CPUs and no performance loss on x86.
> > >
> > > But the existing rte_atomic API implementations cannot be changed as
> > > the APIs do not take the memory ordering parameter. The only choice
> > > available is replacing the usage of the rte_atomic APIs with C11
> > > atomic APIs. In order to make this change, the following steps are
> proposed:
> > >
> > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > rte_atomic APIs (a script is added to flag the usages).
> > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
> >
> > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > checkpatch is a bit sudden. Perhaps noting that in a future release
> > (20.11?) DPDK will move to a
> > C11 based atomics model is a more gradual step to achieving the goal,
> > and at that point add a checkpatch warning for additions of rte_atomic*?
> We have been working on changing existing usages of rte_atomic APIs in DPDK
> to use C11 atomics. Usually, the x.11 releases have significant amount of
> changes (not sure how many would use rte_atomic APIs). I would prefer that
> in 20.11 no additional code is added using rte_atomics APIs. However, I am
> open to suggestions on the exact time frame.
> Once we decide on the release, I think it makes sense to add a 'warning' in the
> checkpatch to indicate the deprecation timeline and add an 'error' after the
> release.
> 
> >
> > More on [2] in context below.
> >
> > The above is my point-of-view, of course I'd like more people from the
> > DPDK community to provide their input too.
> >
> >
> > > This patchset contains:
> > > 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> > > 2) changes to programmer guide describing writing efficient code for
> > aarch64.
> > > 3) changes to various libraries to make use of c11 atomic APIs.
> > >
> > > We are planning to replicate this idea across all the other
> > > libraries, drivers, examples, test applications. In the next phase,
> > > we will add changes to the mbuf, the EAL interrupts and the event
> > > timer adapter
> > libraries.
> >
> > About ~6/12 patches of this C11 set are targeting the Service Cores
> > area of DPDK. I have some concerns over increased complexity of C11
> > implementation vs the (already complex) rte_atomic implementation today.
> I agree that it C11 changes are complex, especially if one is starting out to
> understand what these APIs provide. From my experience, once few
> underlying concepts are understood, reviewing or making changes do not take
> too much time.
> 
> > I see other patchsets enabling C11 across other DPDK components, so
> > maybe we should also discuss C11 enabling in a wider context that just
> service cores?
> Yes, agree. We are in the process of making changes to other areas as well.
> 
> >
> > I don't think it fair to expect all developers to be well versed in
> > C11 atomic semantics like understanding the complex interactions
> > between the various
> > C11 RELEASE, AQUIRE barriers requires.
> C11 has been around from sometime now. To my surprise, OVS already uses
> C11 APIs extensively. VPP has been accepting C11 related changes from past
> couple of years. Having said that, I agree in general that not everyone is well
> versed.
> 
> >
> > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > refactor
> Right now, the patches are split into multiple commits. If required I can host a
> call to go over simple C11 API usages (sufficient to cover the usage in service
> core) and the changes in this patch. If you find that particular areas need more
> understanding I can work on providing additional information such as memory
> order ladder diagrams. Please let me know what you think.
When I started working with C11 APIs, I had referred to the following blogs.
https://preshing.com/20120913/acquire-and-release-semantics/
https://preshing.com/20130702/the-happens-before-relation/
https://preshing.com/20130823/the-synchronizes-with-relation/

These will be helpful to understand the changes.

> 
> > of atomic-implementation, as it could lead to racey bugs that are
> > likely extremely difficult to track down. (The recent race-on-exit has
> > proven the difficulty in reproducing, and that's with an atomics model
> > I'm quite familiar with).
> >
> > Let me be very clear: I don't wish to block a C11 atomic
> > implementation, but I'd like to discuss how we (DPDK community) can
> > best port-to and maintain a complex multi-threaded service library
> > with best-in-class performance for the workload.
> >
> > To put some discussions/solutions on the table:
> > - Shared Maintainership of a component?
> >      Split in functionality and C11 atomics implementation
> >      Obviously there would be collaboration required in such a case.
> > - Maybe shared maintainership is too much?
> >      A gentlemans/womans agreement of "helping out" with C11 atomics
> > debug is enough?
> I think shared maintainer ship could be too much as there are many changes.
> But, I and other engineers from Arm (I would include Arm ecosystem as well)
> can definitely help out on debug and reviews involving C11 APIs. (I see
> Thomas's reply on this topic).
> 
> >
> >
> > Hope my concerns are understandable, and of course input/feedback
> > welcomed! -Harry
> No worries at all. We are here to help and make this as easy as possible.
> 
> >
> >
> > PS: Apologies for the delay in reply - was OOO on Irish national holiday.
> Same here, was on vacation for 3 days.
> 
> >
> >
> > > v3:
> > > add libatomic dependency for 32-bit clang
> > >
> > > v2:
> > > 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> > > 2. fix typos.
> > >
> > > Honnappa Nagarahalli (2):
> > >   service: avoid race condition for MT unsafe service
> > >   service: identify service running on another core correctly
> > >
> > > Phil Yang (10):
> > >   doc: add generic atomic deprecation section
> > >   devtools: prevent use of rte atomic APIs in future patches
> > >   eal/build: add libatomic dependency for 32-bit clang
> > >   build: remove redundant code
> > >   vhost: optimize broadcast rarp sync with c11 atomic
> > >   ipsec: optimize with c11 atomic for sa outbound sqn update
> > >   service: remove rte prefix from static functions
> > >   service: remove redundant code
> > >   service: optimize with c11 one-way barrier
> > >   service: relax barriers with C11 atomic operations
> > >
> > >  devtools/checkpatches.sh                         |   9 ++
> > >  doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
> > >  drivers/event/octeontx/meson.build               |   5 -
> > >  drivers/event/octeontx2/meson.build              |   5 -
> > >  drivers/event/opdl/meson.build                   |   5 -
> > >  lib/librte_eal/common/rte_service.c              | 175 ++++++++++++----------
> > > -
> > >  lib/librte_eal/meson.build                       |   6 +
> > >  lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
> > >  lib/librte_ipsec/sa.h                            |   2 +-
> > >  lib/librte_rcu/meson.build                       |   5 -
> > >  lib/librte_vhost/vhost.h                         |   2 +-
> > >  lib/librte_vhost/vhost_user.c                    |   7 +-
> > >  lib/librte_vhost/virtio_net.c                    |  16 ++-
> > >  13 files changed, 181 insertions(+), 119 deletions(-)
> > >
> > > --
> > > 2.7.4
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
@ 2020-03-23 18:48       ` Ananyev, Konstantin
  2020-03-23 19:07         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 52+ messages in thread
From: Ananyev, Konstantin @ 2020-03-23 18:48 UTC (permalink / raw)
  To: Phil Yang, thomas, Van Haaren, Harry, stephen, maxime.coquelin,
	dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

Hi Phil,

> 
> For SA outbound packets, rte_atomic64_add_return is used to generate
> SQN atomically. This introduced an unnecessary full barrier by calling
> the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> patch optimized it with c11 atomic and eliminated the expensive barrier
> for aarch64.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---
>  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
>  lib/librte_ipsec/sa.h        | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ipsec/ipsec_sqn.h b/lib/librte_ipsec/ipsec_sqn.h
> index 0c2f76a..e884af7 100644
> --- a/lib/librte_ipsec/ipsec_sqn.h
> +++ b/lib/librte_ipsec/ipsec_sqn.h
> @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa, uint32_t *num)
> 
>  	n = *num;
>  	if (SQN_ATOMIC(sa))
> -		sqn = (uint64_t)rte_atomic64_add_return(&sa->sqn.outb.atom, n);
> +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> +			__ATOMIC_RELAXED);

One generic thing to note:
clang for i686 in some cases will generate a proper function call for
64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for such cases).
Does anyone consider it as a potential problem?
It probably not a big deal, but would like to know broader opinion.

>  	else {
>  		sqn = sa->sqn.outb.raw + n;
>  		sa->sqn.outb.raw = sqn;
> diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h
> index d22451b..cab9a2e 100644
> --- a/lib/librte_ipsec/sa.h
> +++ b/lib/librte_ipsec/sa.h
> @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
>  	 */
>  	union {
>  		union {
> -			rte_atomic64_t atom;
> +			uint64_t atom;
>  			uint64_t raw;
>  		} outb;

If we don't need rte_atomic64 here anymore,
then I think we can collapse the union to just:
uint64_t outb; 

>  		struct {
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 18:48       ` Ananyev, Konstantin
@ 2020-03-23 19:07         ` Honnappa Nagarahalli
  2020-03-23 19:18           ` Ananyev, Konstantin
  0 siblings, 1 reply; 52+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-23 19:07 UTC (permalink / raw)
  To: Ananyev, Konstantin, Phil Yang, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>

> Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound
> sqn update
> 
> Hi Phil,
> 
> >
> > For SA outbound packets, rte_atomic64_add_return is used to generate
> > SQN atomically. This introduced an unnecessary full barrier by calling
> > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > patch optimized it with c11 atomic and eliminated the expensive
> > barrier for aarch64.
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > ---
> >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> >  lib/librte_ipsec/sa.h        | 2 +-
> >  2 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > --- a/lib/librte_ipsec/ipsec_sqn.h
> > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > uint32_t *num)
> >
> >  	n = *num;
> >  	if (SQN_ATOMIC(sa))
> > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> >sqn.outb.atom, n);
> > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > +			__ATOMIC_RELAXED);
> 
> One generic thing to note:
> clang for i686 in some cases will generate a proper function call for 64-bit
> __atomic builtins (gcc seems to always generate cmpxchng8b for such cases).
> Does anyone consider it as a potential problem?
> It probably not a big deal, but would like to know broader opinion.
I had looked at this some time back for GCC. The function call is generated only if the underlying platform does not support the atomic instructions for the operand size. Otherwise, gcc generates the instructions directly.
I would think the behavior would be the same for clang.

> 
> >  	else {
> >  		sqn = sa->sqn.outb.raw + n;
> >  		sa->sqn.outb.raw = sqn;
> > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > d22451b..cab9a2e 100644
> > --- a/lib/librte_ipsec/sa.h
> > +++ b/lib/librte_ipsec/sa.h
> > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> >  	 */
> >  	union {
> >  		union {
> > -			rte_atomic64_t atom;
> > +			uint64_t atom;
> >  			uint64_t raw;
> >  		} outb;
> 
> If we don't need rte_atomic64 here anymore, then I think we can collapse the
> union to just:
> uint64_t outb;
> 
> >  		struct {
> > --
> > 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 19:07         ` Honnappa Nagarahalli
@ 2020-03-23 19:18           ` Ananyev, Konstantin
  2020-03-23 20:20             ` Honnappa Nagarahalli
  2020-03-24 10:37             ` Phil Yang
  0 siblings, 2 replies; 52+ messages in thread
From: Ananyev, Konstantin @ 2020-03-23 19:18 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd



> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Monday, March 23, 2020 7:08 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Van Haaren, Harry
> <harry.van.haaren@intel.com>; stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
> 
> <snip>
> 
> > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound
> > sqn update
> >
> > Hi Phil,
> >
> > >
> > > For SA outbound packets, rte_atomic64_add_return is used to generate
> > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > > patch optimized it with c11 atomic and eliminated the expensive
> > > barrier for aarch64.
> > >
> > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > ---
> > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > >  lib/librte_ipsec/sa.h        | 2 +-
> > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > uint32_t *num)
> > >
> > >  	n = *num;
> > >  	if (SQN_ATOMIC(sa))
> > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > >sqn.outb.atom, n);
> > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > +			__ATOMIC_RELAXED);
> >
> > One generic thing to note:
> > clang for i686 in some cases will generate a proper function call for 64-bit
> > __atomic builtins (gcc seems to always generate cmpxchng8b for such cases).
> > Does anyone consider it as a potential problem?
> > It probably not a big deal, but would like to know broader opinion.
> I had looked at this some time back for GCC. The function call is generated only if the underlying platform does not support the atomic
> instructions for the operand size. Otherwise, gcc generates the instructions directly.
> I would think the behavior would be the same for clang.

From what I see not really.
As an example:

$ cat tatm11.c
#include <stdint.h>

struct x {
        uint64_t v __attribute__((aligned(8)));
};

uint64_t
ffxadd1(struct x *x, uint32_t n, uint32_t m)
{
        return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
}

uint64_t
ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
{
        return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
}

gcc for i686 will generate code with cmpxchng8b for both cases.
clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B aligned,
but will emit a function call for ffxadd11().

> 
> >
> > >  	else {
> > >  		sqn = sa->sqn.outb.raw + n;
> > >  		sa->sqn.outb.raw = sqn;
> > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > d22451b..cab9a2e 100644
> > > --- a/lib/librte_ipsec/sa.h
> > > +++ b/lib/librte_ipsec/sa.h
> > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > >  	 */
> > >  	union {
> > >  		union {
> > > -			rte_atomic64_t atom;
> > > +			uint64_t atom;
> > >  			uint64_t raw;
> > >  		} outb;
> >
> > If we don't need rte_atomic64 here anymore, then I think we can collapse the
> > union to just:
> > uint64_t outb;
> >
> > >  		struct {
> > > --
> > > 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 19:18           ` Ananyev, Konstantin
@ 2020-03-23 20:20             ` Honnappa Nagarahalli
  2020-03-24 13:10               ` Ananyev, Konstantin
  2020-03-24 10:37             ` Phil Yang
  1 sibling, 1 reply; 52+ messages in thread
From: Honnappa Nagarahalli @ 2020-03-23 20:20 UTC (permalink / raw)
  To: Ananyev, Konstantin, Phil Yang, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, Honnappa Nagarahalli, nd

<snip>

> > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> > > outbound sqn update
> > >
> > > Hi Phil,
> > >
> > > >
> > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > > generate SQN atomically. This introduced an unnecessary full
> > > > barrier by calling the '__sync' builtin implemented rte_atomic_XX
> > > > API on aarch64. This patch optimized it with c11 atomic and
> > > > eliminated the expensive barrier for aarch64.
> > > >
> > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > ---
> > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > uint32_t *num)
> > > >
> > > >  	n = *num;
> > > >  	if (SQN_ATOMIC(sa))
> > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > >sqn.outb.atom, n);
> > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > +			__ATOMIC_RELAXED);
> > >
> > > One generic thing to note:
> > > clang for i686 in some cases will generate a proper function call
> > > for 64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for
> such cases).
> > > Does anyone consider it as a potential problem?
> > > It probably not a big deal, but would like to know broader opinion.
> > I had looked at this some time back for GCC. The function call is
> > generated only if the underlying platform does not support the atomic
> instructions for the operand size. Otherwise, gcc generates the instructions
> directly.
> > I would think the behavior would be the same for clang.
> 
> From what I see not really.
> As an example:
> 
> $ cat tatm11.c
> #include <stdint.h>
> 
> struct x {
>         uint64_t v __attribute__((aligned(8))); };
> 
> uint64_t
> ffxadd1(struct x *x, uint32_t n, uint32_t m) {
>         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED); }
> 
> uint64_t
> ffxadd11(uint64_t *v, uint32_t n, uint32_t m) {
>         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED); }
> 
> gcc for i686 will generate code with cmpxchng8b for both cases.
> clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> aligned, but will emit a function call for ffxadd11().
Does it require libatomic to be linked in this case? Clang documentation calls out unaligned case where it would generate the function call [1].
On aarch64, the atomic instructions need the address to be aligned.

[1] https://clang.llvm.org/docs/Toolchain.html#atomics-library

> 
> >
> > >
> > > >  	else {
> > > >  		sqn = sa->sqn.outb.raw + n;
> > > >  		sa->sqn.outb.raw = sqn;
> > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > d22451b..cab9a2e 100644
> > > > --- a/lib/librte_ipsec/sa.h
> > > > +++ b/lib/librte_ipsec/sa.h
> > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > >  	 */
> > > >  	union {
> > > >  		union {
> > > > -			rte_atomic64_t atom;
> > > > +			uint64_t atom;
> > > >  			uint64_t raw;
> > > >  		} outb;
> > >
> > > If we don't need rte_atomic64 here anymore, then I think we can
> > > collapse the union to just:
> > > uint64_t outb;
> > >
> > > >  		struct {
> > > > --
> > > > 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 19:18           ` Ananyev, Konstantin
  2020-03-23 20:20             ` Honnappa Nagarahalli
@ 2020-03-24 10:37             ` Phil Yang
  2020-03-24 11:03               ` Ananyev, Konstantin
  1 sibling, 1 reply; 52+ messages in thread
From: Phil Yang @ 2020-03-24 10:37 UTC (permalink / raw)
  To: Ananyev, Konstantin, Honnappa Nagarahalli, thomas, Van Haaren,
	Harry, stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd, nd

Hi Konstantin,

<snip>
> >
> > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> outbound
> > > sqn update
> > >
> > > Hi Phil,
> > >
> > > >
> > > > For SA outbound packets, rte_atomic64_add_return is used to
> generate
> > > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > > > patch optimized it with c11 atomic and eliminated the expensive
> > > > barrier for aarch64.
> > > >
> > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > ---
> > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > uint32_t *num)
> > > >
> > > >  	n = *num;
> > > >  	if (SQN_ATOMIC(sa))
> > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > >sqn.outb.atom, n);
> > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > +			__ATOMIC_RELAXED);
> > >
> > > One generic thing to note:
> > > clang for i686 in some cases will generate a proper function call for 64-bit
> > > __atomic builtins (gcc seems to always generate cmpxchng8b for such
> cases).
> > > Does anyone consider it as a potential problem?
> > > It probably not a big deal, but would like to know broader opinion.
> > I had looked at this some time back for GCC. The function call is generated
> only if the underlying platform does not support the atomic
> > instructions for the operand size. Otherwise, gcc generates the instructions
> directly.
> > I would think the behavior would be the same for clang.
> 
> From what I see not really.
> As an example:
> 
> $ cat tatm11.c
> #include <stdint.h>
> 
> struct x {
>         uint64_t v __attribute__((aligned(8)));
> };
> 
> uint64_t
> ffxadd1(struct x *x, uint32_t n, uint32_t m)
> {
>         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> }
> 
> uint64_t
> ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> {
>         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> }
> 
> gcc for i686 will generate code with cmpxchng8b for both cases.
> clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> aligned,
> but will emit a function call for ffxadd11().

I guess your testbed is an i386 platform.  However, what I see here is different.

Testbed i686:  Ubuntu 18.04.4 LTS/GCC 8.3/ Clang 9.0.0-2
Both Clang and GCC for i686 generate code with xadd for these two cases.

Testbed i386:  Ubuntu 16.04 LTS (Installed libatomic)/GCC 5.4.0/ Clang 4.0.0
GCC will generate code with cmpxchng8b for both cases.
Clang generated code emits a function call for both cases.

Thanks,
Phil
> 
> >
> > >
> > > >  	else {
> > > >  		sqn = sa->sqn.outb.raw + n;
> > > >  		sa->sqn.outb.raw = sqn;
> > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > d22451b..cab9a2e 100644
> > > > --- a/lib/librte_ipsec/sa.h
> > > > +++ b/lib/librte_ipsec/sa.h
> > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > >  	 */
> > > >  	union {
> > > >  		union {
> > > > -			rte_atomic64_t atom;
> > > > +			uint64_t atom;
> > > >  			uint64_t raw;
> > > >  		} outb;
> > >
> > > If we don't need rte_atomic64 here anymore, then I think we can
> collapse the
> > > union to just:
> > > uint64_t outb;
> > >
> > > >  		struct {
> > > > --
> > > > 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-24 10:37             ` Phil Yang
@ 2020-03-24 11:03               ` Ananyev, Konstantin
  2020-03-25  9:38                 ` Phil Yang
  0 siblings, 1 reply; 52+ messages in thread
From: Ananyev, Konstantin @ 2020-03-24 11:03 UTC (permalink / raw)
  To: Phil Yang, Honnappa Nagarahalli, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd, nd


Hi Phil,

> <snip>
> > >
> > > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> > outbound
> > > > sqn update
> > > >
> > > > Hi Phil,
> > > >
> > > > >
> > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > generate
> > > > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > > > the '__sync' builtin implemented rte_atomic_XX API on aarch64. This
> > > > > patch optimized it with c11 atomic and eliminated the expensive
> > > > > barrier for aarch64.
> > > > >
> > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > ---
> > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > > uint32_t *num)
> > > > >
> > > > >  	n = *num;
> > > > >  	if (SQN_ATOMIC(sa))
> > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > > >sqn.outb.atom, n);
> > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > +			__ATOMIC_RELAXED);
> > > >
> > > > One generic thing to note:
> > > > clang for i686 in some cases will generate a proper function call for 64-bit
> > > > __atomic builtins (gcc seems to always generate cmpxchng8b for such
> > cases).
> > > > Does anyone consider it as a potential problem?
> > > > It probably not a big deal, but would like to know broader opinion.
> > > I had looked at this some time back for GCC. The function call is generated
> > only if the underlying platform does not support the atomic
> > > instructions for the operand size. Otherwise, gcc generates the instructions
> > directly.
> > > I would think the behavior would be the same for clang.
> >
> > From what I see not really.
> > As an example:
> >
> > $ cat tatm11.c
> > #include <stdint.h>
> >
> > struct x {
> >         uint64_t v __attribute__((aligned(8)));
> > };
> >
> > uint64_t
> > ffxadd1(struct x *x, uint32_t n, uint32_t m)
> > {
> >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> > }
> >
> > uint64_t
> > ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> > {
> >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> > }
> >
> > gcc for i686 will generate code with cmpxchng8b for both cases.
> > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > aligned,
> > but will emit a function call for ffxadd11().
> 
> I guess your testbed is an i386 platform.  However, what I see here is different.
> 
> Testbed i686:  Ubuntu 18.04.4 LTS/GCC 8.3/ Clang 9.0.0-2
> Both Clang and GCC for i686 generate code with xadd for these two cases.

I suppose you meant x86_64 here (-m64), right?


> 
> Testbed i386:  Ubuntu 16.04 LTS (Installed libatomic)/GCC 5.4.0/ Clang 4.0.0
> GCC will generate code with cmpxchng8b for both cases.
> Clang generated code emits a function call for both cases.

That's exactly what I am talking about above.
X86_64 (64 bit binary) - no function calls for both gcc and clang
i686 (32 bit binary) - no function calls with gcc, functions calls with clang
when explicit alignment is not specified.

As I said in my initial email, that's probably not a big deal -
from what I was told so far we don't officially support clang for IA-32
and I don't know does anyone uses it at all right now.
Though if someone thinks it is a potential problem here -
it is better to flag it at early stage.
So once again my questions to the community:
1/ Does anyone builds/uses DPDK with i686-clang? 
2/ If there are anyone, can these persons try to evaluate
how big perf drop it would cause for them?  
3/ Is there an option to switch to i686-gcc (supported one)?
Konstantin

> >
> > >
> > > >
> > > > >  	else {
> > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > >  		sa->sqn.outb.raw = sqn;
> > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > d22451b..cab9a2e 100644
> > > > > --- a/lib/librte_ipsec/sa.h
> > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > >  	 */
> > > > >  	union {
> > > > >  		union {
> > > > > -			rte_atomic64_t atom;
> > > > > +			uint64_t atom;
> > > > >  			uint64_t raw;
> > > > >  		} outb;
> > > >
> > > > If we don't need rte_atomic64 here anymore, then I think we can
> > collapse the
> > > > union to just:
> > > > uint64_t outb;
> > > >
> > > > >  		struct {
> > > > > --
> > > > > 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-23 20:20             ` Honnappa Nagarahalli
@ 2020-03-24 13:10               ` Ananyev, Konstantin
  2020-03-24 13:21                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 52+ messages in thread
From: Ananyev, Konstantin @ 2020-03-24 13:10 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Van Haaren, Harry,
	stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd


> > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > > > generate SQN atomically. This introduced an unnecessary full
> > > > > barrier by calling the '__sync' builtin implemented rte_atomic_XX
> > > > > API on aarch64. This patch optimized it with c11 atomic and
> > > > > eliminated the expensive barrier for aarch64.
> > > > >
> > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > ---
> > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > > uint32_t *num)
> > > > >
> > > > >  	n = *num;
> > > > >  	if (SQN_ATOMIC(sa))
> > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > > >sqn.outb.atom, n);
> > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > +			__ATOMIC_RELAXED);
> > > >
> > > > One generic thing to note:
> > > > clang for i686 in some cases will generate a proper function call
> > > > for 64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for
> > such cases).
> > > > Does anyone consider it as a potential problem?
> > > > It probably not a big deal, but would like to know broader opinion.
> > > I had looked at this some time back for GCC. The function call is
> > > generated only if the underlying platform does not support the atomic
> > instructions for the operand size. Otherwise, gcc generates the instructions
> > directly.
> > > I would think the behavior would be the same for clang.
> >
> > From what I see not really.
> > As an example:
> >
> > $ cat tatm11.c
> > #include <stdint.h>
> >
> > struct x {
> >         uint64_t v __attribute__((aligned(8))); };
> >
> > uint64_t
> > ffxadd1(struct x *x, uint32_t n, uint32_t m) {
> >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED); }
> >
> > uint64_t
> > ffxadd11(uint64_t *v, uint32_t n, uint32_t m) {
> >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED); }
> >
> > gcc for i686 will generate code with cmpxchng8b for both cases.
> > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > aligned, but will emit a function call for ffxadd11().
> Does it require libatomic to be linked in this case? 

Yes, it does.
In fact same story even with current dpdk.org master.
To make i686-native-linuxapp-clang successfully, I have to 
explicitly add EXTRA_LDFLAGS="-latomic".
To be more specific:
$ for i in i686-native-linuxapp-clang/lib/*.a; do x=`nm $i | grep __atomic_`; if [[ -n "${x}" ]]; then echo $i; echo $x; fi; done
i686-native-linuxapp-clang/lib/librte_distributor.a
U __atomic_load_8 U __atomic_store_8
i686-native-linuxapp-clang/lib/librte_pmd_opdl_event.a
U __atomic_load_8 U __atomic_store_8
i686-native-linuxapp-clang/lib/librte_rcu.a
U __atomic_compare_exchange_8 U __atomic_load_8

As there were no complains so far, it makes me think that
probably no-one using clang for IA-32 builds.

> Clang documentation calls out unaligned case where it would generate the function call
> [1].

Seems so, and it treats uin64_t as 4B aligned for IA.
 
> On aarch64, the atomic instructions need the address to be aligned.

For that particular case (cmpxchng8b) there is no such restrictions for IA-32.
Again, as I said before, gcc manages to emit code without function calls
for exactly the same source.

> 
> [1] https://clang.llvm.org/docs/Toolchain.html#atomics-library
> 
> >
> > >
> > > >
> > > > >  	else {
> > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > >  		sa->sqn.outb.raw = sqn;
> > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > d22451b..cab9a2e 100644
> > > > > --- a/lib/librte_ipsec/sa.h
> > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > >  	 */
> > > > >  	union {
> > > > >  		union {
> > > > > -			rte_atomic64_t atom;
> > > > > +			uint64_t atom;
> > > > >  			uint64_t raw;
> > > > >  		} outb;
> > > >
> > > > If we don't need rte_atomic64 here anymore, then I think we can
> > > > collapse the union to just:
> > > > uint64_t outb;
> > > >
> > > > >  		struct {
> > > > > --
> > > > > 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-24 13:10               ` Ananyev, Konstantin
@ 2020-03-24 13:21                 ` Ananyev, Konstantin
  0 siblings, 0 replies; 52+ messages in thread
From: Ananyev, Konstantin @ 2020-03-24 13:21 UTC (permalink / raw)
  To: Ananyev, Konstantin, Honnappa Nagarahalli, Phil Yang, thomas,
	Van Haaren, Harry, stephen, maxime.coquelin, dev, Richardson,
	Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Ananyev, Konstantin
> Sent: Tuesday, March 24, 2020 1:10 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Phil Yang <Phil.Yang@arm.com>; thomas@monjalon.net; Van Haaren,
> Harry <harry.van.haaren@intel.com>; stephen@networkplumber.org; maxime.coquelin@redhat.com; dev@dpdk.org; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
> 
> 
> > > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > > > > generate SQN atomically. This introduced an unnecessary full
> > > > > > barrier by calling the '__sync' builtin implemented rte_atomic_XX
> > > > > > API on aarch64. This patch optimized it with c11 atomic and
> > > > > > eliminated the expensive barrier for aarch64.
> > > > > >
> > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > ---
> > > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa *sa,
> > > > > > uint32_t *num)
> > > > > >
> > > > > >  	n = *num;
> > > > > >  	if (SQN_ATOMIC(sa))
> > > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > > > >sqn.outb.atom, n);
> > > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > > +			__ATOMIC_RELAXED);
> > > > >
> > > > > One generic thing to note:
> > > > > clang for i686 in some cases will generate a proper function call
> > > > > for 64-bit __atomic builtins (gcc seems to always generate cmpxchng8b for
> > > such cases).
> > > > > Does anyone consider it as a potential problem?
> > > > > It probably not a big deal, but would like to know broader opinion.
> > > > I had looked at this some time back for GCC. The function call is
> > > > generated only if the underlying platform does not support the atomic
> > > instructions for the operand size. Otherwise, gcc generates the instructions
> > > directly.
> > > > I would think the behavior would be the same for clang.
> > >
> > > From what I see not really.
> > > As an example:
> > >
> > > $ cat tatm11.c
> > > #include <stdint.h>
> > >
> > > struct x {
> > >         uint64_t v __attribute__((aligned(8))); };
> > >
> > > uint64_t
> > > ffxadd1(struct x *x, uint32_t n, uint32_t m) {
> > >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED); }
> > >
> > > uint64_t
> > > ffxadd11(uint64_t *v, uint32_t n, uint32_t m) {
> > >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED); }
> > >
> > > gcc for i686 will generate code with cmpxchng8b for both cases.
> > > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > > aligned, but will emit a function call for ffxadd11().
> > Does it require libatomic to be linked in this case?
> 
> Yes, it does.
> In fact same story even with current dpdk.org master.
> To make i686-native-linuxapp-clang successfully, I have to
> explicitly add EXTRA_LDFLAGS="-latomic".
> To be more specific:
> $ for i in i686-native-linuxapp-clang/lib/*.a; do x=`nm $i | grep __atomic_`; if [[ -n "${x}" ]]; then echo $i; echo $x; fi; done
> i686-native-linuxapp-clang/lib/librte_distributor.a
> U __atomic_load_8 U __atomic_store_8
> i686-native-linuxapp-clang/lib/librte_pmd_opdl_event.a
> U __atomic_load_8 U __atomic_store_8
> i686-native-linuxapp-clang/lib/librte_rcu.a
> U __atomic_compare_exchange_8 U __atomic_load_8
> 
> As there were no complains so far, it makes me think that
> probably no-one using clang for IA-32 builds.
> 
> > Clang documentation calls out unaligned case where it would generate the function call
> > [1].
> 
> Seems so, and it treats uin64_t as 4B aligned for IA.
correction: for IA-32

> 
> > On aarch64, the atomic instructions need the address to be aligned.
> 
> For that particular case (cmpxchng8b) there is no such restrictions for IA-32.
> Again, as I said before, gcc manages to emit code without function calls
> for exactly the same source.
> 
> >
> > [1] https://clang.llvm.org/docs/Toolchain.html#atomics-library
> >
> > >
> > > >
> > > > >
> > > > > >  	else {
> > > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > > >  		sa->sqn.outb.raw = sqn;
> > > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > > d22451b..cab9a2e 100644
> > > > > > --- a/lib/librte_ipsec/sa.h
> > > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > > >  	 */
> > > > > >  	union {
> > > > > >  		union {
> > > > > > -			rte_atomic64_t atom;
> > > > > > +			uint64_t atom;
> > > > > >  			uint64_t raw;
> > > > > >  		} outb;
> > > > >
> > > > > If we don't need rte_atomic64 here anymore, then I think we can
> > > > > collapse the union to just:
> > > > > uint64_t outb;
> > > > >
> > > > > >  		struct {
> > > > > > --
> > > > > > 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update
  2020-03-24 11:03               ` Ananyev, Konstantin
@ 2020-03-25  9:38                 ` Phil Yang
  0 siblings, 0 replies; 52+ messages in thread
From: Phil Yang @ 2020-03-25  9:38 UTC (permalink / raw)
  To: Ananyev, Konstantin, Honnappa Nagarahalli, thomas, Van Haaren,
	Harry, stephen, maxime.coquelin, dev, Richardson, Bruce
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, nd, nd, nd, nd

> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Sent: Tuesday, March 24, 2020 7:04 PM
> To: Phil Yang <Phil.Yang@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; thomas@monjalon.net; Van Haaren,
> Harry <harry.van.haaren@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: david.marchand@redhat.com; jerinj@marvell.com;
> hemant.agrawal@nxp.com; Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Joyce Kong <Joyce.Kong@arm.com>; nd
> <nd@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> outbound sqn update
> 
> 
> Hi Phil,
> 
> > <snip>
> > > >
> > > > > Subject: RE: [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa
> > > outbound
> > > > > sqn update
> > > > >
> > > > > Hi Phil,
> > > > >
> > > > > >
> > > > > > For SA outbound packets, rte_atomic64_add_return is used to
> > > generate
> > > > > > SQN atomically. This introduced an unnecessary full barrier by calling
> > > > > > the '__sync' builtin implemented rte_atomic_XX API on aarch64.
> This
> > > > > > patch optimized it with c11 atomic and eliminated the expensive
> > > > > > barrier for aarch64.
> > > > > >
> > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > ---
> > > > > >  lib/librte_ipsec/ipsec_sqn.h | 3 ++-
> > > > > >  lib/librte_ipsec/sa.h        | 2 +-
> > > > > >  2 files changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > b/lib/librte_ipsec/ipsec_sqn.h index 0c2f76a..e884af7 100644
> > > > > > --- a/lib/librte_ipsec/ipsec_sqn.h
> > > > > > +++ b/lib/librte_ipsec/ipsec_sqn.h
> > > > > > @@ -128,7 +128,8 @@ esn_outb_update_sqn(struct rte_ipsec_sa
> *sa,
> > > > > > uint32_t *num)
> > > > > >
> > > > > >  	n = *num;
> > > > > >  	if (SQN_ATOMIC(sa))
> > > > > > -		sqn = (uint64_t)rte_atomic64_add_return(&sa-
> > > > > >sqn.outb.atom, n);
> > > > > > +		sqn = __atomic_add_fetch(&sa->sqn.outb.atom, n,
> > > > > > +			__ATOMIC_RELAXED);
> > > > >
> > > > > One generic thing to note:
> > > > > clang for i686 in some cases will generate a proper function call for 64-
> bit
> > > > > __atomic builtins (gcc seems to always generate cmpxchng8b for such
> > > cases).
> > > > > Does anyone consider it as a potential problem?
> > > > > It probably not a big deal, but would like to know broader opinion.
> > > > I had looked at this some time back for GCC. The function call is
> generated
> > > only if the underlying platform does not support the atomic
> > > > instructions for the operand size. Otherwise, gcc generates the
> instructions
> > > directly.
> > > > I would think the behavior would be the same for clang.
> > >
> > > From what I see not really.
> > > As an example:
> > >
> > > $ cat tatm11.c
> > > #include <stdint.h>
> > >
> > > struct x {
> > >         uint64_t v __attribute__((aligned(8)));
> > > };
> > >
> > > uint64_t
> > > ffxadd1(struct x *x, uint32_t n, uint32_t m)
> > > {
> > >         return __atomic_add_fetch(&x->v, n, __ATOMIC_RELAXED);
> > > }
> > >
> > > uint64_t
> > > ffxadd11(uint64_t *v, uint32_t n, uint32_t m)
> > > {
> > >         return __atomic_add_fetch(v, n, __ATOMIC_RELAXED);
> > > }
> > >
> > > gcc for i686 will generate code with cmpxchng8b for both cases.
> > > clang will generate cmpxchng8b for ffxadd1() - when data is explicitly 8B
> > > aligned,
> > > but will emit a function call for ffxadd11().
> >
> > I guess your testbed is an i386 platform.  However, what I see here is
> different.
> >
> > Testbed i686:  Ubuntu 18.04.4 LTS/GCC 8.3/ Clang 9.0.0-2
> > Both Clang and GCC for i686 generate code with xadd for these two cases.
> 
> I suppose you meant x86_64 here (-m64), right?

Yes. It is x86_64 here.

> 
> 
> >
> > Testbed i386:  Ubuntu 16.04 LTS (Installed libatomic)/GCC 5.4.0/ Clang 4.0.0
> > GCC will generate code with cmpxchng8b for both cases.
> > Clang generated code emits a function call for both cases.
> 
> That's exactly what I am talking about above.
> X86_64 (64 bit binary) - no function calls for both gcc and clang
> i686 (32 bit binary) - no function calls with gcc, functions calls with clang
> when explicit alignment is not specified.
> 
> As I said in my initial email, that's probably not a big deal -
> from what I was told so far we don't officially support clang for IA-32
> and I don't know does anyone uses it at all right now.
> Though if someone thinks it is a potential problem here -
> it is better to flag it at early stage.
> So once again my questions to the community:
> 1/ Does anyone builds/uses DPDK with i686-clang?
> 2/ If there are anyone, can these persons try to evaluate
> how big perf drop it would cause for them?
> 3/ Is there an option to switch to i686-gcc (supported one)?
> Konstantin
> 
> > >
> > > >
> > > > >
> > > > > >  	else {
> > > > > >  		sqn = sa->sqn.outb.raw + n;
> > > > > >  		sa->sqn.outb.raw = sqn;
> > > > > > diff --git a/lib/librte_ipsec/sa.h b/lib/librte_ipsec/sa.h index
> > > > > > d22451b..cab9a2e 100644
> > > > > > --- a/lib/librte_ipsec/sa.h
> > > > > > +++ b/lib/librte_ipsec/sa.h
> > > > > > @@ -120,7 +120,7 @@ struct rte_ipsec_sa {
> > > > > >  	 */
> > > > > >  	union {
> > > > > >  		union {
> > > > > > -			rte_atomic64_t atom;
> > > > > > +			uint64_t atom;
> > > > > >  			uint64_t raw;
> > > > > >  		} outb;
> > > > >
> > > > > If we don't need rte_atomic64 here anymore, then I think we can
> > > collapse the
> > > > > union to just:
> > > > > uint64_t outb;
> > > > >
> > > > > >  		struct {
> > > > > > --
> > > > > > 2.7.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-20 18:32         ` Honnappa Nagarahalli
@ 2020-03-27 14:47           ` Van Haaren, Harry
  0 siblings, 0 replies; 52+ messages in thread
From: Van Haaren, Harry @ 2020-03-27 14:47 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Phil Yang, thomas, Ananyev, Konstantin,
	stephen, maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Gavin Hu, Ruifeng Wang,
	Joyce Kong, Carrillo, Erik G, nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Friday, March 20, 2020 6:32 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Phil Yang
> <Phil.Yang@arm.com>; thomas@monjalon.net; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> maxime.coquelin@redhat.com; dev@dpdk.org
> Cc: david.marchand@redhat.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> Gavin Hu <Gavin.Hu@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Carrillo, Erik G <erik.g.carrillo@intel.com>; nd
> <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd
> <nd@arm.com>
> Subject: RE: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> 
> + Erik as there are similar changes to timer library
> 
> <snip>
> 
> >
> > > > Subject: [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
> > > >
> > > > DPDK provides generic rte_atomic APIs to do several atomic operations.
> > > > These APIs are using the deprecated __sync built-ins and enforce
> > > > full memory barriers on aarch64. However, full barriers are not
> > > > necessary in many use cases. In order to address such use cases, C
> > > > language offers
> > > > C11 atomic APIs. The C11 atomic APIs provide finer memory barrier
> > > > control by making use of the memory ordering parameter provided by
> > > > the
> > > user.
> > > > Various patches submitted in the past [2] and the patches in this
> > > > series indicate significant performance gains on multiple aarch64
> > > > CPUs and no performance loss on x86.
> > > >
> > > > But the existing rte_atomic API implementations cannot be changed as
> > > > the APIs do not take the memory ordering parameter. The only choice
> > > > available is replacing the usage of the rte_atomic APIs with C11
> > > > atomic APIs. In order to make this change, the following steps are
> > proposed:
> > > >
> > > > [1] deprecate rte_atomic APIs so that future patches do not use
> > > > rte_atomic APIs (a script is added to flag the usages).
> > > > [2] refactor the code that uses rte_atomic APIs to use c11 atomic
> APIs.
> > >
> > > On [1] above, I feel deprecating DPDKs atomic functions and failing
> > > checkpatch is a bit sudden. Perhaps noting that in a future release
> > > (20.11?) DPDK will move to a
> > > C11 based atomics model is a more gradual step to achieving the goal,
> > > and at that point add a checkpatch warning for additions of rte_atomic*?
> > We have been working on changing existing usages of rte_atomic APIs in
> DPDK
> > to use C11 atomics. Usually, the x.11 releases have significant amount of
> > changes (not sure how many would use rte_atomic APIs). I would prefer that
> > in 20.11 no additional code is added using rte_atomics APIs. However, I am
> > open to suggestions on the exact time frame.
> > Once we decide on the release, I think it makes sense to add a 'warning'
> in the
> > checkpatch to indicate the deprecation timeline and add an 'error' after
> the
> > release.

The above sounds reasonable - mainly let's not block any code that exists
or is being developed today using rte_atomic_* APIs from making a release.


> > > More on [2] in context below.
> > >
> > > The above is my point-of-view, of course I'd like more people from the
> > > DPDK community to provide their input too.
> > >
> > >
> > > > This patchset contains:
> > > > 1) the checkpatch script changes to flag rte_atomic API usage in
> patches.
> > > > 2) changes to programmer guide describing writing efficient code for
> > > aarch64.
> > > > 3) changes to various libraries to make use of c11 atomic APIs.
> > > >
> > > > We are planning to replicate this idea across all the other
> > > > libraries, drivers, examples, test applications. In the next phase,
> > > > we will add changes to the mbuf, the EAL interrupts and the event
> > > > timer adapter
> > > libraries.
> > >
> > > About ~6/12 patches of this C11 set are targeting the Service Cores
> > > area of DPDK. I have some concerns over increased complexity of C11
> > > implementation vs the (already complex) rte_atomic implementation today.
> > I agree that it C11 changes are complex, especially if one is starting out
> to
> > understand what these APIs provide. From my experience, once few
> > underlying concepts are understood, reviewing or making changes do not
> take
> > too much time.
> >
> > > I see other patchsets enabling C11 across other DPDK components, so
> > > maybe we should also discuss C11 enabling in a wider context that just
> > service cores?
> > Yes, agree. We are in the process of making changes to other areas as
> well.
> >
> > >
> > > I don't think it fair to expect all developers to be well versed in
> > > C11 atomic semantics like understanding the complex interactions
> > > between the various
> > > C11 RELEASE, AQUIRE barriers requires.
> > C11 has been around from sometime now. To my surprise, OVS already uses
> > C11 APIs extensively. VPP has been accepting C11 related changes from past
> > couple of years. Having said that, I agree in general that not everyone is
> > well versed.

Fair point - like so many things, once familiar with it, it becomes easy :)


> > > As maintainer of Service Cores I'm hesitant to accept the large-scale
> > > refactor
> > Right now, the patches are split into multiple commits. If required I can
> host a
> > call to go over simple C11 API usages (sufficient to cover the usage in
> service
> > core) and the changes in this patch. If you find that particular areas
> need more
> > understanding I can work on providing additional information such as
> memory
> > order ladder diagrams. Please let me know what you think.

Thanks for the offer - I will need to do my due diligence on reiview before
taking up any of your or other C11 folks time.

> When I started working with C11 APIs, I had referred to the following blogs.
> https://preshing.com/20120913/acquire-and-release-semantics/
> https://preshing.com/20130702/the-happens-before-relation/
> https://preshing.com/20130823/the-synchronizes-with-relation/
> 
> These will be helpful to understand the changes.

Thanks, indeed good articles. I found the following slide deck particularly
informative due to the fantastic diagrams (eg, slide 23):
https://mariadb.org/wp-content/uploads/2017/11/2017-11-Memory-barriers.pdf

That said, finding a nice diagram and understanding the implications of
actually using it is different! I hope to properly review the
service-cores patches next week.

> > > of atomic-implementation, as it could lead to racey bugs that are
> > > likely extremely difficult to track down. (The recent race-on-exit has
> > > proven the difficulty in reproducing, and that's with an atomics model
> > > I'm quite familiar with).
> > >
> > > Let me be very clear: I don't wish to block a C11 atomic
> > > implementation, but I'd like to discuss how we (DPDK community) can
> > > best port-to and maintain a complex multi-threaded service library
> > > with best-in-class performance for the workload.
> > >
> > > To put some discussions/solutions on the table:
> > > - Shared Maintainership of a component?
> > >      Split in functionality and C11 atomics implementation
> > >      Obviously there would be collaboration required in such a case.
> > > - Maybe shared maintainership is too much?
> > >      A gentlemans/womans agreement of "helping out" with C11 atomics
> > > debug is enough?
> > I think shared maintainer ship could be too much as there are many
> changes.
> > But, I and other engineers from Arm (I would include Arm ecosystem as
> well)
> > can definitely help out on debug and reviews involving C11 APIs. (I see
> > Thomas's reply on this topic).

Thanks for the offer - as above, ball on my side, I'll go review.

<snip>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal
  2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
                       ` (12 preceding siblings ...)
  2020-03-18 14:01     ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Van Haaren, Harry
@ 2020-04-03  7:23     ` Mattias Rönnblom
  13 siblings, 0 replies; 52+ messages in thread
From: Mattias Rönnblom @ 2020-04-03  7:23 UTC (permalink / raw)
  To: Phil Yang, thomas, harry.van.haaren, konstantin.ananyev, stephen,
	maxime.coquelin, dev
  Cc: david.marchand, jerinj, hemant.agrawal, Honnappa.Nagarahalli,
	gavin.hu, ruifeng.wang, joyce.kong, nd

On 2020-03-17 02:17, Phil Yang wrote:
> DPDK provides generic rte_atomic APIs to do several atomic operations.
> These APIs are using the deprecated __sync built-ins and enforce full
> memory barriers on aarch64. However, full barriers are not necessary
> in many use cases. In order to address such use cases, C language offers
> C11 atomic APIs. The C11 atomic APIs provide finer memory barrier control
> by making use of the memory ordering parameter provided by the user.
> Various patches submitted in the past [2] and the patches in this series
> indicate significant performance gains on multiple aarch64 CPUs and no
> performance loss on x86.
>
> But the existing rte_atomic API implementations cannot be changed as the
> APIs do not take the memory ordering parameter. The only choice available
> is replacing the usage of the rte_atomic APIs with C11 atomic APIs. In
> order to make this change, the following steps are proposed:

First of all I must say I much support the effort of introducing C11 
atomics or something equivalent into DPDK, across the board.


What's being proposed however, is not to use C11 atomics, but rather the 
GCC built-ins designed to allow an efficient C11 atomics implementation. 
The C11 atomic API is found in <stdatomic.h>. Also, the <rte_atomic.h> 
API is not using __sync. It doesn't dictate any particular 
implementation at all.


I don't think directly accessing GCC built-ins across the whole DPDK 
code base sounds like a good idea at all.


Beyond just being plain ugly, and setting a bad precedence, using 
built-ins directly also effectively prevents API extensions. Although 
C11 is new and shiny, I'm sure there will come a day when we want to 
extend this API, to make it easier for consumers and avoid code 
duplication. Some parts of the DPDK code base already today define their 
own __atomic_* functions. Bad idea to use the "__*" namespace, 
especially in a way that has a real risk of future collisions. It's also 
confusing for anyone reading the code, since they are led to believe 
it's a GCC built-in.


Direct calls to GCC built-ins also prevents the use of any other 
implementation than the GCC built-ins, if some ISA or ISA implementation 
would benefit from this. This should be avoided of course, so it's just 
a minor objection.


I think the right way to go about this is not to deprecate 
<rte_atomic.h>. Rather, <rte_atomic.h> should be reshaped into something 
that closely maps to the GCC built-ins for C11 (which seems more 
convenient than real C11 atomics). The parts of <rte_atomic.h> that 
doesn't fit the new model, should be deprecated.


To summarize, I'm not in favor of deprecating <rte_atomic.h>. If we 
should deprecate anything, it's directly accessing compiler built-ins.

> [1] deprecate rte_atomic APIs so that future patches do not use rte_atomic
> APIs (a script is added to flag the usages).
> [2] refactor the code that uses rte_atomic APIs to use c11 atomic APIs.
>
> This patchset contains:
> 1) the checkpatch script changes to flag rte_atomic API usage in patches.
> 2) changes to programmer guide describing writing efficient code for aarch64.
> 3) changes to various libraries to make use of c11 atomic APIs.
>
> We are planning to replicate this idea across all the other libraries,
> drivers, examples, test applications. In the next phase, we will add
> changes to the mbuf, the EAL interrupts and the event timer adapter libraries.
>
> v3:
> add libatomic dependency for 32-bit clang
>
> v2:
> 1. fix Clang '-Wincompatible-pointer-types' WARNING.
> 2. fix typos.
>
> Honnappa Nagarahalli (2):
>    service: avoid race condition for MT unsafe service
>    service: identify service running on another core correctly
>
> Phil Yang (10):
>    doc: add generic atomic deprecation section
>    devtools: prevent use of rte atomic APIs in future patches
>    eal/build: add libatomic dependency for 32-bit clang
>    build: remove redundant code
>    vhost: optimize broadcast rarp sync with c11 atomic
>    ipsec: optimize with c11 atomic for sa outbound sqn update
>    service: remove rte prefix from static functions
>    service: remove redundant code
>    service: optimize with c11 one-way barrier
>    service: relax barriers with C11 atomic operations
>
>   devtools/checkpatches.sh                         |   9 ++
>   doc/guides/prog_guide/writing_efficient_code.rst |  60 +++++++-
>   drivers/event/octeontx/meson.build               |   5 -
>   drivers/event/octeontx2/meson.build              |   5 -
>   drivers/event/opdl/meson.build                   |   5 -
>   lib/librte_eal/common/rte_service.c              | 175 ++++++++++++-----------
>   lib/librte_eal/meson.build                       |   6 +
>   lib/librte_ipsec/ipsec_sqn.h                     |   3 +-
>   lib/librte_ipsec/sa.h                            |   2 +-
>   lib/librte_rcu/meson.build                       |   5 -
>   lib/librte_vhost/vhost.h                         |   2 +-
>   lib/librte_vhost/vhost_user.c                    |   7 +-
>   lib/librte_vhost/virtio_net.c                    |  16 ++-
>   13 files changed, 181 insertions(+), 119 deletions(-)
>


^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, back to index

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-10 17:49 [dpdk-dev] [PATCH 00/10] generic rte atomic APIs deprecate proposal Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 01/10] doc: add generic atomic deprecation section Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 05/10] service: remove rte prefix from static functions Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 06/10] service: remove redundant code Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 07/10] service: avoid race condition for MT unsafe service Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 08/10] service: identify service running on another core correctly Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 09/10] service: optimize with c11 one-way barrier Phil Yang
2020-03-10 17:49 ` [dpdk-dev] [PATCH 10/10] service: relax barriers with C11 atomic operations Phil Yang
2020-03-12  7:44 ` [dpdk-dev] [PATCH v2 00/10] generic rte atomic APIs deprecate proposal Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 01/10] doc: add generic atomic deprecation section Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 02/10] devtools: prevent use of rte atomic APIs in future patches Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 03/10] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 04/10] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 05/10] service: remove rte prefix from static functions Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 06/10] service: remove redundant code Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 07/10] service: avoid race condition for MT unsafe service Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 08/10] service: identify service running on another core correctly Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 09/10] service: optimize with c11 one-way barrier Phil Yang
2020-03-12  7:44   ` [dpdk-dev] [PATCH v2 10/10] service: relax barriers with C11 atomic operations Phil Yang
2020-03-17  1:17   ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 01/12] doc: add generic atomic deprecation section Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 02/12] devtools: prevent use of rte atomic APIs in future patches Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 03/12] eal/build: add libatomic dependency for 32-bit clang Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 04/12] build: remove redundant code Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 05/12] vhost: optimize broadcast rarp sync with c11 atomic Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 06/12] ipsec: optimize with c11 atomic for sa outbound sqn update Phil Yang
2020-03-23 18:48       ` Ananyev, Konstantin
2020-03-23 19:07         ` Honnappa Nagarahalli
2020-03-23 19:18           ` Ananyev, Konstantin
2020-03-23 20:20             ` Honnappa Nagarahalli
2020-03-24 13:10               ` Ananyev, Konstantin
2020-03-24 13:21                 ` Ananyev, Konstantin
2020-03-24 10:37             ` Phil Yang
2020-03-24 11:03               ` Ananyev, Konstantin
2020-03-25  9:38                 ` Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 07/12] service: remove rte prefix from static functions Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 08/12] service: remove redundant code Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 09/12] service: avoid race condition for MT unsafe service Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 10/12] service: identify service running on another core correctly Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 11/12] service: optimize with c11 one-way barrier Phil Yang
2020-03-17  1:17     ` [dpdk-dev] [PATCH v3 12/12] service: relax barriers with C11 atomic operations Phil Yang
2020-03-18 14:01     ` [dpdk-dev] [PATCH v3 00/12] generic rte atomic APIs deprecate proposal Van Haaren, Harry
2020-03-18 15:13       ` Thomas Monjalon
2020-03-20  5:01         ` Honnappa Nagarahalli
2020-03-20 12:20           ` Jerin Jacob
2020-03-20  4:51       ` Honnappa Nagarahalli
2020-03-20 18:32         ` Honnappa Nagarahalli
2020-03-27 14:47           ` Van Haaren, Harry
2020-04-03  7:23     ` Mattias Rönnblom

DPDK patches and discussions

Archives are clonable:
	git clone --mirror http://inbox.dpdk.org/dev/0 dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dev dev/ http://inbox.dpdk.org/dev \
		dev@dpdk.org
	public-inbox-index dev


Newsgroup available over NNTP:
	nntp://inbox.dpdk.org/inbox.dpdk.dev


AGPL code for this site: git clone https://public-inbox.org/ public-inbox