- * Re: [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
@ 2019-03-28 18:42 ` Thomas Monjalon
  2019-03-28 18:42   ` Thomas Monjalon
  2019-03-29  1:34   ` Phil Yang (Arm Technology China)
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 0/3] example and test cases optimizations Phil Yang
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: Thomas Monjalon @ 2019-03-28 18:42 UTC (permalink / raw)
  To: Phil Yang; +Cc: dev, nd, honnappa.nagarahalli
03/01/2019 10:49, Phil Yang:
> '__sync' builtins are deprecated, use '__atomic' builtins instead for
> packet_ordering.
> 
> Fixes: 850f373 ("examples/packet_ordering: new sample app")
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Am I right there will be a new version of this patch?
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins
  2019-03-28 18:42 ` Thomas Monjalon
@ 2019-03-28 18:42   ` Thomas Monjalon
  2019-03-29  1:34   ` Phil Yang (Arm Technology China)
  1 sibling, 0 replies; 56+ messages in thread
From: Thomas Monjalon @ 2019-03-28 18:42 UTC (permalink / raw)
  To: Phil Yang; +Cc: dev, nd, honnappa.nagarahalli
03/01/2019 10:49, Phil Yang:
> '__sync' builtins are deprecated, use '__atomic' builtins instead for
> packet_ordering.
> 
> Fixes: 850f373 ("examples/packet_ordering: new sample app")
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Am I right there will be a new version of this patch?
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins
  2019-03-28 18:42 ` Thomas Monjalon
  2019-03-28 18:42   ` Thomas Monjalon
@ 2019-03-29  1:34   ` Phil Yang (Arm Technology China)
  2019-03-29  1:34     ` Phil Yang (Arm Technology China)
  1 sibling, 1 reply; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-03-29  1:34 UTC (permalink / raw)
  To: thomas; +Cc: dev, nd, Honnappa Nagarahalli, nd
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, March 29, 2019 2:43 AM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with
> atomic builtins
> 
> 03/01/2019 10:49, Phil Yang:
> > '__sync' builtins are deprecated, use '__atomic' builtins instead for
> > packet_ordering.
> >
> > Fixes: 850f373 ("examples/packet_ordering: new sample app")
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> 
> Am I right there will be a new version of this patch?
Hi Thomas,
Yes, there will be a new version of this patch. It is under our internal discussion.
I will upstream it out ASAP.
Thanks,
Phil
> 
> 
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins
  2019-03-29  1:34   ` Phil Yang (Arm Technology China)
@ 2019-03-29  1:34     ` Phil Yang (Arm Technology China)
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-03-29  1:34 UTC (permalink / raw)
  To: thomas; +Cc: dev, nd, Honnappa Nagarahalli, nd
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, March 29, 2019 2:43 AM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with
> atomic builtins
> 
> 03/01/2019 10:49, Phil Yang:
> > '__sync' builtins are deprecated, use '__atomic' builtins instead for
> > packet_ordering.
> >
> > Fixes: 850f373 ("examples/packet_ordering: new sample app")
> >
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> 
> Am I right there will be a new version of this patch?
Hi Thomas,
Yes, there will be a new version of this patch. It is under our internal discussion.
I will upstream it out ASAP.
Thanks,
Phil
> 
> 
^ permalink raw reply	[flat|nested] 56+ messages in thread
 
 
- * [dpdk-dev] [PATCH v2 0/3] example and test cases optimizations
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
  2019-03-28 18:42 ` Thomas Monjalon
@ 2019-03-29 10:56 ` Phil Yang
  2019-03-29 10:56   ` Phil Yang
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread Phil Yang
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 56+ messages in thread
From: Phil Yang @ 2019-03-29 10:56 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
v1:
Reimplement packet_ordering with __atomic one-way barrier.
v2:
1. Add per lcore statistics for each worker thread, removed __sync
builtins.
2. Reimplement test_distributor with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
3. Reimplement test_ring_perf with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
Phil Yang (3):
  packet_ordering: add statistics for each worker thread
  test/distributor: replace sync builtins with atomic builtins
  test/ring_perf: replace sync builtins with atomic builtins
 app/test/test_distributor.c                  | 18 +++++++++-
 app/test/test_distributor_perf.c             |  7 +++-
 app/test/test_ring_perf.c                    | 12 +++++--
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 5 files changed, 81 insertions(+), 10 deletions(-)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v2 0/3] example and test cases optimizations
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 0/3] example and test cases optimizations Phil Yang
@ 2019-03-29 10:56   ` Phil Yang
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-03-29 10:56 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
v1:
Reimplement packet_ordering with __atomic one-way barrier.
v2:
1. Add per lcore statistics for each worker thread, removed __sync
builtins.
2. Reimplement test_distributor with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
3. Reimplement test_ring_perf with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
Phil Yang (3):
  packet_ordering: add statistics for each worker thread
  test/distributor: replace sync builtins with atomic builtins
  test/ring_perf: replace sync builtins with atomic builtins
 app/test/test_distributor.c                  | 18 +++++++++-
 app/test/test_distributor_perf.c             |  7 +++-
 app/test/test_ring_perf.c                    | 12 +++++--
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 5 files changed, 81 insertions(+), 10 deletions(-)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
 
- * [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
  2019-03-28 18:42 ` Thomas Monjalon
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 0/3] example and test cases optimizations Phil Yang
@ 2019-03-29 10:56 ` Phil Yang
  2019-03-29 10:56   ` Phil Yang
  2019-03-29 16:39   ` Pattan, Reshma
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: Phil Yang @ 2019-03-29 10:56 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
The current implementation using '__sync' built-ins to synchronize
statistics within worker threads. '__sync' built-ins functions are
full barriers which will affect the performance, so add a per worker
packets statistics.
Enable by option --insight-worker.
For example:
sudo examples/packet_ordering/arm64-armv8a-linuxapp-gcc/packet_ordering \
-l 112-115 --socket-mem=1024,1024 -n 4 -- -p 0x03 --insight-worker
RX thread stats:
 - Pkts rxd:                            226539223
 - Pkts enqd to workers ring:           226539223
Worker thread stats on core [113]:
 - Pkts deqd from workers ring:         77557888
 - Pkts enqd to tx ring:                77557888
 - Pkts enq to tx failed:               0
Worker thread stats on core [114]:
 - Pkts deqd from workers ring:         148981335
 - Pkts enqd to tx ring:                148981335
 - Pkts enq to tx failed:               0
Worker thread stats:
 - Pkts deqd from workers ring:         226539223
 - Pkts enqd to tx ring:                226539223
 - Pkts enq to tx failed:               0
TX stats:
 - Pkts deqd from tx ring:              226539223
 - Ro Pkts transmitted:                 226539168
 - Ro Pkts tx failed:                   0
 - Pkts transmitted w/o reorder:        0
 - Pkts tx failed w/o reorder:          0
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 2 files changed, 48 insertions(+), 6 deletions(-)
diff --git a/doc/guides/sample_app_ug/packet_ordering.rst b/doc/guides/sample_app_ug/packet_ordering.rst
index 7cfcf3f..9c5ceea 100644
--- a/doc/guides/sample_app_ug/packet_ordering.rst
+++ b/doc/guides/sample_app_ug/packet_ordering.rst
@@ -43,7 +43,7 @@ The application execution command line is:
 
 .. code-block:: console
 
-    ./test-pipeline [EAL options] -- -p PORTMASK [--disable-reorder]
+    ./packet_ordering [EAL options] -- -p PORTMASK [--disable-reorder] [--insight-worker]
 
 The -c EAL CPU_COREMASK option has to contain at least 3 CPU cores.
 The first CPU core in the core mask is the master core and would be assigned to
@@ -56,3 +56,5 @@ then the other pair from 2 to 3 and from 3 to 2, having [0,1] and [2,3] pairs.
 
 The disable-reorder long option does, as its name implies, disable the reordering
 of traffic, which should help evaluate reordering performance impact.
+
+The insight-worker long option enable output the packet statistics of each worker thread.
diff --git a/examples/packet_ordering/main.c b/examples/packet_ordering/main.c
index 149bfdd..8145074 100644
--- a/examples/packet_ordering/main.c
+++ b/examples/packet_ordering/main.c
@@ -31,6 +31,7 @@
 
 unsigned int portmask;
 unsigned int disable_reorder;
+unsigned int insight_worker;
 volatile uint8_t quit_signal;
 
 static struct rte_mempool *mbuf_pool;
@@ -71,6 +72,14 @@ volatile struct app_stats {
 	} tx __rte_cache_aligned;
 } app_stats;
 
+/* per worker lcore stats */
+struct wkr_stats_per {
+		uint64_t deq_pkts;
+		uint64_t enq_pkts;
+		uint64_t enq_failed_pkts;
+} __rte_cache_aligned;
+
+static struct wkr_stats_per wkr_stats[RTE_MAX_LCORE] = {0};
 /**
  * Get the last enabled lcore ID
  *
@@ -152,6 +161,7 @@ parse_args(int argc, char **argv)
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"disable-reorder", 0, 0, 0},
+		{"insight-worker", 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -175,6 +185,11 @@ parse_args(int argc, char **argv)
 				printf("reorder disabled\n");
 				disable_reorder = 1;
 			}
+			if (!strcmp(lgopts[option_index].name,
+						"insight-worker")) {
+				printf("print all worker statistics\n");
+				insight_worker = 1;
+			}
 			break;
 		default:
 			print_usage(prgname);
@@ -319,6 +334,11 @@ print_stats(void)
 {
 	uint16_t i;
 	struct rte_eth_stats eth_stats;
+	unsigned int lcore_id, last_lcore_id, master_lcore_id, end_w_lcore_id;
+
+	last_lcore_id   = get_last_lcore_id();
+	master_lcore_id = rte_get_master_lcore();
+	end_w_lcore_id  = get_previous_lcore_id(last_lcore_id);
 
 	printf("\nRX thread stats:\n");
 	printf(" - Pkts rxd:				%"PRIu64"\n",
@@ -326,6 +346,26 @@ print_stats(void)
 	printf(" - Pkts enqd to workers ring:		%"PRIu64"\n",
 						app_stats.rx.enqueue_pkts);
 
+	for (lcore_id = 0; lcore_id <= end_w_lcore_id; lcore_id++) {
+		if (insight_worker
+			&& rte_lcore_is_enabled(lcore_id)
+			&& lcore_id != master_lcore_id) {
+			printf("\nWorker thread stats on core [%u]:\n",
+					lcore_id);
+			printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].deq_pkts);
+			printf(" - Pkts enqd to tx ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_pkts);
+			printf(" - Pkts enq to tx failed:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_failed_pkts);
+		}
+
+		app_stats.wkr.dequeue_pkts += wkr_stats[lcore_id].deq_pkts;
+		app_stats.wkr.enqueue_pkts += wkr_stats[lcore_id].enq_pkts;
+		app_stats.wkr.enqueue_failed_pkts +=
+			wkr_stats[lcore_id].enq_failed_pkts;
+	}
+
 	printf("\nWorker thread stats:\n");
 	printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
 						app_stats.wkr.dequeue_pkts);
@@ -432,13 +472,14 @@ worker_thread(void *args_ptr)
 	struct rte_mbuf *burst_buffer[MAX_PKTS_BURST] = { NULL };
 	struct rte_ring *ring_in, *ring_out;
 	const unsigned xor_val = (nb_ports > 1);
+	unsigned int core_id = rte_lcore_id();
 
 	args = (struct worker_thread_args *) args_ptr;
 	ring_in  = args->ring_in;
 	ring_out = args->ring_out;
 
 	RTE_LOG(INFO, REORDERAPP, "%s() started on lcore %u\n", __func__,
-							rte_lcore_id());
+							core_id);
 
 	while (!quit_signal) {
 
@@ -448,7 +489,7 @@ worker_thread(void *args_ptr)
 		if (unlikely(burst_size == 0))
 			continue;
 
-		__sync_fetch_and_add(&app_stats.wkr.dequeue_pkts, burst_size);
+		wkr_stats[core_id].deq_pkts += burst_size;
 
 		/* just do some operation on mbuf */
 		for (i = 0; i < burst_size;)
@@ -457,11 +498,10 @@ worker_thread(void *args_ptr)
 		/* enqueue the modified mbufs to workers_to_tx ring */
 		ret = rte_ring_enqueue_burst(ring_out, (void *)burst_buffer,
 				burst_size, NULL);
-		__sync_fetch_and_add(&app_stats.wkr.enqueue_pkts, ret);
+		wkr_stats[core_id].enq_pkts += ret;
 		if (unlikely(ret < burst_size)) {
 			/* Return the mbufs to their respective pool, dropping packets */
-			__sync_fetch_and_add(&app_stats.wkr.enqueue_failed_pkts,
-					(int)burst_size - ret);
+			wkr_stats[core_id].enq_failed_pkts += burst_size - ret;
 			pktmbuf_free_bulk(&burst_buffer[ret], burst_size - ret);
 		}
 	}
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread Phil Yang
@ 2019-03-29 10:56   ` Phil Yang
  2019-03-29 16:39   ` Pattan, Reshma
  1 sibling, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-03-29 10:56 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
The current implementation using '__sync' built-ins to synchronize
statistics within worker threads. '__sync' built-ins functions are
full barriers which will affect the performance, so add a per worker
packets statistics.
Enable by option --insight-worker.
For example:
sudo examples/packet_ordering/arm64-armv8a-linuxapp-gcc/packet_ordering \
-l 112-115 --socket-mem=1024,1024 -n 4 -- -p 0x03 --insight-worker
RX thread stats:
 - Pkts rxd:                            226539223
 - Pkts enqd to workers ring:           226539223
Worker thread stats on core [113]:
 - Pkts deqd from workers ring:         77557888
 - Pkts enqd to tx ring:                77557888
 - Pkts enq to tx failed:               0
Worker thread stats on core [114]:
 - Pkts deqd from workers ring:         148981335
 - Pkts enqd to tx ring:                148981335
 - Pkts enq to tx failed:               0
Worker thread stats:
 - Pkts deqd from workers ring:         226539223
 - Pkts enqd to tx ring:                226539223
 - Pkts enq to tx failed:               0
TX stats:
 - Pkts deqd from tx ring:              226539223
 - Ro Pkts transmitted:                 226539168
 - Ro Pkts tx failed:                   0
 - Pkts transmitted w/o reorder:        0
 - Pkts tx failed w/o reorder:          0
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 2 files changed, 48 insertions(+), 6 deletions(-)
diff --git a/doc/guides/sample_app_ug/packet_ordering.rst b/doc/guides/sample_app_ug/packet_ordering.rst
index 7cfcf3f..9c5ceea 100644
--- a/doc/guides/sample_app_ug/packet_ordering.rst
+++ b/doc/guides/sample_app_ug/packet_ordering.rst
@@ -43,7 +43,7 @@ The application execution command line is:
 
 .. code-block:: console
 
-    ./test-pipeline [EAL options] -- -p PORTMASK [--disable-reorder]
+    ./packet_ordering [EAL options] -- -p PORTMASK [--disable-reorder] [--insight-worker]
 
 The -c EAL CPU_COREMASK option has to contain at least 3 CPU cores.
 The first CPU core in the core mask is the master core and would be assigned to
@@ -56,3 +56,5 @@ then the other pair from 2 to 3 and from 3 to 2, having [0,1] and [2,3] pairs.
 
 The disable-reorder long option does, as its name implies, disable the reordering
 of traffic, which should help evaluate reordering performance impact.
+
+The insight-worker long option enable output the packet statistics of each worker thread.
diff --git a/examples/packet_ordering/main.c b/examples/packet_ordering/main.c
index 149bfdd..8145074 100644
--- a/examples/packet_ordering/main.c
+++ b/examples/packet_ordering/main.c
@@ -31,6 +31,7 @@
 
 unsigned int portmask;
 unsigned int disable_reorder;
+unsigned int insight_worker;
 volatile uint8_t quit_signal;
 
 static struct rte_mempool *mbuf_pool;
@@ -71,6 +72,14 @@ volatile struct app_stats {
 	} tx __rte_cache_aligned;
 } app_stats;
 
+/* per worker lcore stats */
+struct wkr_stats_per {
+		uint64_t deq_pkts;
+		uint64_t enq_pkts;
+		uint64_t enq_failed_pkts;
+} __rte_cache_aligned;
+
+static struct wkr_stats_per wkr_stats[RTE_MAX_LCORE] = {0};
 /**
  * Get the last enabled lcore ID
  *
@@ -152,6 +161,7 @@ parse_args(int argc, char **argv)
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"disable-reorder", 0, 0, 0},
+		{"insight-worker", 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -175,6 +185,11 @@ parse_args(int argc, char **argv)
 				printf("reorder disabled\n");
 				disable_reorder = 1;
 			}
+			if (!strcmp(lgopts[option_index].name,
+						"insight-worker")) {
+				printf("print all worker statistics\n");
+				insight_worker = 1;
+			}
 			break;
 		default:
 			print_usage(prgname);
@@ -319,6 +334,11 @@ print_stats(void)
 {
 	uint16_t i;
 	struct rte_eth_stats eth_stats;
+	unsigned int lcore_id, last_lcore_id, master_lcore_id, end_w_lcore_id;
+
+	last_lcore_id   = get_last_lcore_id();
+	master_lcore_id = rte_get_master_lcore();
+	end_w_lcore_id  = get_previous_lcore_id(last_lcore_id);
 
 	printf("\nRX thread stats:\n");
 	printf(" - Pkts rxd:				%"PRIu64"\n",
@@ -326,6 +346,26 @@ print_stats(void)
 	printf(" - Pkts enqd to workers ring:		%"PRIu64"\n",
 						app_stats.rx.enqueue_pkts);
 
+	for (lcore_id = 0; lcore_id <= end_w_lcore_id; lcore_id++) {
+		if (insight_worker
+			&& rte_lcore_is_enabled(lcore_id)
+			&& lcore_id != master_lcore_id) {
+			printf("\nWorker thread stats on core [%u]:\n",
+					lcore_id);
+			printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].deq_pkts);
+			printf(" - Pkts enqd to tx ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_pkts);
+			printf(" - Pkts enq to tx failed:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_failed_pkts);
+		}
+
+		app_stats.wkr.dequeue_pkts += wkr_stats[lcore_id].deq_pkts;
+		app_stats.wkr.enqueue_pkts += wkr_stats[lcore_id].enq_pkts;
+		app_stats.wkr.enqueue_failed_pkts +=
+			wkr_stats[lcore_id].enq_failed_pkts;
+	}
+
 	printf("\nWorker thread stats:\n");
 	printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
 						app_stats.wkr.dequeue_pkts);
@@ -432,13 +472,14 @@ worker_thread(void *args_ptr)
 	struct rte_mbuf *burst_buffer[MAX_PKTS_BURST] = { NULL };
 	struct rte_ring *ring_in, *ring_out;
 	const unsigned xor_val = (nb_ports > 1);
+	unsigned int core_id = rte_lcore_id();
 
 	args = (struct worker_thread_args *) args_ptr;
 	ring_in  = args->ring_in;
 	ring_out = args->ring_out;
 
 	RTE_LOG(INFO, REORDERAPP, "%s() started on lcore %u\n", __func__,
-							rte_lcore_id());
+							core_id);
 
 	while (!quit_signal) {
 
@@ -448,7 +489,7 @@ worker_thread(void *args_ptr)
 		if (unlikely(burst_size == 0))
 			continue;
 
-		__sync_fetch_and_add(&app_stats.wkr.dequeue_pkts, burst_size);
+		wkr_stats[core_id].deq_pkts += burst_size;
 
 		/* just do some operation on mbuf */
 		for (i = 0; i < burst_size;)
@@ -457,11 +498,10 @@ worker_thread(void *args_ptr)
 		/* enqueue the modified mbufs to workers_to_tx ring */
 		ret = rte_ring_enqueue_burst(ring_out, (void *)burst_buffer,
 				burst_size, NULL);
-		__sync_fetch_and_add(&app_stats.wkr.enqueue_pkts, ret);
+		wkr_stats[core_id].enq_pkts += ret;
 		if (unlikely(ret < burst_size)) {
 			/* Return the mbufs to their respective pool, dropping packets */
-			__sync_fetch_and_add(&app_stats.wkr.enqueue_failed_pkts,
-					(int)burst_size - ret);
+			wkr_stats[core_id].enq_failed_pkts += burst_size - ret;
 			pktmbuf_free_bulk(&burst_buffer[ret], burst_size - ret);
 		}
 	}
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread Phil Yang
  2019-03-29 10:56   ` Phil Yang
@ 2019-03-29 16:39   ` Pattan, Reshma
  2019-03-29 16:39     ` Pattan, Reshma
  2019-03-30 16:55     ` Phil Yang (Arm Technology China)
  1 sibling, 2 replies; 56+ messages in thread
From: Pattan, Reshma @ 2019-03-29 16:39 UTC (permalink / raw)
  To: Phil Yang, dev, thomas; +Cc: Hunt, David, gavin.hu, honnappa.nagarahalli, nd
> -----Original Message-----
> From: Phil Yang [mailto:phil.yang@arm.com]
> 
> The current implementation using '__sync' built-ins to synchronize statistics
> within worker threads. '__sync' built-ins functions are full barriers which will
> affect the performance, so add a per worker packets statistics.
> 
> Enable by option --insight-worker.
> 
I don't feel the need of this new option to print per core stats.  Any reason for this?
Thanks,
Reshma
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-03-29 16:39   ` Pattan, Reshma
@ 2019-03-29 16:39     ` Pattan, Reshma
  2019-03-30 16:55     ` Phil Yang (Arm Technology China)
  1 sibling, 0 replies; 56+ messages in thread
From: Pattan, Reshma @ 2019-03-29 16:39 UTC (permalink / raw)
  To: Phil Yang, dev, thomas; +Cc: Hunt, David, gavin.hu, honnappa.nagarahalli, nd
> -----Original Message-----
> From: Phil Yang [mailto:phil.yang@arm.com]
> 
> The current implementation using '__sync' built-ins to synchronize statistics
> within worker threads. '__sync' built-ins functions are full barriers which will
> affect the performance, so add a per worker packets statistics.
> 
> Enable by option --insight-worker.
> 
I don't feel the need of this new option to print per core stats.  Any reason for this?
Thanks,
Reshma
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-03-29 16:39   ` Pattan, Reshma
  2019-03-29 16:39     ` Pattan, Reshma
@ 2019-03-30 16:55     ` Phil Yang (Arm Technology China)
  2019-03-30 16:55       ` Phil Yang (Arm Technology China)
  2019-04-01 12:58       ` Pattan, Reshma
  1 sibling, 2 replies; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-03-30 16:55 UTC (permalink / raw)
  To: Pattan, Reshma, dev, thomas
  Cc: Hunt, David, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd
> -----Original Message-----
> From: Pattan, Reshma <reshma.pattan@intel.com>
> Sent: Saturday, March 30, 2019 12:40 AM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each worker
> thread
> 
> 
> 
> > -----Original Message-----
> > From: Phil Yang [mailto:phil.yang@arm.com]
> >
> > The current implementation using '__sync' built-ins to synchronize
> > statistics within worker threads. '__sync' built-ins functions are
> > full barriers which will affect the performance, so add a per worker packets
> statistics.
> >
> > Enable by option --insight-worker.
> >
> 
> I don't feel the need of this new option to print per core stats.  Any reason
> for this?
Hi Reshma,
Thanks for your comment. 
The per core stats aims at removing the '__sync' builtin full barrier in the worker thread. 
It records the workload of each core (It shows the bottleneck core as well).  Since the maximum core number may be more than 128, so disable the print in default and add this new option for debugging use. 
Anyway, if you insist there is no need to print that info out, I can remove the option. But I think using per core stats will benefit performance.
Thanks,
Phil
> 
> Thanks,
> Reshma
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-03-30 16:55     ` Phil Yang (Arm Technology China)
@ 2019-03-30 16:55       ` Phil Yang (Arm Technology China)
  2019-04-01 12:58       ` Pattan, Reshma
  1 sibling, 0 replies; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-03-30 16:55 UTC (permalink / raw)
  To: Pattan, Reshma, dev, thomas
  Cc: Hunt, David, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd
> -----Original Message-----
> From: Pattan, Reshma <reshma.pattan@intel.com>
> Sent: Saturday, March 30, 2019 12:40 AM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each worker
> thread
> 
> 
> 
> > -----Original Message-----
> > From: Phil Yang [mailto:phil.yang@arm.com]
> >
> > The current implementation using '__sync' built-ins to synchronize
> > statistics within worker threads. '__sync' built-ins functions are
> > full barriers which will affect the performance, so add a per worker packets
> statistics.
> >
> > Enable by option --insight-worker.
> >
> 
> I don't feel the need of this new option to print per core stats.  Any reason
> for this?
Hi Reshma,
Thanks for your comment. 
The per core stats aims at removing the '__sync' builtin full barrier in the worker thread. 
It records the workload of each core (It shows the bottleneck core as well).  Since the maximum core number may be more than 128, so disable the print in default and add this new option for debugging use. 
Anyway, if you insist there is no need to print that info out, I can remove the option. But I think using per core stats will benefit performance.
Thanks,
Phil
> 
> Thanks,
> Reshma
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-03-30 16:55     ` Phil Yang (Arm Technology China)
  2019-03-30 16:55       ` Phil Yang (Arm Technology China)
@ 2019-04-01 12:58       ` Pattan, Reshma
  2019-04-01 12:58         ` Pattan, Reshma
  2019-04-02  3:33         ` Phil Yang (Arm Technology China)
  1 sibling, 2 replies; 56+ messages in thread
From: Pattan, Reshma @ 2019-04-01 12:58 UTC (permalink / raw)
  To: Phil Yang (Arm Technology China), dev, thomas
  Cc: Hunt, David, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd
> -----Original Message-----
> From: Phil Yang (Arm Technology China) [mailto:Phil.Yang@arm.com]
> Sent: Saturday, March 30, 2019 4:55 PM
> To: Pattan, Reshma <reshma.pattan@intel.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each worker
> thread
> 
> > -----Original Message-----
> > From: Pattan, Reshma <reshma.pattan@intel.com>
> > Sent: Saturday, March 30, 2019 12:40 AM
> > To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>;
> > dev@dpdk.org; thomas@monjalon.net
> > Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology
> > China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each
> > worker thread
> >
> >
> >
> > > -----Original Message-----
> > > From: Phil Yang [mailto:phil.yang@arm.com]
> > >
> > > The current implementation using '__sync' built-ins to synchronize
> > > statistics within worker threads. '__sync' built-ins functions are
> > > full barriers which will affect the performance, so add a per worker
> > > packets
> > statistics.
> > >
> > > Enable by option --insight-worker.
> > >
> >
> > I don't feel the need of this new option to print per core stats.  Any
> > reason for this?
> 
> Hi Reshma,
> 
> Thanks for your comment.
> The per core stats aims at removing the '__sync' builtin full barrier in the worker
> thread.
> It records the workload of each core (It shows the bottleneck core as well).
> Since the maximum core number may be more than 128, so disable the print in
> default and add this new option for debugging use.
> 
Ok fine with me then. 
Thanks,
Reshma
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-04-01 12:58       ` Pattan, Reshma
@ 2019-04-01 12:58         ` Pattan, Reshma
  2019-04-02  3:33         ` Phil Yang (Arm Technology China)
  1 sibling, 0 replies; 56+ messages in thread
From: Pattan, Reshma @ 2019-04-01 12:58 UTC (permalink / raw)
  To: Phil Yang (Arm Technology China), dev, thomas
  Cc: Hunt, David, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd
> -----Original Message-----
> From: Phil Yang (Arm Technology China) [mailto:Phil.Yang@arm.com]
> Sent: Saturday, March 30, 2019 4:55 PM
> To: Pattan, Reshma <reshma.pattan@intel.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each worker
> thread
> 
> > -----Original Message-----
> > From: Pattan, Reshma <reshma.pattan@intel.com>
> > Sent: Saturday, March 30, 2019 12:40 AM
> > To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>;
> > dev@dpdk.org; thomas@monjalon.net
> > Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology
> > China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each
> > worker thread
> >
> >
> >
> > > -----Original Message-----
> > > From: Phil Yang [mailto:phil.yang@arm.com]
> > >
> > > The current implementation using '__sync' built-ins to synchronize
> > > statistics within worker threads. '__sync' built-ins functions are
> > > full barriers which will affect the performance, so add a per worker
> > > packets
> > statistics.
> > >
> > > Enable by option --insight-worker.
> > >
> >
> > I don't feel the need of this new option to print per core stats.  Any
> > reason for this?
> 
> Hi Reshma,
> 
> Thanks for your comment.
> The per core stats aims at removing the '__sync' builtin full barrier in the worker
> thread.
> It records the workload of each core (It shows the bottleneck core as well).
> Since the maximum core number may be more than 128, so disable the print in
> default and add this new option for debugging use.
> 
Ok fine with me then. 
Thanks,
Reshma
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-04-01 12:58       ` Pattan, Reshma
  2019-04-01 12:58         ` Pattan, Reshma
@ 2019-04-02  3:33         ` Phil Yang (Arm Technology China)
  2019-04-02  3:33           ` Phil Yang (Arm Technology China)
  1 sibling, 1 reply; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-04-02  3:33 UTC (permalink / raw)
  To: Pattan, Reshma, dev, thomas
  Cc: Hunt, David, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd, nd
> -----Original Message-----
> From: Pattan, Reshma <reshma.pattan@intel.com>
> Sent: Monday, April 1, 2019 8:58 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each worker
> thread
> 
> 
> 
> > -----Original Message-----
> > From: Phil Yang (Arm Technology China) [mailto:Phil.Yang@arm.com]
> > Sent: Saturday, March 30, 2019 4:55 PM
> > To: Pattan, Reshma <reshma.pattan@intel.com>; dev@dpdk.org;
> > thomas@monjalon.net
> > Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology
> > China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> > Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each
> > worker thread
> >
> > > -----Original Message-----
> > > From: Pattan, Reshma <reshma.pattan@intel.com>
> > > Sent: Saturday, March 30, 2019 12:40 AM
> > > To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>;
> > > dev@dpdk.org; thomas@monjalon.net
> > > Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology
> > > China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > > Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each
> > > worker thread
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Phil Yang [mailto:phil.yang@arm.com]
> > > >
> > > > The current implementation using '__sync' built-ins to synchronize
> > > > statistics within worker threads. '__sync' built-ins functions are
> > > > full barriers which will affect the performance, so add a per
> > > > worker packets
> > > statistics.
> > > >
> > > > Enable by option --insight-worker.
> > > >
> > >
> > > I don't feel the need of this new option to print per core stats.
> > > Any reason for this?
> >
> > Hi Reshma,
> >
> > Thanks for your comment.
> > The per core stats aims at removing the '__sync' builtin full barrier
> > in the worker thread.
> > It records the workload of each core (It shows the bottleneck core as well).
> > Since the maximum core number may be more than 128, so disable the
> > print in default and add this new option for debugging use.
> >
> 
> Ok fine with me then.
> 
> Thanks,
> Reshma
Thanks for your review.
Thanks,
Phil
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread
  2019-04-02  3:33         ` Phil Yang (Arm Technology China)
@ 2019-04-02  3:33           ` Phil Yang (Arm Technology China)
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-04-02  3:33 UTC (permalink / raw)
  To: Pattan, Reshma, dev, thomas
  Cc: Hunt, David, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd, nd
> -----Original Message-----
> From: Pattan, Reshma <reshma.pattan@intel.com>
> Sent: Monday, April 1, 2019 8:58 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each worker
> thread
> 
> 
> 
> > -----Original Message-----
> > From: Phil Yang (Arm Technology China) [mailto:Phil.Yang@arm.com]
> > Sent: Saturday, March 30, 2019 4:55 PM
> > To: Pattan, Reshma <reshma.pattan@intel.com>; dev@dpdk.org;
> > thomas@monjalon.net
> > Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology
> > China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> > Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each
> > worker thread
> >
> > > -----Original Message-----
> > > From: Pattan, Reshma <reshma.pattan@intel.com>
> > > Sent: Saturday, March 30, 2019 12:40 AM
> > > To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>;
> > > dev@dpdk.org; thomas@monjalon.net
> > > Cc: Hunt, David <david.hunt@intel.com>; Gavin Hu (Arm Technology
> > > China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > > Subject: RE: [PATCH v2 1/3] packet_ordering: add statistics for each
> > > worker thread
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Phil Yang [mailto:phil.yang@arm.com]
> > > >
> > > > The current implementation using '__sync' built-ins to synchronize
> > > > statistics within worker threads. '__sync' built-ins functions are
> > > > full barriers which will affect the performance, so add a per
> > > > worker packets
> > > statistics.
> > > >
> > > > Enable by option --insight-worker.
> > > >
> > >
> > > I don't feel the need of this new option to print per core stats.
> > > Any reason for this?
> >
> > Hi Reshma,
> >
> > Thanks for your comment.
> > The per core stats aims at removing the '__sync' builtin full barrier
> > in the worker thread.
> > It records the workload of each core (It shows the bottleneck core as well).
> > Since the maximum core number may be more than 128, so disable the
> > print in default and add this new option for debugging use.
> >
> 
> Ok fine with me then.
> 
> Thanks,
> Reshma
Thanks for your review.
Thanks,
Phil
^ permalink raw reply	[flat|nested] 56+ messages in thread 
 
 
 
 
 
- * [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
                   ` (2 preceding siblings ...)
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 1/3] packet_ordering: add statistics for each worker thread Phil Yang
@ 2019-03-29 10:56 ` Phil Yang
  2019-03-29 10:56   ` Phil Yang
  2019-04-01 16:24   ` Honnappa Nagarahalli
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 3/3] test/ring_perf: " Phil Yang
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: Phil Yang @ 2019-03-29 10:56 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
-n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest
*** distributor_perf_autotest without this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1519202730 ticks
Ticks per iteration = 45
*** distributor_perf_autotest with this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1251715496 ticks
Ticks per iteration = 37
Less ticks needed for the cache line switch test. It got 17% of
performance improvement.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_distributor.c      | 18 +++++++++++++++++-
 app/test/test_distributor_perf.c |  7 ++++++-
 2 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c
index 98919ec..ddab08d 100644
--- a/app/test/test_distributor.c
+++ b/app/test/test_distributor.c
@@ -62,9 +62,14 @@ handle_work(void *arg)
 	struct worker_params *wp = arg;
 	struct rte_distributor *db = wp->dist;
 	unsigned int count = 0, num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
 	int i;
 
+#ifdef RTE_USE_C11_MEM_MODEL
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
+#else
+	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+#endif
+
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
 	num = rte_distributor_get_pkt(db, id, buf, buf, num);
@@ -270,7 +275,12 @@ handle_work_with_free_mbufs(void *arg)
 	unsigned int count = 0;
 	unsigned int i;
 	unsigned int num = 0;
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
+#else
 	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+#endif
 
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
@@ -343,7 +353,13 @@ handle_work_for_shutdown_test(void *arg)
 	unsigned int total = 0;
 	unsigned int i;
 	unsigned int returned = 0;
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
+			__ATOMIC_RELAXED);
+#else
 	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+#endif
 
 	num = rte_distributor_get_pkt(d, id, buf, buf, num);
 
diff --git a/app/test/test_distributor_perf.c b/app/test/test_distributor_perf.c
index edf1998..9367460 100644
--- a/app/test/test_distributor_perf.c
+++ b/app/test/test_distributor_perf.c
@@ -111,9 +111,14 @@ handle_work(void *arg)
 	unsigned int count = 0;
 	unsigned int num = 0;
 	int i;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
 	struct rte_mbuf *buf[8] __rte_cache_aligned;
 
+#ifdef RTE_USE_C11_MEM_MODEL
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
+#else
+	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+#endif
+
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
 
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
@ 2019-03-29 10:56   ` Phil Yang
  2019-04-01 16:24   ` Honnappa Nagarahalli
  1 sibling, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-03-29 10:56 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
-n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest
*** distributor_perf_autotest without this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1519202730 ticks
Ticks per iteration = 45
*** distributor_perf_autotest with this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1251715496 ticks
Ticks per iteration = 37
Less ticks needed for the cache line switch test. It got 17% of
performance improvement.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_distributor.c      | 18 +++++++++++++++++-
 app/test/test_distributor_perf.c |  7 ++++++-
 2 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c
index 98919ec..ddab08d 100644
--- a/app/test/test_distributor.c
+++ b/app/test/test_distributor.c
@@ -62,9 +62,14 @@ handle_work(void *arg)
 	struct worker_params *wp = arg;
 	struct rte_distributor *db = wp->dist;
 	unsigned int count = 0, num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
 	int i;
 
+#ifdef RTE_USE_C11_MEM_MODEL
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
+#else
+	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+#endif
+
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
 	num = rte_distributor_get_pkt(db, id, buf, buf, num);
@@ -270,7 +275,12 @@ handle_work_with_free_mbufs(void *arg)
 	unsigned int count = 0;
 	unsigned int i;
 	unsigned int num = 0;
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
+#else
 	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+#endif
 
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
@@ -343,7 +353,13 @@ handle_work_for_shutdown_test(void *arg)
 	unsigned int total = 0;
 	unsigned int i;
 	unsigned int returned = 0;
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
+			__ATOMIC_RELAXED);
+#else
 	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+#endif
 
 	num = rte_distributor_get_pkt(d, id, buf, buf, num);
 
diff --git a/app/test/test_distributor_perf.c b/app/test/test_distributor_perf.c
index edf1998..9367460 100644
--- a/app/test/test_distributor_perf.c
+++ b/app/test/test_distributor_perf.c
@@ -111,9 +111,14 @@ handle_work(void *arg)
 	unsigned int count = 0;
 	unsigned int num = 0;
 	int i;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
 	struct rte_mbuf *buf[8] __rte_cache_aligned;
 
+#ifdef RTE_USE_C11_MEM_MODEL
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
+#else
+	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+#endif
+
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
 
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
  2019-03-29 10:56   ` Phil Yang
@ 2019-04-01 16:24   ` Honnappa Nagarahalli
  2019-04-01 16:24     ` Honnappa Nagarahalli
  2019-04-02  3:43     ` Phil Yang (Arm Technology China)
  1 sibling, 2 replies; 56+ messages in thread
From: Honnappa Nagarahalli @ 2019-04-01 16:24 UTC (permalink / raw)
  To: Phil Yang (Arm Technology China), dev, thomas
  Cc: david.hunt, reshma.pattan, Gavin Hu (Arm Technology China),
	Phil Yang (Arm Technology China),
	nd, Honnappa Nagarahalli, nd
> 
> diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c index
> 98919ec..ddab08d 100644
> --- a/app/test/test_distributor.c
> +++ b/app/test/test_distributor.c
> @@ -62,9 +62,14 @@ handle_work(void *arg)
>  	struct worker_params *wp = arg;
>  	struct rte_distributor *db = wp->dist;
>  	unsigned int count = 0, num = 0;
> -	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
>  	int i;
> 
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> +__ATOMIC_RELAXED); #else
> +	unsigned int id = __sync_fetch_and_add(&worker_idx, 1); #endif
> +
I suggest we remove the conditional compilation and just keep the __atomic_xxx calls as this is test code. More over the distributor library does not have a C11 version of the library (assuming that using __atomic_xxx does not impact performance on other platforms negatively). This applies to other instances in this patch.
>  	for (i = 0; i < 8; i++)
>  		buf[i] = NULL;
>  	num = rte_distributor_get_pkt(db, id, buf, buf, num); @@ -270,7
> +275,12 @@ handle_work_with_free_mbufs(void *arg)
>  	unsigned int count = 0;
>  	unsigned int i;
>  	unsigned int num = 0;
> +
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> +__ATOMIC_RELAXED); #else
>  	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> +#endif
> 
>  	for (i = 0; i < 8; i++)
>  		buf[i] = NULL;
> @@ -343,7 +353,13 @@ handle_work_for_shutdown_test(void *arg)
>  	unsigned int total = 0;
>  	unsigned int i;
>  	unsigned int returned = 0;
> +
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> +			__ATOMIC_RELAXED);
> +#else
>  	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> +#endif
> 
>  	num = rte_distributor_get_pkt(d, id, buf, buf, num);
> 
> diff --git a/app/test/test_distributor_perf.c b/app/test/test_distributor_perf.c
> index edf1998..9367460 100644
> --- a/app/test/test_distributor_perf.c
> +++ b/app/test/test_distributor_perf.c
> @@ -111,9 +111,14 @@ handle_work(void *arg)
>  	unsigned int count = 0;
>  	unsigned int num = 0;
>  	int i;
> -	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
>  	struct rte_mbuf *buf[8] __rte_cache_aligned;
> 
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> +__ATOMIC_RELAXED); #else
> +	unsigned int id = __sync_fetch_and_add(&worker_idx, 1); #endif
> +
>  	for (i = 0; i < 8; i++)
>  		buf[i] = NULL;
> 
> --
> 2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-01 16:24   ` Honnappa Nagarahalli
@ 2019-04-01 16:24     ` Honnappa Nagarahalli
  2019-04-02  3:43     ` Phil Yang (Arm Technology China)
  1 sibling, 0 replies; 56+ messages in thread
From: Honnappa Nagarahalli @ 2019-04-01 16:24 UTC (permalink / raw)
  To: Phil Yang (Arm Technology China), dev, thomas
  Cc: david.hunt, reshma.pattan, Gavin Hu (Arm Technology China),
	Phil Yang (Arm Technology China),
	nd, Honnappa Nagarahalli, nd
> 
> diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c index
> 98919ec..ddab08d 100644
> --- a/app/test/test_distributor.c
> +++ b/app/test/test_distributor.c
> @@ -62,9 +62,14 @@ handle_work(void *arg)
>  	struct worker_params *wp = arg;
>  	struct rte_distributor *db = wp->dist;
>  	unsigned int count = 0, num = 0;
> -	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
>  	int i;
> 
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> +__ATOMIC_RELAXED); #else
> +	unsigned int id = __sync_fetch_and_add(&worker_idx, 1); #endif
> +
I suggest we remove the conditional compilation and just keep the __atomic_xxx calls as this is test code. More over the distributor library does not have a C11 version of the library (assuming that using __atomic_xxx does not impact performance on other platforms negatively). This applies to other instances in this patch.
>  	for (i = 0; i < 8; i++)
>  		buf[i] = NULL;
>  	num = rte_distributor_get_pkt(db, id, buf, buf, num); @@ -270,7
> +275,12 @@ handle_work_with_free_mbufs(void *arg)
>  	unsigned int count = 0;
>  	unsigned int i;
>  	unsigned int num = 0;
> +
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> +__ATOMIC_RELAXED); #else
>  	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> +#endif
> 
>  	for (i = 0; i < 8; i++)
>  		buf[i] = NULL;
> @@ -343,7 +353,13 @@ handle_work_for_shutdown_test(void *arg)
>  	unsigned int total = 0;
>  	unsigned int i;
>  	unsigned int returned = 0;
> +
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> +			__ATOMIC_RELAXED);
> +#else
>  	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> +#endif
> 
>  	num = rte_distributor_get_pkt(d, id, buf, buf, num);
> 
> diff --git a/app/test/test_distributor_perf.c b/app/test/test_distributor_perf.c
> index edf1998..9367460 100644
> --- a/app/test/test_distributor_perf.c
> +++ b/app/test/test_distributor_perf.c
> @@ -111,9 +111,14 @@ handle_work(void *arg)
>  	unsigned int count = 0;
>  	unsigned int num = 0;
>  	int i;
> -	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
>  	struct rte_mbuf *buf[8] __rte_cache_aligned;
> 
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> +__ATOMIC_RELAXED); #else
> +	unsigned int id = __sync_fetch_and_add(&worker_idx, 1); #endif
> +
>  	for (i = 0; i < 8; i++)
>  		buf[i] = NULL;
> 
> --
> 2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-01 16:24   ` Honnappa Nagarahalli
  2019-04-01 16:24     ` Honnappa Nagarahalli
@ 2019-04-02  3:43     ` Phil Yang (Arm Technology China)
  2019-04-02  3:43       ` Phil Yang (Arm Technology China)
  1 sibling, 1 reply; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-04-02  3:43 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, thomas
  Cc: david.hunt, reshma.pattan, Gavin Hu (Arm Technology China), nd, nd, nd
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Tuesday, April 2, 2019 12:24 AM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: david.hunt@intel.com; reshma.pattan@intel.com; Gavin Hu (Arm
> Technology China) <Gavin.Hu@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 2/3] test/distributor: replace sync builtins with atomic
> builtins
> 
> >
> > diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c
> > index 98919ec..ddab08d 100644
> > --- a/app/test/test_distributor.c
> > +++ b/app/test/test_distributor.c
> > @@ -62,9 +62,14 @@ handle_work(void *arg)
> >  	struct worker_params *wp = arg;
> >  	struct rte_distributor *db = wp->dist;
> >  	unsigned int count = 0, num = 0;
> > -	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> >  	int i;
> >
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> > +__ATOMIC_RELAXED); #else
> > +	unsigned int id = __sync_fetch_and_add(&worker_idx, 1); #endif
> > +
> I suggest we remove the conditional compilation and just keep the
> __atomic_xxx calls as this is test code. More over the distributor library does
> not have a C11 version of the library (assuming that using __atomic_xxx does
> not impact performance on other platforms negatively). This applies to other
> instances in this patch.
Hi Honnappa,
Thanks for your comments.
Agree. I don't think __atomic_xxx will cause performance degradation on other platforms.  Remove the conditional compilation will make the code more clear.
Thanks,
Phil Yang
> 
> >  	for (i = 0; i < 8; i++)
> >  		buf[i] = NULL;
> >  	num = rte_distributor_get_pkt(db, id, buf, buf, num); @@ -270,7
> > +275,12 @@ handle_work_with_free_mbufs(void *arg)
> >  	unsigned int count = 0;
> >  	unsigned int i;
> >  	unsigned int num = 0;
> > +
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> > +__ATOMIC_RELAXED); #else
> >  	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> > +#endif
> >
> >  	for (i = 0; i < 8; i++)
> >  		buf[i] = NULL;
> > @@ -343,7 +353,13 @@ handle_work_for_shutdown_test(void *arg)
> >  	unsigned int total = 0;
> >  	unsigned int i;
> >  	unsigned int returned = 0;
> > +
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> > +			__ATOMIC_RELAXED);
> > +#else
> >  	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> > +#endif
> >
> >  	num = rte_distributor_get_pkt(d, id, buf, buf, num);
> >
> > diff --git a/app/test/test_distributor_perf.c
> > b/app/test/test_distributor_perf.c
> > index edf1998..9367460 100644
> > --- a/app/test/test_distributor_perf.c
> > +++ b/app/test/test_distributor_perf.c
> > @@ -111,9 +111,14 @@ handle_work(void *arg)
> >  	unsigned int count = 0;
> >  	unsigned int num = 0;
> >  	int i;
> > -	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> >  	struct rte_mbuf *buf[8] __rte_cache_aligned;
> >
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> > +__ATOMIC_RELAXED); #else
> > +	unsigned int id = __sync_fetch_and_add(&worker_idx, 1); #endif
> > +
> >  	for (i = 0; i < 8; i++)
> >  		buf[i] = NULL;
> >
> > --
> > 2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-02  3:43     ` Phil Yang (Arm Technology China)
@ 2019-04-02  3:43       ` Phil Yang (Arm Technology China)
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-04-02  3:43 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev, thomas
  Cc: david.hunt, reshma.pattan, Gavin Hu (Arm Technology China), nd, nd, nd
> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Tuesday, April 2, 2019 12:24 AM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: david.hunt@intel.com; reshma.pattan@intel.com; Gavin Hu (Arm
> Technology China) <Gavin.Hu@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v2 2/3] test/distributor: replace sync builtins with atomic
> builtins
> 
> >
> > diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c
> > index 98919ec..ddab08d 100644
> > --- a/app/test/test_distributor.c
> > +++ b/app/test/test_distributor.c
> > @@ -62,9 +62,14 @@ handle_work(void *arg)
> >  	struct worker_params *wp = arg;
> >  	struct rte_distributor *db = wp->dist;
> >  	unsigned int count = 0, num = 0;
> > -	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> >  	int i;
> >
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> > +__ATOMIC_RELAXED); #else
> > +	unsigned int id = __sync_fetch_and_add(&worker_idx, 1); #endif
> > +
> I suggest we remove the conditional compilation and just keep the
> __atomic_xxx calls as this is test code. More over the distributor library does
> not have a C11 version of the library (assuming that using __atomic_xxx does
> not impact performance on other platforms negatively). This applies to other
> instances in this patch.
Hi Honnappa,
Thanks for your comments.
Agree. I don't think __atomic_xxx will cause performance degradation on other platforms.  Remove the conditional compilation will make the code more clear.
Thanks,
Phil Yang
> 
> >  	for (i = 0; i < 8; i++)
> >  		buf[i] = NULL;
> >  	num = rte_distributor_get_pkt(db, id, buf, buf, num); @@ -270,7
> > +275,12 @@ handle_work_with_free_mbufs(void *arg)
> >  	unsigned int count = 0;
> >  	unsigned int i;
> >  	unsigned int num = 0;
> > +
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> > +__ATOMIC_RELAXED); #else
> >  	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> > +#endif
> >
> >  	for (i = 0; i < 8; i++)
> >  		buf[i] = NULL;
> > @@ -343,7 +353,13 @@ handle_work_for_shutdown_test(void *arg)
> >  	unsigned int total = 0;
> >  	unsigned int i;
> >  	unsigned int returned = 0;
> > +
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> > +			__ATOMIC_RELAXED);
> > +#else
> >  	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> > +#endif
> >
> >  	num = rte_distributor_get_pkt(d, id, buf, buf, num);
> >
> > diff --git a/app/test/test_distributor_perf.c
> > b/app/test/test_distributor_perf.c
> > index edf1998..9367460 100644
> > --- a/app/test/test_distributor_perf.c
> > +++ b/app/test/test_distributor_perf.c
> > @@ -111,9 +111,14 @@ handle_work(void *arg)
> >  	unsigned int count = 0;
> >  	unsigned int num = 0;
> >  	int i;
> > -	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
> >  	struct rte_mbuf *buf[8] __rte_cache_aligned;
> >
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +	unsigned int id = __atomic_fetch_add(&worker_idx, 1,
> > +__ATOMIC_RELAXED); #else
> > +	unsigned int id = __sync_fetch_and_add(&worker_idx, 1); #endif
> > +
> >  	for (i = 0; i < 8; i++)
> >  		buf[i] = NULL;
> >
> > --
> > 2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
 
 
 
- * [dpdk-dev] [PATCH v2 3/3] test/ring_perf: replace sync builtins with atomic builtins
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
                   ` (3 preceding siblings ...)
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
@ 2019-03-29 10:56 ` Phil Yang
  2019-03-29 10:56   ` Phil Yang
  2019-04-01 16:24   ` Honnappa Nagarahalli
  2019-04-03  6:59 ` [dpdk-dev] [PATCH v3 0/3] example and test cases optimizations Phil Yang
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: Phil Yang @ 2019-03-29 10:56 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -c 0x7fffffe \
-n 4 --socket-mem=1024,0 --file-prefix=~ -- -i
RTE>>ring_perf_autotest
*** ring_perf_autotest without this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.22
MP/MC bulk enq/dequeue (size: 8): 11.50
SP/SC bulk enq/dequeue (size: 32): 1.85
MP/MC bulk enq/dequeue (size: 32): 2.66
*** ring_perf_autotest with this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.13
MP/MC bulk enq/dequeue (size: 8): 9.83
SP/SC bulk enq/dequeue (size: 32): 1.96
MP/MC bulk enq/dequeue (size: 32): 2.30
So for the ring performance test, this patch improved 11% of ring
operations performance.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_ring_perf.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c
index ebb3939..e851c1a 100644
--- a/app/test/test_ring_perf.c
+++ b/app/test/test_ring_perf.c
@@ -160,7 +160,11 @@ enqueue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
@@ -196,7 +200,11 @@ dequeue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v2 3/3] test/ring_perf: replace sync builtins with atomic builtins
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 3/3] test/ring_perf: " Phil Yang
@ 2019-03-29 10:56   ` Phil Yang
  2019-04-01 16:24   ` Honnappa Nagarahalli
  1 sibling, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-03-29 10:56 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -c 0x7fffffe \
-n 4 --socket-mem=1024,0 --file-prefix=~ -- -i
RTE>>ring_perf_autotest
*** ring_perf_autotest without this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.22
MP/MC bulk enq/dequeue (size: 8): 11.50
SP/SC bulk enq/dequeue (size: 32): 1.85
MP/MC bulk enq/dequeue (size: 32): 2.66
*** ring_perf_autotest with this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.13
MP/MC bulk enq/dequeue (size: 8): 9.83
SP/SC bulk enq/dequeue (size: 32): 1.96
MP/MC bulk enq/dequeue (size: 32): 2.30
So for the ring performance test, this patch improved 11% of ring
operations performance.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_ring_perf.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c
index ebb3939..e851c1a 100644
--- a/app/test/test_ring_perf.c
+++ b/app/test/test_ring_perf.c
@@ -160,7 +160,11 @@ enqueue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
@@ -196,7 +200,11 @@ dequeue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH v2 3/3] test/ring_perf: replace sync builtins with atomic builtins
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 3/3] test/ring_perf: " Phil Yang
  2019-03-29 10:56   ` Phil Yang
@ 2019-04-01 16:24   ` Honnappa Nagarahalli
  2019-04-01 16:24     ` Honnappa Nagarahalli
  1 sibling, 1 reply; 56+ messages in thread
From: Honnappa Nagarahalli @ 2019-04-01 16:24 UTC (permalink / raw)
  To: Phil Yang (Arm Technology China), dev, thomas
  Cc: david.hunt, reshma.pattan, Gavin Hu (Arm Technology China),
	Phil Yang (Arm Technology China),
	nd, Honnappa Nagarahalli, olivier.matz, nd
<snip>
> diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c index
> ebb3939..e851c1a 100644
> --- a/app/test/test_ring_perf.c
> +++ b/app/test/test_ring_perf.c
> @@ -160,7 +160,11 @@ enqueue_bulk(void *p)
>  	unsigned i;
>  	void *burst[MAX_BURST] = {0};
> 
> -	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
> #else
> +	if (__sync_add_and_fetch(&lcore_count, 1) != 2) #endif
>  		while(lcore_count != 2)
>  			rte_pause();
Since, rte_ring library has both C11 and non-C11 implementations, conditional compilation should be fine here.
> 
> @@ -196,7 +200,11 @@ dequeue_bulk(void *p)
>  	unsigned i;
>  	void *burst[MAX_BURST] = {0};
> 
> -	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
> #else
> +	if (__sync_add_and_fetch(&lcore_count, 1) != 2) #endif
>  		while(lcore_count != 2)
>  			rte_pause();
> 
> --
> 2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH v2 3/3] test/ring_perf: replace sync builtins with atomic builtins
  2019-04-01 16:24   ` Honnappa Nagarahalli
@ 2019-04-01 16:24     ` Honnappa Nagarahalli
  0 siblings, 0 replies; 56+ messages in thread
From: Honnappa Nagarahalli @ 2019-04-01 16:24 UTC (permalink / raw)
  To: Phil Yang (Arm Technology China), dev, thomas
  Cc: david.hunt, reshma.pattan, Gavin Hu (Arm Technology China),
	Phil Yang (Arm Technology China),
	nd, Honnappa Nagarahalli, olivier.matz, nd
<snip>
> diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c index
> ebb3939..e851c1a 100644
> --- a/app/test/test_ring_perf.c
> +++ b/app/test/test_ring_perf.c
> @@ -160,7 +160,11 @@ enqueue_bulk(void *p)
>  	unsigned i;
>  	void *burst[MAX_BURST] = {0};
> 
> -	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
> #else
> +	if (__sync_add_and_fetch(&lcore_count, 1) != 2) #endif
>  		while(lcore_count != 2)
>  			rte_pause();
Since, rte_ring library has both C11 and non-C11 implementations, conditional compilation should be fine here.
> 
> @@ -196,7 +200,11 @@ dequeue_bulk(void *p)
>  	unsigned i;
>  	void *burst[MAX_BURST] = {0};
> 
> -	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
> +#ifdef RTE_USE_C11_MEM_MODEL
> +	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
> #else
> +	if (__sync_add_and_fetch(&lcore_count, 1) != 2) #endif
>  		while(lcore_count != 2)
>  			rte_pause();
> 
> --
> 2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
 
 
- * [dpdk-dev] [PATCH v3 0/3] example and test cases optimizations
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
                   ` (4 preceding siblings ...)
  2019-03-29 10:56 ` [dpdk-dev] [PATCH v2 3/3] test/ring_perf: " Phil Yang
@ 2019-04-03  6:59 ` Phil Yang
  2019-04-03  6:59   ` Phil Yang
                     ` (3 more replies)
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations Phil Yang
                   ` (3 subsequent siblings)
  9 siblings, 4 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-03  6:59 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
v3:
test_distributor: Remove the conditional compilation and just keep the
__atomic_xxx calls. (Honnappa Nagarahalli)
v2:
1. Add per lcore statistics for each worker thread, removed __sync
builtins.
2. Reimplement test_distributor with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
3. Reimplement test_ring_perf with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
v1:
Reimplement packet_ordering with __atomic one-way barrier.
Phil Yang (3):
  packet_ordering: add statistics for each worker thread
  test/distributor: replace sync builtins with atomic builtins
  test/ring_perf: replace sync builtins with atomic builtins
 app/test/test_distributor.c                  |  7 ++--
 app/test/test_distributor_perf.c             |  2 +-
 app/test/test_ring_perf.c                    | 12 +++++--
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 5 files changed, 63 insertions(+), 12 deletions(-)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v3 0/3] example and test cases optimizations
  2019-04-03  6:59 ` [dpdk-dev] [PATCH v3 0/3] example and test cases optimizations Phil Yang
@ 2019-04-03  6:59   ` Phil Yang
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread Phil Yang
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-03  6:59 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
v3:
test_distributor: Remove the conditional compilation and just keep the
__atomic_xxx calls. (Honnappa Nagarahalli)
v2:
1. Add per lcore statistics for each worker thread, removed __sync
builtins.
2. Reimplement test_distributor with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
3. Reimplement test_ring_perf with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
v1:
Reimplement packet_ordering with __atomic one-way barrier.
Phil Yang (3):
  packet_ordering: add statistics for each worker thread
  test/distributor: replace sync builtins with atomic builtins
  test/ring_perf: replace sync builtins with atomic builtins
 app/test/test_distributor.c                  |  7 ++--
 app/test/test_distributor_perf.c             |  2 +-
 app/test/test_ring_perf.c                    | 12 +++++--
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 5 files changed, 63 insertions(+), 12 deletions(-)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread
  2019-04-03  6:59 ` [dpdk-dev] [PATCH v3 0/3] example and test cases optimizations Phil Yang
  2019-04-03  6:59   ` Phil Yang
@ 2019-04-03  6:59   ` Phil Yang
  2019-04-03  6:59     ` Phil Yang
  2019-04-04 23:24     ` Thomas Monjalon
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 3/3] test/ring_perf: " Phil Yang
  3 siblings, 2 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-03  6:59 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
The current implementation using the '__sync' built-ins to synchronize
statistics within worker threads. The '__sync' built-ins functions are
full barriers which will affect the performance, so add a per worker
packets statistics to remove the synchronisation between worker threads.
Since the maximum core number can get to 256, so disable the per core
stats print in default and add the --insight-worker option to enable it.
For example:
sudo examples/packet_ordering/arm64-armv8a-linuxapp-gcc/packet_ordering \
-l 112-115 --socket-mem=1024,1024 -n 4 -- -p 0x03 --insight-worker
RX thread stats:
 - Pkts rxd:                            226539223
 - Pkts enqd to workers ring:           226539223
Worker thread stats on core [113]:
 - Pkts deqd from workers ring:         77557888
 - Pkts enqd to tx ring:                77557888
 - Pkts enq to tx failed:               0
Worker thread stats on core [114]:
 - Pkts deqd from workers ring:         148981335
 - Pkts enqd to tx ring:                148981335
 - Pkts enq to tx failed:               0
Worker thread stats:
 - Pkts deqd from workers ring:         226539223
 - Pkts enqd to tx ring:                226539223
 - Pkts enq to tx failed:               0
TX stats:
 - Pkts deqd from tx ring:              226539223
 - Ro Pkts transmitted:                 226539168
 - Ro Pkts tx failed:                   0
 - Pkts transmitted w/o reorder:        0
 - Pkts tx failed w/o reorder:          0
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 2 files changed, 48 insertions(+), 6 deletions(-)
diff --git a/doc/guides/sample_app_ug/packet_ordering.rst b/doc/guides/sample_app_ug/packet_ordering.rst
index 7cfcf3f..1c8ee5d 100644
--- a/doc/guides/sample_app_ug/packet_ordering.rst
+++ b/doc/guides/sample_app_ug/packet_ordering.rst
@@ -43,7 +43,7 @@ The application execution command line is:
 
 .. code-block:: console
 
-    ./test-pipeline [EAL options] -- -p PORTMASK [--disable-reorder]
+    ./packet_ordering [EAL options] -- -p PORTMASK [--disable-reorder] [--insight-worker]
 
 The -c EAL CPU_COREMASK option has to contain at least 3 CPU cores.
 The first CPU core in the core mask is the master core and would be assigned to
@@ -56,3 +56,5 @@ then the other pair from 2 to 3 and from 3 to 2, having [0,1] and [2,3] pairs.
 
 The disable-reorder long option does, as its name implies, disable the reordering
 of traffic, which should help evaluate reordering performance impact.
+
+The insight-worker long option enables output the packet statistics of each worker thread.
diff --git a/examples/packet_ordering/main.c b/examples/packet_ordering/main.c
index 149bfdd..8145074 100644
--- a/examples/packet_ordering/main.c
+++ b/examples/packet_ordering/main.c
@@ -31,6 +31,7 @@
 
 unsigned int portmask;
 unsigned int disable_reorder;
+unsigned int insight_worker;
 volatile uint8_t quit_signal;
 
 static struct rte_mempool *mbuf_pool;
@@ -71,6 +72,14 @@ volatile struct app_stats {
 	} tx __rte_cache_aligned;
 } app_stats;
 
+/* per worker lcore stats */
+struct wkr_stats_per {
+		uint64_t deq_pkts;
+		uint64_t enq_pkts;
+		uint64_t enq_failed_pkts;
+} __rte_cache_aligned;
+
+static struct wkr_stats_per wkr_stats[RTE_MAX_LCORE] = {0};
 /**
  * Get the last enabled lcore ID
  *
@@ -152,6 +161,7 @@ parse_args(int argc, char **argv)
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"disable-reorder", 0, 0, 0},
+		{"insight-worker", 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -175,6 +185,11 @@ parse_args(int argc, char **argv)
 				printf("reorder disabled\n");
 				disable_reorder = 1;
 			}
+			if (!strcmp(lgopts[option_index].name,
+						"insight-worker")) {
+				printf("print all worker statistics\n");
+				insight_worker = 1;
+			}
 			break;
 		default:
 			print_usage(prgname);
@@ -319,6 +334,11 @@ print_stats(void)
 {
 	uint16_t i;
 	struct rte_eth_stats eth_stats;
+	unsigned int lcore_id, last_lcore_id, master_lcore_id, end_w_lcore_id;
+
+	last_lcore_id   = get_last_lcore_id();
+	master_lcore_id = rte_get_master_lcore();
+	end_w_lcore_id  = get_previous_lcore_id(last_lcore_id);
 
 	printf("\nRX thread stats:\n");
 	printf(" - Pkts rxd:				%"PRIu64"\n",
@@ -326,6 +346,26 @@ print_stats(void)
 	printf(" - Pkts enqd to workers ring:		%"PRIu64"\n",
 						app_stats.rx.enqueue_pkts);
 
+	for (lcore_id = 0; lcore_id <= end_w_lcore_id; lcore_id++) {
+		if (insight_worker
+			&& rte_lcore_is_enabled(lcore_id)
+			&& lcore_id != master_lcore_id) {
+			printf("\nWorker thread stats on core [%u]:\n",
+					lcore_id);
+			printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].deq_pkts);
+			printf(" - Pkts enqd to tx ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_pkts);
+			printf(" - Pkts enq to tx failed:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_failed_pkts);
+		}
+
+		app_stats.wkr.dequeue_pkts += wkr_stats[lcore_id].deq_pkts;
+		app_stats.wkr.enqueue_pkts += wkr_stats[lcore_id].enq_pkts;
+		app_stats.wkr.enqueue_failed_pkts +=
+			wkr_stats[lcore_id].enq_failed_pkts;
+	}
+
 	printf("\nWorker thread stats:\n");
 	printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
 						app_stats.wkr.dequeue_pkts);
@@ -432,13 +472,14 @@ worker_thread(void *args_ptr)
 	struct rte_mbuf *burst_buffer[MAX_PKTS_BURST] = { NULL };
 	struct rte_ring *ring_in, *ring_out;
 	const unsigned xor_val = (nb_ports > 1);
+	unsigned int core_id = rte_lcore_id();
 
 	args = (struct worker_thread_args *) args_ptr;
 	ring_in  = args->ring_in;
 	ring_out = args->ring_out;
 
 	RTE_LOG(INFO, REORDERAPP, "%s() started on lcore %u\n", __func__,
-							rte_lcore_id());
+							core_id);
 
 	while (!quit_signal) {
 
@@ -448,7 +489,7 @@ worker_thread(void *args_ptr)
 		if (unlikely(burst_size == 0))
 			continue;
 
-		__sync_fetch_and_add(&app_stats.wkr.dequeue_pkts, burst_size);
+		wkr_stats[core_id].deq_pkts += burst_size;
 
 		/* just do some operation on mbuf */
 		for (i = 0; i < burst_size;)
@@ -457,11 +498,10 @@ worker_thread(void *args_ptr)
 		/* enqueue the modified mbufs to workers_to_tx ring */
 		ret = rte_ring_enqueue_burst(ring_out, (void *)burst_buffer,
 				burst_size, NULL);
-		__sync_fetch_and_add(&app_stats.wkr.enqueue_pkts, ret);
+		wkr_stats[core_id].enq_pkts += ret;
 		if (unlikely(ret < burst_size)) {
 			/* Return the mbufs to their respective pool, dropping packets */
-			__sync_fetch_and_add(&app_stats.wkr.enqueue_failed_pkts,
-					(int)burst_size - ret);
+			wkr_stats[core_id].enq_failed_pkts += burst_size - ret;
 			pktmbuf_free_bulk(&burst_buffer[ret], burst_size - ret);
 		}
 	}
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread Phil Yang
@ 2019-04-03  6:59     ` Phil Yang
  2019-04-04 23:24     ` Thomas Monjalon
  1 sibling, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-03  6:59 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
The current implementation using the '__sync' built-ins to synchronize
statistics within worker threads. The '__sync' built-ins functions are
full barriers which will affect the performance, so add a per worker
packets statistics to remove the synchronisation between worker threads.
Since the maximum core number can get to 256, so disable the per core
stats print in default and add the --insight-worker option to enable it.
For example:
sudo examples/packet_ordering/arm64-armv8a-linuxapp-gcc/packet_ordering \
-l 112-115 --socket-mem=1024,1024 -n 4 -- -p 0x03 --insight-worker
RX thread stats:
 - Pkts rxd:                            226539223
 - Pkts enqd to workers ring:           226539223
Worker thread stats on core [113]:
 - Pkts deqd from workers ring:         77557888
 - Pkts enqd to tx ring:                77557888
 - Pkts enq to tx failed:               0
Worker thread stats on core [114]:
 - Pkts deqd from workers ring:         148981335
 - Pkts enqd to tx ring:                148981335
 - Pkts enq to tx failed:               0
Worker thread stats:
 - Pkts deqd from workers ring:         226539223
 - Pkts enqd to tx ring:                226539223
 - Pkts enq to tx failed:               0
TX stats:
 - Pkts deqd from tx ring:              226539223
 - Ro Pkts transmitted:                 226539168
 - Ro Pkts tx failed:                   0
 - Pkts transmitted w/o reorder:        0
 - Pkts tx failed w/o reorder:          0
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 2 files changed, 48 insertions(+), 6 deletions(-)
diff --git a/doc/guides/sample_app_ug/packet_ordering.rst b/doc/guides/sample_app_ug/packet_ordering.rst
index 7cfcf3f..1c8ee5d 100644
--- a/doc/guides/sample_app_ug/packet_ordering.rst
+++ b/doc/guides/sample_app_ug/packet_ordering.rst
@@ -43,7 +43,7 @@ The application execution command line is:
 
 .. code-block:: console
 
-    ./test-pipeline [EAL options] -- -p PORTMASK [--disable-reorder]
+    ./packet_ordering [EAL options] -- -p PORTMASK [--disable-reorder] [--insight-worker]
 
 The -c EAL CPU_COREMASK option has to contain at least 3 CPU cores.
 The first CPU core in the core mask is the master core and would be assigned to
@@ -56,3 +56,5 @@ then the other pair from 2 to 3 and from 3 to 2, having [0,1] and [2,3] pairs.
 
 The disable-reorder long option does, as its name implies, disable the reordering
 of traffic, which should help evaluate reordering performance impact.
+
+The insight-worker long option enables output the packet statistics of each worker thread.
diff --git a/examples/packet_ordering/main.c b/examples/packet_ordering/main.c
index 149bfdd..8145074 100644
--- a/examples/packet_ordering/main.c
+++ b/examples/packet_ordering/main.c
@@ -31,6 +31,7 @@
 
 unsigned int portmask;
 unsigned int disable_reorder;
+unsigned int insight_worker;
 volatile uint8_t quit_signal;
 
 static struct rte_mempool *mbuf_pool;
@@ -71,6 +72,14 @@ volatile struct app_stats {
 	} tx __rte_cache_aligned;
 } app_stats;
 
+/* per worker lcore stats */
+struct wkr_stats_per {
+		uint64_t deq_pkts;
+		uint64_t enq_pkts;
+		uint64_t enq_failed_pkts;
+} __rte_cache_aligned;
+
+static struct wkr_stats_per wkr_stats[RTE_MAX_LCORE] = {0};
 /**
  * Get the last enabled lcore ID
  *
@@ -152,6 +161,7 @@ parse_args(int argc, char **argv)
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"disable-reorder", 0, 0, 0},
+		{"insight-worker", 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -175,6 +185,11 @@ parse_args(int argc, char **argv)
 				printf("reorder disabled\n");
 				disable_reorder = 1;
 			}
+			if (!strcmp(lgopts[option_index].name,
+						"insight-worker")) {
+				printf("print all worker statistics\n");
+				insight_worker = 1;
+			}
 			break;
 		default:
 			print_usage(prgname);
@@ -319,6 +334,11 @@ print_stats(void)
 {
 	uint16_t i;
 	struct rte_eth_stats eth_stats;
+	unsigned int lcore_id, last_lcore_id, master_lcore_id, end_w_lcore_id;
+
+	last_lcore_id   = get_last_lcore_id();
+	master_lcore_id = rte_get_master_lcore();
+	end_w_lcore_id  = get_previous_lcore_id(last_lcore_id);
 
 	printf("\nRX thread stats:\n");
 	printf(" - Pkts rxd:				%"PRIu64"\n",
@@ -326,6 +346,26 @@ print_stats(void)
 	printf(" - Pkts enqd to workers ring:		%"PRIu64"\n",
 						app_stats.rx.enqueue_pkts);
 
+	for (lcore_id = 0; lcore_id <= end_w_lcore_id; lcore_id++) {
+		if (insight_worker
+			&& rte_lcore_is_enabled(lcore_id)
+			&& lcore_id != master_lcore_id) {
+			printf("\nWorker thread stats on core [%u]:\n",
+					lcore_id);
+			printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].deq_pkts);
+			printf(" - Pkts enqd to tx ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_pkts);
+			printf(" - Pkts enq to tx failed:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_failed_pkts);
+		}
+
+		app_stats.wkr.dequeue_pkts += wkr_stats[lcore_id].deq_pkts;
+		app_stats.wkr.enqueue_pkts += wkr_stats[lcore_id].enq_pkts;
+		app_stats.wkr.enqueue_failed_pkts +=
+			wkr_stats[lcore_id].enq_failed_pkts;
+	}
+
 	printf("\nWorker thread stats:\n");
 	printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
 						app_stats.wkr.dequeue_pkts);
@@ -432,13 +472,14 @@ worker_thread(void *args_ptr)
 	struct rte_mbuf *burst_buffer[MAX_PKTS_BURST] = { NULL };
 	struct rte_ring *ring_in, *ring_out;
 	const unsigned xor_val = (nb_ports > 1);
+	unsigned int core_id = rte_lcore_id();
 
 	args = (struct worker_thread_args *) args_ptr;
 	ring_in  = args->ring_in;
 	ring_out = args->ring_out;
 
 	RTE_LOG(INFO, REORDERAPP, "%s() started on lcore %u\n", __func__,
-							rte_lcore_id());
+							core_id);
 
 	while (!quit_signal) {
 
@@ -448,7 +489,7 @@ worker_thread(void *args_ptr)
 		if (unlikely(burst_size == 0))
 			continue;
 
-		__sync_fetch_and_add(&app_stats.wkr.dequeue_pkts, burst_size);
+		wkr_stats[core_id].deq_pkts += burst_size;
 
 		/* just do some operation on mbuf */
 		for (i = 0; i < burst_size;)
@@ -457,11 +498,10 @@ worker_thread(void *args_ptr)
 		/* enqueue the modified mbufs to workers_to_tx ring */
 		ret = rte_ring_enqueue_burst(ring_out, (void *)burst_buffer,
 				burst_size, NULL);
-		__sync_fetch_and_add(&app_stats.wkr.enqueue_pkts, ret);
+		wkr_stats[core_id].enq_pkts += ret;
 		if (unlikely(ret < burst_size)) {
 			/* Return the mbufs to their respective pool, dropping packets */
-			__sync_fetch_and_add(&app_stats.wkr.enqueue_failed_pkts,
-					(int)burst_size - ret);
+			wkr_stats[core_id].enq_failed_pkts += burst_size - ret;
 			pktmbuf_free_bulk(&burst_buffer[ret], burst_size - ret);
 		}
 	}
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread Phil Yang
  2019-04-03  6:59     ` Phil Yang
@ 2019-04-04 23:24     ` Thomas Monjalon
  2019-04-04 23:24       ` Thomas Monjalon
  2019-04-08  4:04       ` Phil Yang (Arm Technology China)
  1 sibling, 2 replies; 56+ messages in thread
From: Thomas Monjalon @ 2019-04-04 23:24 UTC (permalink / raw)
  To: Phil Yang
  Cc: dev, david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, nd
Hi,
03/04/2019 08:59, Phil Yang:
> The current implementation using the '__sync' built-ins to synchronize
> statistics within worker threads. The '__sync' built-ins functions are
> full barriers which will affect the performance, so add a per worker
> packets statistics to remove the synchronisation between worker threads.
> 
> Since the maximum core number can get to 256, so disable the per core
> stats print in default and add the --insight-worker option to enable it.
[...]
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
There is an error with clang:
examples/packet_ordering/main.c:82:57: error:
	suggest braces around initialization of subobject
static struct wkr_stats_per wkr_stats[RTE_MAX_LCORE] = {0};
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread
  2019-04-04 23:24     ` Thomas Monjalon
@ 2019-04-04 23:24       ` Thomas Monjalon
  2019-04-08  4:04       ` Phil Yang (Arm Technology China)
  1 sibling, 0 replies; 56+ messages in thread
From: Thomas Monjalon @ 2019-04-04 23:24 UTC (permalink / raw)
  To: Phil Yang
  Cc: dev, david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, nd
Hi,
03/04/2019 08:59, Phil Yang:
> The current implementation using the '__sync' built-ins to synchronize
> statistics within worker threads. The '__sync' built-ins functions are
> full barriers which will affect the performance, so add a per worker
> packets statistics to remove the synchronisation between worker threads.
> 
> Since the maximum core number can get to 256, so disable the per core
> stats print in default and add the --insight-worker option to enable it.
[...]
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
There is an error with clang:
examples/packet_ordering/main.c:82:57: error:
	suggest braces around initialization of subobject
static struct wkr_stats_per wkr_stats[RTE_MAX_LCORE] = {0};
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread
  2019-04-04 23:24     ` Thomas Monjalon
  2019-04-04 23:24       ` Thomas Monjalon
@ 2019-04-08  4:04       ` Phil Yang (Arm Technology China)
  2019-04-08  4:04         ` Phil Yang (Arm Technology China)
  1 sibling, 1 reply; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-04-08  4:04 UTC (permalink / raw)
  To: thomas
  Cc: dev, david.hunt, reshma.pattan, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, April 5, 2019 7:25 AM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>
> Cc: dev@dpdk.org; david.hunt@intel.com; reshma.pattan@intel.com; Gavin
> Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for
> each worker thread
> 
> Hi,
> 
> 03/04/2019 08:59, Phil Yang:
> > The current implementation using the '__sync' built-ins to synchronize
> > statistics within worker threads. The '__sync' built-ins functions are
> > full barriers which will affect the performance, so add a per worker
> > packets statistics to remove the synchronisation between worker threads.
> >
> > Since the maximum core number can get to 256, so disable the per core
> > stats print in default and add the --insight-worker option to enable it.
> [...]
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> 
> There is an error with clang:
> 
> examples/packet_ordering/main.c:82:57: error:
> 	suggest braces around initialization of subobject static struct
> wkr_stats_per wkr_stats[RTE_MAX_LCORE] = {0};
Thanks, Thomas.
It is my bad. I should test with the devtools/test-meson-builds.sh script instead of build with default meson compiler before sending it out.
I have addressed this defect in the new version. Please review it.
Thanks,
Phil
> 
> 
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * Re: [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread
  2019-04-08  4:04       ` Phil Yang (Arm Technology China)
@ 2019-04-08  4:04         ` Phil Yang (Arm Technology China)
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-04-08  4:04 UTC (permalink / raw)
  To: thomas
  Cc: dev, david.hunt, reshma.pattan, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, April 5, 2019 7:25 AM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>
> Cc: dev@dpdk.org; david.hunt@intel.com; reshma.pattan@intel.com; Gavin
> Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for
> each worker thread
> 
> Hi,
> 
> 03/04/2019 08:59, Phil Yang:
> > The current implementation using the '__sync' built-ins to synchronize
> > statistics within worker threads. The '__sync' built-ins functions are
> > full barriers which will affect the performance, so add a per worker
> > packets statistics to remove the synchronisation between worker threads.
> >
> > Since the maximum core number can get to 256, so disable the per core
> > stats print in default and add the --insight-worker option to enable it.
> [...]
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Phil Yang <phil.yang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> 
> There is an error with clang:
> 
> examples/packet_ordering/main.c:82:57: error:
> 	suggest braces around initialization of subobject static struct
> wkr_stats_per wkr_stats[RTE_MAX_LCORE] = {0};
Thanks, Thomas.
It is my bad. I should test with the devtools/test-meson-builds.sh script instead of build with default meson compiler before sending it out.
I have addressed this defect in the new version. Please review it.
Thanks,
Phil
> 
> 
^ permalink raw reply	[flat|nested] 56+ messages in thread
 
 
 
- * [dpdk-dev] [PATCH v3 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-03  6:59 ` [dpdk-dev] [PATCH v3 0/3] example and test cases optimizations Phil Yang
  2019-04-03  6:59   ` Phil Yang
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 1/3] packet_ordering: add statistics for each worker thread Phil Yang
@ 2019-04-03  6:59   ` Phil Yang
  2019-04-03  6:59     ` Phil Yang
  2019-04-04 15:30     ` Honnappa Nagarahalli
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 3/3] test/ring_perf: " Phil Yang
  3 siblings, 2 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-03  6:59 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
-n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest
*** distributor_perf_autotest without this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1519202730 ticks
Ticks per iteration = 45
*** distributor_perf_autotest with this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1251715496 ticks
Ticks per iteration = 37
Less ticks needed for the cache line switch test. It got 17% of
performance improvement.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_distributor.c      | 7 ++++---
 app/test/test_distributor_perf.c | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c
index 98919ec..0364637 100644
--- a/app/test/test_distributor.c
+++ b/app/test/test_distributor.c
@@ -62,7 +62,7 @@ handle_work(void *arg)
 	struct worker_params *wp = arg;
 	struct rte_distributor *db = wp->dist;
 	unsigned int count = 0, num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	int i;
 
 	for (i = 0; i < 8; i++)
@@ -270,7 +270,7 @@ handle_work_with_free_mbufs(void *arg)
 	unsigned int count = 0;
 	unsigned int i;
 	unsigned int num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
@@ -343,7 +343,8 @@ handle_work_for_shutdown_test(void *arg)
 	unsigned int total = 0;
 	unsigned int i;
 	unsigned int returned = 0;
-	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
+			__ATOMIC_RELAXED);
 
 	num = rte_distributor_get_pkt(d, id, buf, buf, num);
 
diff --git a/app/test/test_distributor_perf.c b/app/test/test_distributor_perf.c
index edf1998..89b28f0 100644
--- a/app/test/test_distributor_perf.c
+++ b/app/test/test_distributor_perf.c
@@ -111,7 +111,7 @@ handle_work(void *arg)
 	unsigned int count = 0;
 	unsigned int num = 0;
 	int i;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	struct rte_mbuf *buf[8] __rte_cache_aligned;
 
 	for (i = 0; i < 8; i++)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * [dpdk-dev] [PATCH v3 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
@ 2019-04-03  6:59     ` Phil Yang
  2019-04-04 15:30     ` Honnappa Nagarahalli
  1 sibling, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-03  6:59 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
-n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest
*** distributor_perf_autotest without this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1519202730 ticks
Ticks per iteration = 45
*** distributor_perf_autotest with this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1251715496 ticks
Ticks per iteration = 37
Less ticks needed for the cache line switch test. It got 17% of
performance improvement.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_distributor.c      | 7 ++++---
 app/test/test_distributor_perf.c | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c
index 98919ec..0364637 100644
--- a/app/test/test_distributor.c
+++ b/app/test/test_distributor.c
@@ -62,7 +62,7 @@ handle_work(void *arg)
 	struct worker_params *wp = arg;
 	struct rte_distributor *db = wp->dist;
 	unsigned int count = 0, num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	int i;
 
 	for (i = 0; i < 8; i++)
@@ -270,7 +270,7 @@ handle_work_with_free_mbufs(void *arg)
 	unsigned int count = 0;
 	unsigned int i;
 	unsigned int num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
@@ -343,7 +343,8 @@ handle_work_for_shutdown_test(void *arg)
 	unsigned int total = 0;
 	unsigned int i;
 	unsigned int returned = 0;
-	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
+			__ATOMIC_RELAXED);
 
 	num = rte_distributor_get_pkt(d, id, buf, buf, num);
 
diff --git a/app/test/test_distributor_perf.c b/app/test/test_distributor_perf.c
index edf1998..89b28f0 100644
--- a/app/test/test_distributor_perf.c
+++ b/app/test/test_distributor_perf.c
@@ -111,7 +111,7 @@ handle_work(void *arg)
 	unsigned int count = 0;
 	unsigned int num = 0;
 	int i;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	struct rte_mbuf *buf[8] __rte_cache_aligned;
 
 	for (i = 0; i < 8; i++)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v3 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
  2019-04-03  6:59     ` Phil Yang
@ 2019-04-04 15:30     ` Honnappa Nagarahalli
  2019-04-04 15:30       ` Honnappa Nagarahalli
  1 sibling, 1 reply; 56+ messages in thread
From: Honnappa Nagarahalli @ 2019-04-04 15:30 UTC (permalink / raw)
  To: Phil Yang (Arm Technology China), dev, thomas
  Cc: david.hunt, reshma.pattan, Gavin Hu (Arm Technology China),
	Phil Yang (Arm Technology China),
	nd, nd
> 
> '__sync' built-in functions are deprecated, should use the '__atomic'
> built-in instead. the sync built-in functions are full barriers, while atomic
> built-in functions offer less restrictive one-way barriers, which help
> performance.
> 
> Here is the example test result on TX2:
> sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \ -n 4 --socket-
> mem=1024,1024 -- -i
> RTE>>distributor_perf_autotest
> 
> *** distributor_perf_autotest without this patch *** ==== Cache line switch
> test === Time for 33554432 iterations = 1519202730 ticks Ticks per iteration
> = 45
> 
> *** distributor_perf_autotest with this patch *** ==== Cache line switch test
> === Time for 33554432 iterations = 1251715496 ticks Ticks per iteration = 37
> 
> Less ticks needed for the cache line switch test. It got 17% of performance
> improvement.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Joyce Kong <joyce.kong@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> ---
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v3 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-04 15:30     ` Honnappa Nagarahalli
@ 2019-04-04 15:30       ` Honnappa Nagarahalli
  0 siblings, 0 replies; 56+ messages in thread
From: Honnappa Nagarahalli @ 2019-04-04 15:30 UTC (permalink / raw)
  To: Phil Yang (Arm Technology China), dev, thomas
  Cc: david.hunt, reshma.pattan, Gavin Hu (Arm Technology China),
	Phil Yang (Arm Technology China),
	nd, nd
> 
> '__sync' built-in functions are deprecated, should use the '__atomic'
> built-in instead. the sync built-in functions are full barriers, while atomic
> built-in functions offer less restrictive one-way barriers, which help
> performance.
> 
> Here is the example test result on TX2:
> sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \ -n 4 --socket-
> mem=1024,1024 -- -i
> RTE>>distributor_perf_autotest
> 
> *** distributor_perf_autotest without this patch *** ==== Cache line switch
> test === Time for 33554432 iterations = 1519202730 ticks Ticks per iteration
> = 45
> 
> *** distributor_perf_autotest with this patch *** ==== Cache line switch test
> === Time for 33554432 iterations = 1251715496 ticks Ticks per iteration = 37
> 
> Less ticks needed for the cache line switch test. It got 17% of performance
> improvement.
> 
> Signed-off-by: Phil Yang <phil.yang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Joyce Kong <joyce.kong@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> ---
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
^ permalink raw reply	[flat|nested] 56+ messages in thread 
 
 
- * [dpdk-dev] [PATCH v3 3/3] test/ring_perf: replace sync builtins with atomic builtins
  2019-04-03  6:59 ` [dpdk-dev] [PATCH v3 0/3] example and test cases optimizations Phil Yang
                     ` (2 preceding siblings ...)
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
@ 2019-04-03  6:59   ` Phil Yang
  2019-04-03  6:59     ` Phil Yang
  3 siblings, 1 reply; 56+ messages in thread
From: Phil Yang @ 2019-04-03  6:59 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -c 0x7fffffe \
-n 4 --socket-mem=1024,0 --file-prefix=~ -- -i
RTE>>ring_perf_autotest
*** ring_perf_autotest without this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.22
MP/MC bulk enq/dequeue (size: 8): 11.50
SP/SC bulk enq/dequeue (size: 32): 1.85
MP/MC bulk enq/dequeue (size: 32): 2.66
*** ring_perf_autotest with this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.13
MP/MC bulk enq/dequeue (size: 8): 9.83
SP/SC bulk enq/dequeue (size: 32): 1.96
MP/MC bulk enq/dequeue (size: 32): 2.30
So for the ring performance test, this patch improved 11% of ring
operations performance.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_ring_perf.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c
index ebb3939..e851c1a 100644
--- a/app/test/test_ring_perf.c
+++ b/app/test/test_ring_perf.c
@@ -160,7 +160,11 @@ enqueue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
@@ -196,7 +200,11 @@ dequeue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v3 3/3] test/ring_perf: replace sync builtins with atomic builtins
  2019-04-03  6:59   ` [dpdk-dev] [PATCH v3 3/3] test/ring_perf: " Phil Yang
@ 2019-04-03  6:59     ` Phil Yang
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-03  6:59 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -c 0x7fffffe \
-n 4 --socket-mem=1024,0 --file-prefix=~ -- -i
RTE>>ring_perf_autotest
*** ring_perf_autotest without this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.22
MP/MC bulk enq/dequeue (size: 8): 11.50
SP/SC bulk enq/dequeue (size: 32): 1.85
MP/MC bulk enq/dequeue (size: 32): 2.66
*** ring_perf_autotest with this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.13
MP/MC bulk enq/dequeue (size: 8): 9.83
SP/SC bulk enq/dequeue (size: 32): 1.96
MP/MC bulk enq/dequeue (size: 32): 2.30
So for the ring performance test, this patch improved 11% of ring
operations performance.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_ring_perf.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c
index ebb3939..e851c1a 100644
--- a/app/test/test_ring_perf.c
+++ b/app/test/test_ring_perf.c
@@ -160,7 +160,11 @@ enqueue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
@@ -196,7 +200,11 @@ dequeue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
 
 
- * [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
                   ` (5 preceding siblings ...)
  2019-04-03  6:59 ` [dpdk-dev] [PATCH v3 0/3] example and test cases optimizations Phil Yang
@ 2019-04-08  3:02 ` Phil Yang
  2019-04-08  3:02   ` Phil Yang
                     ` (2 more replies)
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 1/3] packet_ordering: add statistics for each worker thread Phil Yang
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-08  3:02 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
v4:
packet_ordering: Fixed -Wmissing-braces issue for clang build.
(Thomas Monjalon)
v3:
test_distributor: Remove the conditional compilation and just keep the
__atomic_xxx calls. (Honnappa Nagarahalli)
v2:
1. Add per lcore statistics for each worker thread, removed __sync
builtins.
2. Reimplement test_distributor with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
3. Reimplement test_ring_perf with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
v1:
Reimplement packet_ordering with __atomic one-way barrier.
Phil Yang (3):
  packet_ordering: add statistics for each worker thread
  test/distributor: replace sync builtins with atomic builtins
  test/ring_perf: replace sync builtins with atomic builtins
 app/test/test_distributor.c                  |  7 ++--
 app/test/test_distributor_perf.c             |  2 +-
 app/test/test_ring_perf.c                    | 12 +++++--
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 5 files changed, 63 insertions(+), 12 deletions(-)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations Phil Yang
@ 2019-04-08  3:02   ` Phil Yang
  2019-07-04 20:15   ` Thomas Monjalon
  2019-07-08 14:38   ` Thomas Monjalon
  2 siblings, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-08  3:02 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
v4:
packet_ordering: Fixed -Wmissing-braces issue for clang build.
(Thomas Monjalon)
v3:
test_distributor: Remove the conditional compilation and just keep the
__atomic_xxx calls. (Honnappa Nagarahalli)
v2:
1. Add per lcore statistics for each worker thread, removed __sync
builtins.
2. Reimplement test_distributor with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
3. Reimplement test_ring_perf with atomic one-way barrier, if
C11_MEM_MODEL is enabled.
v1:
Reimplement packet_ordering with __atomic one-way barrier.
Phil Yang (3):
  packet_ordering: add statistics for each worker thread
  test/distributor: replace sync builtins with atomic builtins
  test/ring_perf: replace sync builtins with atomic builtins
 app/test/test_distributor.c                  |  7 ++--
 app/test/test_distributor_perf.c             |  2 +-
 app/test/test_ring_perf.c                    | 12 +++++--
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 5 files changed, 63 insertions(+), 12 deletions(-)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations Phil Yang
  2019-04-08  3:02   ` Phil Yang
@ 2019-07-04 20:15   ` Thomas Monjalon
  2019-07-05  3:19     ` Phil Yang (Arm Technology China)
  2019-07-08 14:38   ` Thomas Monjalon
  2 siblings, 1 reply; 56+ messages in thread
From: Thomas Monjalon @ 2019-07-04 20:15 UTC (permalink / raw)
  To: dev, Phil Yang
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, nd,
	bruce.richardson, konstantin.ananyev, david.marchand
08/04/2019 05:02, Phil Yang:
> Phil Yang (3):
>   packet_ordering: add statistics for each worker thread
>   test/distributor: replace sync builtins with atomic builtins
>   test/ring_perf: replace sync builtins with atomic builtins
No more update regarding the performance gain or loss?
Any decision to merge it in 19.08?
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations
  2019-07-04 20:15   ` Thomas Monjalon
@ 2019-07-05  3:19     ` Phil Yang (Arm Technology China)
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-07-05  3:19 UTC (permalink / raw)
  To: thomas, dev
  Cc: david.hunt, reshma.pattan, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, bruce.richardson, konstantin.ananyev,
	david.marchand, nd
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, July 5, 2019 4:16 AM
> To: dev@dpdk.org; Phil Yang (Arm Technology China) <Phil.Yang@arm.com>
> Cc: david.hunt@intel.com; reshma.pattan@intel.com; Gavin Hu (Arm
> Technology China) <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> david.marchand@redhat.com
> Subject: Re: [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations
> 
> 08/04/2019 05:02, Phil Yang:
> > Phil Yang (3):
> >   packet_ordering: add statistics for each worker thread
> >   test/distributor: replace sync builtins with atomic builtins
> >   test/ring_perf: replace sync builtins with atomic builtins
> 
Hi Thomas,
Thanks for refreshing.
> No more update regarding the performance gain or loss?
Currently, I have no more update on this.  I didn't saw any performance loss with those patches on x86_64 and aarch64 platforms.
> 
> Any decision to merge it in 19.08?
> 
Thanks,
Phil
^ permalink raw reply	[flat|nested] 56+ messages in thread 
 
- * Re: [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations Phil Yang
  2019-04-08  3:02   ` Phil Yang
  2019-07-04 20:15   ` Thomas Monjalon
@ 2019-07-08 14:38   ` Thomas Monjalon
  2 siblings, 0 replies; 56+ messages in thread
From: Thomas Monjalon @ 2019-07-08 14:38 UTC (permalink / raw)
  To: Phil Yang
  Cc: dev, david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, nd
08/04/2019 05:02, Phil Yang:
> Phil Yang (3):
>   packet_ordering: add statistics for each worker thread
>   test/distributor: replace sync builtins with atomic builtins
>   test/ring_perf: replace sync builtins with atomic builtins
After months of wait for comments,
Applied, thanks
^ permalink raw reply	[flat|nested] 56+ messages in thread 
 
- * [dpdk-dev] [PATCH v4 1/3] packet_ordering: add statistics for each worker thread
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
                   ` (6 preceding siblings ...)
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 0/3] example and test cases optimizations Phil Yang
@ 2019-04-08  3:02 ` Phil Yang
  2019-04-08  3:02   ` Phil Yang
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 3/3] test/ring_perf: " Phil Yang
  9 siblings, 1 reply; 56+ messages in thread
From: Phil Yang @ 2019-04-08  3:02 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
The current implementation using the '__sync' built-ins to synchronize
statistics within worker threads. The '__sync' built-ins functions are
full barriers which will affect the performance, so add a per worker
packets statistics to remove the synchronisation between worker threads.
Since the maximum core number can get to 256, so disable the per core
stats print in default and add the --insight-worker option to enable it.
For example:
sudo examples/packet_ordering/arm64-armv8a-linuxapp-gcc/packet_ordering \
-l 112-115 --socket-mem=1024,1024 -n 4 -- -p 0x03 --insight-worker
RX thread stats:
 - Pkts rxd:                            226539223
 - Pkts enqd to workers ring:           226539223
Worker thread stats on core [113]:
 - Pkts deqd from workers ring:         77557888
 - Pkts enqd to tx ring:                77557888
 - Pkts enq to tx failed:               0
Worker thread stats on core [114]:
 - Pkts deqd from workers ring:         148981335
 - Pkts enqd to tx ring:                148981335
 - Pkts enq to tx failed:               0
Worker thread stats:
 - Pkts deqd from workers ring:         226539223
 - Pkts enqd to tx ring:                226539223
 - Pkts enq to tx failed:               0
TX stats:
 - Pkts deqd from tx ring:              226539223
 - Ro Pkts transmitted:                 226539168
 - Ro Pkts tx failed:                   0
 - Pkts transmitted w/o reorder:        0
 - Pkts tx failed w/o reorder:          0
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 2 files changed, 48 insertions(+), 6 deletions(-)
diff --git a/doc/guides/sample_app_ug/packet_ordering.rst b/doc/guides/sample_app_ug/packet_ordering.rst
index 7cfcf3f..1c8ee5d 100644
--- a/doc/guides/sample_app_ug/packet_ordering.rst
+++ b/doc/guides/sample_app_ug/packet_ordering.rst
@@ -43,7 +43,7 @@ The application execution command line is:
 
 .. code-block:: console
 
-    ./test-pipeline [EAL options] -- -p PORTMASK [--disable-reorder]
+    ./packet_ordering [EAL options] -- -p PORTMASK [--disable-reorder] [--insight-worker]
 
 The -c EAL CPU_COREMASK option has to contain at least 3 CPU cores.
 The first CPU core in the core mask is the master core and would be assigned to
@@ -56,3 +56,5 @@ then the other pair from 2 to 3 and from 3 to 2, having [0,1] and [2,3] pairs.
 
 The disable-reorder long option does, as its name implies, disable the reordering
 of traffic, which should help evaluate reordering performance impact.
+
+The insight-worker long option enables output the packet statistics of each worker thread.
diff --git a/examples/packet_ordering/main.c b/examples/packet_ordering/main.c
index 149bfdd..1f1bf37 100644
--- a/examples/packet_ordering/main.c
+++ b/examples/packet_ordering/main.c
@@ -31,6 +31,7 @@
 
 unsigned int portmask;
 unsigned int disable_reorder;
+unsigned int insight_worker;
 volatile uint8_t quit_signal;
 
 static struct rte_mempool *mbuf_pool;
@@ -71,6 +72,14 @@ volatile struct app_stats {
 	} tx __rte_cache_aligned;
 } app_stats;
 
+/* per worker lcore stats */
+struct wkr_stats_per {
+		uint64_t deq_pkts;
+		uint64_t enq_pkts;
+		uint64_t enq_failed_pkts;
+} __rte_cache_aligned;
+
+static struct wkr_stats_per wkr_stats[RTE_MAX_LCORE] = { {0} };
 /**
  * Get the last enabled lcore ID
  *
@@ -152,6 +161,7 @@ parse_args(int argc, char **argv)
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"disable-reorder", 0, 0, 0},
+		{"insight-worker", 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -175,6 +185,11 @@ parse_args(int argc, char **argv)
 				printf("reorder disabled\n");
 				disable_reorder = 1;
 			}
+			if (!strcmp(lgopts[option_index].name,
+						"insight-worker")) {
+				printf("print all worker statistics\n");
+				insight_worker = 1;
+			}
 			break;
 		default:
 			print_usage(prgname);
@@ -319,6 +334,11 @@ print_stats(void)
 {
 	uint16_t i;
 	struct rte_eth_stats eth_stats;
+	unsigned int lcore_id, last_lcore_id, master_lcore_id, end_w_lcore_id;
+
+	last_lcore_id   = get_last_lcore_id();
+	master_lcore_id = rte_get_master_lcore();
+	end_w_lcore_id  = get_previous_lcore_id(last_lcore_id);
 
 	printf("\nRX thread stats:\n");
 	printf(" - Pkts rxd:				%"PRIu64"\n",
@@ -326,6 +346,26 @@ print_stats(void)
 	printf(" - Pkts enqd to workers ring:		%"PRIu64"\n",
 						app_stats.rx.enqueue_pkts);
 
+	for (lcore_id = 0; lcore_id <= end_w_lcore_id; lcore_id++) {
+		if (insight_worker
+			&& rte_lcore_is_enabled(lcore_id)
+			&& lcore_id != master_lcore_id) {
+			printf("\nWorker thread stats on core [%u]:\n",
+					lcore_id);
+			printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].deq_pkts);
+			printf(" - Pkts enqd to tx ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_pkts);
+			printf(" - Pkts enq to tx failed:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_failed_pkts);
+		}
+
+		app_stats.wkr.dequeue_pkts += wkr_stats[lcore_id].deq_pkts;
+		app_stats.wkr.enqueue_pkts += wkr_stats[lcore_id].enq_pkts;
+		app_stats.wkr.enqueue_failed_pkts +=
+			wkr_stats[lcore_id].enq_failed_pkts;
+	}
+
 	printf("\nWorker thread stats:\n");
 	printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
 						app_stats.wkr.dequeue_pkts);
@@ -432,13 +472,14 @@ worker_thread(void *args_ptr)
 	struct rte_mbuf *burst_buffer[MAX_PKTS_BURST] = { NULL };
 	struct rte_ring *ring_in, *ring_out;
 	const unsigned xor_val = (nb_ports > 1);
+	unsigned int core_id = rte_lcore_id();
 
 	args = (struct worker_thread_args *) args_ptr;
 	ring_in  = args->ring_in;
 	ring_out = args->ring_out;
 
 	RTE_LOG(INFO, REORDERAPP, "%s() started on lcore %u\n", __func__,
-							rte_lcore_id());
+							core_id);
 
 	while (!quit_signal) {
 
@@ -448,7 +489,7 @@ worker_thread(void *args_ptr)
 		if (unlikely(burst_size == 0))
 			continue;
 
-		__sync_fetch_and_add(&app_stats.wkr.dequeue_pkts, burst_size);
+		wkr_stats[core_id].deq_pkts += burst_size;
 
 		/* just do some operation on mbuf */
 		for (i = 0; i < burst_size;)
@@ -457,11 +498,10 @@ worker_thread(void *args_ptr)
 		/* enqueue the modified mbufs to workers_to_tx ring */
 		ret = rte_ring_enqueue_burst(ring_out, (void *)burst_buffer,
 				burst_size, NULL);
-		__sync_fetch_and_add(&app_stats.wkr.enqueue_pkts, ret);
+		wkr_stats[core_id].enq_pkts += ret;
 		if (unlikely(ret < burst_size)) {
 			/* Return the mbufs to their respective pool, dropping packets */
-			__sync_fetch_and_add(&app_stats.wkr.enqueue_failed_pkts,
-					(int)burst_size - ret);
+			wkr_stats[core_id].enq_failed_pkts += burst_size - ret;
 			pktmbuf_free_bulk(&burst_buffer[ret], burst_size - ret);
 		}
 	}
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v4 1/3] packet_ordering: add statistics for each worker thread
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 1/3] packet_ordering: add statistics for each worker thread Phil Yang
@ 2019-04-08  3:02   ` Phil Yang
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-08  3:02 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
The current implementation using the '__sync' built-ins to synchronize
statistics within worker threads. The '__sync' built-ins functions are
full barriers which will affect the performance, so add a per worker
packets statistics to remove the synchronisation between worker threads.
Since the maximum core number can get to 256, so disable the per core
stats print in default and add the --insight-worker option to enable it.
For example:
sudo examples/packet_ordering/arm64-armv8a-linuxapp-gcc/packet_ordering \
-l 112-115 --socket-mem=1024,1024 -n 4 -- -p 0x03 --insight-worker
RX thread stats:
 - Pkts rxd:                            226539223
 - Pkts enqd to workers ring:           226539223
Worker thread stats on core [113]:
 - Pkts deqd from workers ring:         77557888
 - Pkts enqd to tx ring:                77557888
 - Pkts enq to tx failed:               0
Worker thread stats on core [114]:
 - Pkts deqd from workers ring:         148981335
 - Pkts enqd to tx ring:                148981335
 - Pkts enq to tx failed:               0
Worker thread stats:
 - Pkts deqd from workers ring:         226539223
 - Pkts enqd to tx ring:                226539223
 - Pkts enq to tx failed:               0
TX stats:
 - Pkts deqd from tx ring:              226539223
 - Ro Pkts transmitted:                 226539168
 - Ro Pkts tx failed:                   0
 - Pkts transmitted w/o reorder:        0
 - Pkts tx failed w/o reorder:          0
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 doc/guides/sample_app_ug/packet_ordering.rst |  4 ++-
 examples/packet_ordering/main.c              | 50 +++++++++++++++++++++++++---
 2 files changed, 48 insertions(+), 6 deletions(-)
diff --git a/doc/guides/sample_app_ug/packet_ordering.rst b/doc/guides/sample_app_ug/packet_ordering.rst
index 7cfcf3f..1c8ee5d 100644
--- a/doc/guides/sample_app_ug/packet_ordering.rst
+++ b/doc/guides/sample_app_ug/packet_ordering.rst
@@ -43,7 +43,7 @@ The application execution command line is:
 
 .. code-block:: console
 
-    ./test-pipeline [EAL options] -- -p PORTMASK [--disable-reorder]
+    ./packet_ordering [EAL options] -- -p PORTMASK [--disable-reorder] [--insight-worker]
 
 The -c EAL CPU_COREMASK option has to contain at least 3 CPU cores.
 The first CPU core in the core mask is the master core and would be assigned to
@@ -56,3 +56,5 @@ then the other pair from 2 to 3 and from 3 to 2, having [0,1] and [2,3] pairs.
 
 The disable-reorder long option does, as its name implies, disable the reordering
 of traffic, which should help evaluate reordering performance impact.
+
+The insight-worker long option enables output the packet statistics of each worker thread.
diff --git a/examples/packet_ordering/main.c b/examples/packet_ordering/main.c
index 149bfdd..1f1bf37 100644
--- a/examples/packet_ordering/main.c
+++ b/examples/packet_ordering/main.c
@@ -31,6 +31,7 @@
 
 unsigned int portmask;
 unsigned int disable_reorder;
+unsigned int insight_worker;
 volatile uint8_t quit_signal;
 
 static struct rte_mempool *mbuf_pool;
@@ -71,6 +72,14 @@ volatile struct app_stats {
 	} tx __rte_cache_aligned;
 } app_stats;
 
+/* per worker lcore stats */
+struct wkr_stats_per {
+		uint64_t deq_pkts;
+		uint64_t enq_pkts;
+		uint64_t enq_failed_pkts;
+} __rte_cache_aligned;
+
+static struct wkr_stats_per wkr_stats[RTE_MAX_LCORE] = { {0} };
 /**
  * Get the last enabled lcore ID
  *
@@ -152,6 +161,7 @@ parse_args(int argc, char **argv)
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"disable-reorder", 0, 0, 0},
+		{"insight-worker", 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
@@ -175,6 +185,11 @@ parse_args(int argc, char **argv)
 				printf("reorder disabled\n");
 				disable_reorder = 1;
 			}
+			if (!strcmp(lgopts[option_index].name,
+						"insight-worker")) {
+				printf("print all worker statistics\n");
+				insight_worker = 1;
+			}
 			break;
 		default:
 			print_usage(prgname);
@@ -319,6 +334,11 @@ print_stats(void)
 {
 	uint16_t i;
 	struct rte_eth_stats eth_stats;
+	unsigned int lcore_id, last_lcore_id, master_lcore_id, end_w_lcore_id;
+
+	last_lcore_id   = get_last_lcore_id();
+	master_lcore_id = rte_get_master_lcore();
+	end_w_lcore_id  = get_previous_lcore_id(last_lcore_id);
 
 	printf("\nRX thread stats:\n");
 	printf(" - Pkts rxd:				%"PRIu64"\n",
@@ -326,6 +346,26 @@ print_stats(void)
 	printf(" - Pkts enqd to workers ring:		%"PRIu64"\n",
 						app_stats.rx.enqueue_pkts);
 
+	for (lcore_id = 0; lcore_id <= end_w_lcore_id; lcore_id++) {
+		if (insight_worker
+			&& rte_lcore_is_enabled(lcore_id)
+			&& lcore_id != master_lcore_id) {
+			printf("\nWorker thread stats on core [%u]:\n",
+					lcore_id);
+			printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].deq_pkts);
+			printf(" - Pkts enqd to tx ring:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_pkts);
+			printf(" - Pkts enq to tx failed:		%"PRIu64"\n",
+					wkr_stats[lcore_id].enq_failed_pkts);
+		}
+
+		app_stats.wkr.dequeue_pkts += wkr_stats[lcore_id].deq_pkts;
+		app_stats.wkr.enqueue_pkts += wkr_stats[lcore_id].enq_pkts;
+		app_stats.wkr.enqueue_failed_pkts +=
+			wkr_stats[lcore_id].enq_failed_pkts;
+	}
+
 	printf("\nWorker thread stats:\n");
 	printf(" - Pkts deqd from workers ring:		%"PRIu64"\n",
 						app_stats.wkr.dequeue_pkts);
@@ -432,13 +472,14 @@ worker_thread(void *args_ptr)
 	struct rte_mbuf *burst_buffer[MAX_PKTS_BURST] = { NULL };
 	struct rte_ring *ring_in, *ring_out;
 	const unsigned xor_val = (nb_ports > 1);
+	unsigned int core_id = rte_lcore_id();
 
 	args = (struct worker_thread_args *) args_ptr;
 	ring_in  = args->ring_in;
 	ring_out = args->ring_out;
 
 	RTE_LOG(INFO, REORDERAPP, "%s() started on lcore %u\n", __func__,
-							rte_lcore_id());
+							core_id);
 
 	while (!quit_signal) {
 
@@ -448,7 +489,7 @@ worker_thread(void *args_ptr)
 		if (unlikely(burst_size == 0))
 			continue;
 
-		__sync_fetch_and_add(&app_stats.wkr.dequeue_pkts, burst_size);
+		wkr_stats[core_id].deq_pkts += burst_size;
 
 		/* just do some operation on mbuf */
 		for (i = 0; i < burst_size;)
@@ -457,11 +498,10 @@ worker_thread(void *args_ptr)
 		/* enqueue the modified mbufs to workers_to_tx ring */
 		ret = rte_ring_enqueue_burst(ring_out, (void *)burst_buffer,
 				burst_size, NULL);
-		__sync_fetch_and_add(&app_stats.wkr.enqueue_pkts, ret);
+		wkr_stats[core_id].enq_pkts += ret;
 		if (unlikely(ret < burst_size)) {
 			/* Return the mbufs to their respective pool, dropping packets */
-			__sync_fetch_and_add(&app_stats.wkr.enqueue_failed_pkts,
-					(int)burst_size - ret);
+			wkr_stats[core_id].enq_failed_pkts += burst_size - ret;
 			pktmbuf_free_bulk(&burst_buffer[ret], burst_size - ret);
 		}
 	}
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
 
- * [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
                   ` (7 preceding siblings ...)
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 1/3] packet_ordering: add statistics for each worker thread Phil Yang
@ 2019-04-08  3:02 ` Phil Yang
  2019-04-08  3:02   ` Phil Yang
  2019-04-10 14:05   ` Hunt, David
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 3/3] test/ring_perf: " Phil Yang
  9 siblings, 2 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-08  3:02 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
-n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest
*** distributor_perf_autotest without this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1519202730 ticks
Ticks per iteration = 45
*** distributor_perf_autotest with this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1251715496 ticks
Ticks per iteration = 37
Less ticks needed for the cache line switch test. It got 17% of
performance improvement.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 app/test/test_distributor.c      | 7 ++++---
 app/test/test_distributor_perf.c | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c
index 98919ec..0364637 100644
--- a/app/test/test_distributor.c
+++ b/app/test/test_distributor.c
@@ -62,7 +62,7 @@ handle_work(void *arg)
 	struct worker_params *wp = arg;
 	struct rte_distributor *db = wp->dist;
 	unsigned int count = 0, num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	int i;
 
 	for (i = 0; i < 8; i++)
@@ -270,7 +270,7 @@ handle_work_with_free_mbufs(void *arg)
 	unsigned int count = 0;
 	unsigned int i;
 	unsigned int num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
@@ -343,7 +343,8 @@ handle_work_for_shutdown_test(void *arg)
 	unsigned int total = 0;
 	unsigned int i;
 	unsigned int returned = 0;
-	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
+			__ATOMIC_RELAXED);
 
 	num = rte_distributor_get_pkt(d, id, buf, buf, num);
 
diff --git a/app/test/test_distributor_perf.c b/app/test/test_distributor_perf.c
index edf1998..89b28f0 100644
--- a/app/test/test_distributor_perf.c
+++ b/app/test/test_distributor_perf.c
@@ -111,7 +111,7 @@ handle_work(void *arg)
 	unsigned int count = 0;
 	unsigned int num = 0;
 	int i;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	struct rte_mbuf *buf[8] __rte_cache_aligned;
 
 	for (i = 0; i < 8; i++)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
@ 2019-04-08  3:02   ` Phil Yang
  2019-04-10 14:05   ` Hunt, David
  1 sibling, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-08  3:02 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
-n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest
*** distributor_perf_autotest without this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1519202730 ticks
Ticks per iteration = 45
*** distributor_perf_autotest with this patch ***
==== Cache line switch test ===
Time for 33554432 iterations = 1251715496 ticks
Ticks per iteration = 37
Less ticks needed for the cache line switch test. It got 17% of
performance improvement.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 app/test/test_distributor.c      | 7 ++++---
 app/test/test_distributor_perf.c | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/app/test/test_distributor.c b/app/test/test_distributor.c
index 98919ec..0364637 100644
--- a/app/test/test_distributor.c
+++ b/app/test/test_distributor.c
@@ -62,7 +62,7 @@ handle_work(void *arg)
 	struct worker_params *wp = arg;
 	struct rte_distributor *db = wp->dist;
 	unsigned int count = 0, num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	int i;
 
 	for (i = 0; i < 8; i++)
@@ -270,7 +270,7 @@ handle_work_with_free_mbufs(void *arg)
 	unsigned int count = 0;
 	unsigned int i;
 	unsigned int num = 0;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 
 	for (i = 0; i < 8; i++)
 		buf[i] = NULL;
@@ -343,7 +343,8 @@ handle_work_for_shutdown_test(void *arg)
 	unsigned int total = 0;
 	unsigned int i;
 	unsigned int returned = 0;
-	const unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	const unsigned int id = __atomic_fetch_add(&worker_idx, 1,
+			__ATOMIC_RELAXED);
 
 	num = rte_distributor_get_pkt(d, id, buf, buf, num);
 
diff --git a/app/test/test_distributor_perf.c b/app/test/test_distributor_perf.c
index edf1998..89b28f0 100644
--- a/app/test/test_distributor_perf.c
+++ b/app/test/test_distributor_perf.c
@@ -111,7 +111,7 @@ handle_work(void *arg)
 	unsigned int count = 0;
 	unsigned int num = 0;
 	int i;
-	unsigned int id = __sync_fetch_and_add(&worker_idx, 1);
+	unsigned int id = __atomic_fetch_add(&worker_idx, 1, __ATOMIC_RELAXED);
 	struct rte_mbuf *buf[8] __rte_cache_aligned;
 
 	for (i = 0; i < 8; i++)
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
  2019-04-08  3:02   ` Phil Yang
@ 2019-04-10 14:05   ` Hunt, David
  2019-04-10 14:05     ` Hunt, David
  2019-04-11 11:31     ` Phil Yang (Arm Technology China)
  1 sibling, 2 replies; 56+ messages in thread
From: Hunt, David @ 2019-04-10 14:05 UTC (permalink / raw)
  To: Phil Yang, dev, thomas; +Cc: reshma.pattan, gavin.hu, honnappa.nagarahalli, nd
Hi Phil,
On 8/4/2019 4:02 AM, Phil Yang wrote:
> '__sync' built-in functions are deprecated, should use the '__atomic'
> built-in instead. the sync built-in functions are full barriers, while
> atomic built-in functions offer less restrictive one-way barriers,
> which help performance.
>
> Here is the example test result on TX2:
> sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
> -n 4 --socket-mem=1024,1024 -- -i
> RTE>>distributor_perf_autotest
>
> *** distributor_perf_autotest without this patch ***
> ==== Cache line switch test ===
> Time for 33554432 iterations = 1519202730 ticks
> Ticks per iteration = 45
>
> *** distributor_perf_autotest with this patch ***
> ==== Cache line switch test ===
> Time for 33554432 iterations = 1251715496 ticks
> Ticks per iteration = 37
>
> Less ticks needed for the cache line switch test. It got 17% of
> performance improvement.
I'm seeing about an 8% performance degradation on my platform for the 
cache line switch test with the patch, however the single mode and burst 
mode tests area showing no difference, which are the more important 
tests. What kind of differences are you seeing in the single/burst mode 
tests?
Rgds,
Dave.
---snip---
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-10 14:05   ` Hunt, David
@ 2019-04-10 14:05     ` Hunt, David
  2019-04-11 11:31     ` Phil Yang (Arm Technology China)
  1 sibling, 0 replies; 56+ messages in thread
From: Hunt, David @ 2019-04-10 14:05 UTC (permalink / raw)
  To: Phil Yang, dev, thomas; +Cc: reshma.pattan, gavin.hu, honnappa.nagarahalli, nd
Hi Phil,
On 8/4/2019 4:02 AM, Phil Yang wrote:
> '__sync' built-in functions are deprecated, should use the '__atomic'
> built-in instead. the sync built-in functions are full barriers, while
> atomic built-in functions offer less restrictive one-way barriers,
> which help performance.
>
> Here is the example test result on TX2:
> sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
> -n 4 --socket-mem=1024,1024 -- -i
> RTE>>distributor_perf_autotest
>
> *** distributor_perf_autotest without this patch ***
> ==== Cache line switch test ===
> Time for 33554432 iterations = 1519202730 ticks
> Ticks per iteration = 45
>
> *** distributor_perf_autotest with this patch ***
> ==== Cache line switch test ===
> Time for 33554432 iterations = 1251715496 ticks
> Ticks per iteration = 37
>
> Less ticks needed for the cache line switch test. It got 17% of
> performance improvement.
I'm seeing about an 8% performance degradation on my platform for the 
cache line switch test with the patch, however the single mode and burst 
mode tests area showing no difference, which are the more important 
tests. What kind of differences are you seeing in the single/burst mode 
tests?
Rgds,
Dave.
---snip---
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-10 14:05   ` Hunt, David
  2019-04-10 14:05     ` Hunt, David
@ 2019-04-11 11:31     ` Phil Yang (Arm Technology China)
  2019-04-11 11:31       ` Phil Yang (Arm Technology China)
  1 sibling, 1 reply; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-04-11 11:31 UTC (permalink / raw)
  To: Hunt, David, dev, thomas
  Cc: reshma.pattan, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd
> -----Original Message-----
> From: Hunt, David <david.hunt@intel.com>
> Sent: Wednesday, April 10, 2019 10:06 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: reshma.pattan@intel.com; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: Re: [PATCH v4 2/3] test/distributor: replace sync builtins with atomic
> builtins
> 
> Hi Phil,
> 
> On 8/4/2019 4:02 AM, Phil Yang wrote:
> > '__sync' built-in functions are deprecated, should use the '__atomic'
> > built-in instead. the sync built-in functions are full barriers, while
> > atomic built-in functions offer less restrictive one-way barriers,
> > which help performance.
> >
> > Here is the example test result on TX2:
> > sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \ -n 4
> > --socket-mem=1024,1024 -- -i
> > RTE>>distributor_perf_autotest
> >
> > *** distributor_perf_autotest without this patch *** ==== Cache line
> > switch test === Time for 33554432 iterations = 1519202730 ticks Ticks
> > per iteration = 45
> >
> > *** distributor_perf_autotest with this patch *** ==== Cache line
> > switch test === Time for 33554432 iterations = 1251715496 ticks Ticks
> > per iteration = 37
> >
> > Less ticks needed for the cache line switch test. It got 17% of
> > performance improvement.
> 
> 
Hi, Dave
Thanks for your input.
> I'm seeing about an 8% performance degradation on my platform for the
I'd tested this patch on our x86 server (E5-2640 v3 @ 2.60GHz) several rounds. However, I didn't found performance degradation. Please check the test result below.
$ sudo  ./x86_64-native-linuxapp-gcc/app/test -l 8-15 -n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest
####  without this patch ####
==== Cache line switch test ===
Time for 33554432 iterations = 12379399910 ticks
Ticks per iteration = 368
=== Performance test of distributor (single mode) ===
Time per burst:  5815
Time per packet: 90
=== Performance test of distributor (burst mode) ===
Time per burst:  3487
Time per packet: 54
####  with this patch ####
==== Cache line switch test ===
Time for 33554432 iterations = 12388791845 ticks
Ticks per iteration = 369
=== Performance test of distributor (single mode) ===
Time per burst:  5796
Time per packet: 90
=== Performance test of distributor (burst mode) ===
Time per burst:  3477
Time per packet: 54
From my test, there was a little bit of performance improvement (You can also think of it as a measurement bias) on x86. 
> cache line switch test with the patch, however the single mode and burst
> mode tests area showing no difference, which are the more important tests.
> What kind of differences are you seeing in the single/burst mode tests?
Actually, I found no difference in the single mode and burst mode on aarch64 neither. I think it means this test case is not the hotspot for those two mode's performance. 
Just like the __sync_xxx builtins, the __atomic_xxx builtins are atomic operations, which elide the memory barrier. So I think it should benefit all platform.
Thanks,
Phil
> 
> Rgds,
> Dave.
> 
> 
> ---snip---
> 
> 
^ permalink raw reply	[flat|nested] 56+ messages in thread 
- * Re: [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins
  2019-04-11 11:31     ` Phil Yang (Arm Technology China)
@ 2019-04-11 11:31       ` Phil Yang (Arm Technology China)
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-04-11 11:31 UTC (permalink / raw)
  To: Hunt, David, dev, thomas
  Cc: reshma.pattan, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd
> -----Original Message-----
> From: Hunt, David <david.hunt@intel.com>
> Sent: Wednesday, April 10, 2019 10:06 PM
> To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; dev@dpdk.org;
> thomas@monjalon.net
> Cc: reshma.pattan@intel.com; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: Re: [PATCH v4 2/3] test/distributor: replace sync builtins with atomic
> builtins
> 
> Hi Phil,
> 
> On 8/4/2019 4:02 AM, Phil Yang wrote:
> > '__sync' built-in functions are deprecated, should use the '__atomic'
> > built-in instead. the sync built-in functions are full barriers, while
> > atomic built-in functions offer less restrictive one-way barriers,
> > which help performance.
> >
> > Here is the example test result on TX2:
> > sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \ -n 4
> > --socket-mem=1024,1024 -- -i
> > RTE>>distributor_perf_autotest
> >
> > *** distributor_perf_autotest without this patch *** ==== Cache line
> > switch test === Time for 33554432 iterations = 1519202730 ticks Ticks
> > per iteration = 45
> >
> > *** distributor_perf_autotest with this patch *** ==== Cache line
> > switch test === Time for 33554432 iterations = 1251715496 ticks Ticks
> > per iteration = 37
> >
> > Less ticks needed for the cache line switch test. It got 17% of
> > performance improvement.
> 
> 
Hi, Dave
Thanks for your input.
> I'm seeing about an 8% performance degradation on my platform for the
I'd tested this patch on our x86 server (E5-2640 v3 @ 2.60GHz) several rounds. However, I didn't found performance degradation. Please check the test result below.
$ sudo  ./x86_64-native-linuxapp-gcc/app/test -l 8-15 -n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest
####  without this patch ####
==== Cache line switch test ===
Time for 33554432 iterations = 12379399910 ticks
Ticks per iteration = 368
=== Performance test of distributor (single mode) ===
Time per burst:  5815
Time per packet: 90
=== Performance test of distributor (burst mode) ===
Time per burst:  3487
Time per packet: 54
####  with this patch ####
==== Cache line switch test ===
Time for 33554432 iterations = 12388791845 ticks
Ticks per iteration = 369
=== Performance test of distributor (single mode) ===
Time per burst:  5796
Time per packet: 90
=== Performance test of distributor (burst mode) ===
Time per burst:  3477
Time per packet: 54
From my test, there was a little bit of performance improvement (You can also think of it as a measurement bias) on x86. 
> cache line switch test with the patch, however the single mode and burst
> mode tests area showing no difference, which are the more important tests.
> What kind of differences are you seeing in the single/burst mode tests?
Actually, I found no difference in the single mode and burst mode on aarch64 neither. I think it means this test case is not the hotspot for those two mode's performance. 
Just like the __sync_xxx builtins, the __atomic_xxx builtins are atomic operations, which elide the memory barrier. So I think it should benefit all platform.
Thanks,
Phil
> 
> Rgds,
> Dave.
> 
> 
> ---snip---
> 
> 
^ permalink raw reply	[flat|nested] 56+ messages in thread 
 
 
 
- * [dpdk-dev] [PATCH v4 3/3] test/ring_perf: replace sync builtins with atomic builtins
  2019-01-03  9:49 [dpdk-dev] [PATCH] packet_ordering: replace sync builtins with atomic builtins Phil Yang
                   ` (8 preceding siblings ...)
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 2/3] test/distributor: replace sync builtins with atomic builtins Phil Yang
@ 2019-04-08  3:02 ` Phil Yang
  2019-04-08  3:02   ` Phil Yang
  9 siblings, 1 reply; 56+ messages in thread
From: Phil Yang @ 2019-04-08  3:02 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -c 0x7fffffe \
-n 4 --socket-mem=1024,0 --file-prefix=~ -- -i
RTE>>ring_perf_autotest
*** ring_perf_autotest without this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.22
MP/MC bulk enq/dequeue (size: 8): 11.50
SP/SC bulk enq/dequeue (size: 32): 1.85
MP/MC bulk enq/dequeue (size: 32): 2.66
*** ring_perf_autotest with this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.13
MP/MC bulk enq/dequeue (size: 8): 9.83
SP/SC bulk enq/dequeue (size: 32): 1.96
MP/MC bulk enq/dequeue (size: 32): 2.30
So for the ring performance test, this patch improved 11% of ring
operations performance.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_ring_perf.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c
index ebb3939..e851c1a 100644
--- a/app/test/test_ring_perf.c
+++ b/app/test/test_ring_perf.c
@@ -160,7 +160,11 @@ enqueue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
@@ -196,7 +200,11 @@ dequeue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread
- * [dpdk-dev] [PATCH v4 3/3] test/ring_perf: replace sync builtins with atomic builtins
  2019-04-08  3:02 ` [dpdk-dev] [PATCH v4 3/3] test/ring_perf: " Phil Yang
@ 2019-04-08  3:02   ` Phil Yang
  0 siblings, 0 replies; 56+ messages in thread
From: Phil Yang @ 2019-04-08  3:02 UTC (permalink / raw)
  To: dev, thomas
  Cc: david.hunt, reshma.pattan, gavin.hu, honnappa.nagarahalli, phil.yang, nd
'__sync' built-in functions are deprecated, should use the '__atomic'
built-in instead. the sync built-in functions are full barriers, while
atomic built-in functions offer less restrictive one-way barriers,
which help performance.
Here is the example test result on TX2:
sudo ./arm64-armv8a-linuxapp-gcc/app/test -c 0x7fffffe \
-n 4 --socket-mem=1024,0 --file-prefix=~ -- -i
RTE>>ring_perf_autotest
*** ring_perf_autotest without this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.22
MP/MC bulk enq/dequeue (size: 8): 11.50
SP/SC bulk enq/dequeue (size: 32): 1.85
MP/MC bulk enq/dequeue (size: 32): 2.66
*** ring_perf_autotest with this patch ***
SP/SC bulk enq/dequeue (size: 8): 6.13
MP/MC bulk enq/dequeue (size: 8): 9.83
SP/SC bulk enq/dequeue (size: 32): 1.96
MP/MC bulk enq/dequeue (size: 32): 2.30
So for the ring performance test, this patch improved 11% of ring
operations performance.
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Joyce Kong <joyce.kong@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
---
 app/test/test_ring_perf.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c
index ebb3939..e851c1a 100644
--- a/app/test/test_ring_perf.c
+++ b/app/test/test_ring_perf.c
@@ -160,7 +160,11 @@ enqueue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
@@ -196,7 +200,11 @@ dequeue_bulk(void *p)
 	unsigned i;
 	void *burst[MAX_BURST] = {0};
 
-	if ( __sync_add_and_fetch(&lcore_count, 1) != 2 )
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
 		while(lcore_count != 2)
 			rte_pause();
 
-- 
2.7.4
^ permalink raw reply	[flat|nested] 56+ messages in thread