DPDK patches and discussions
* [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
@ 2019-06-27  9:37 Ruifeng Wang
  2019-06-27  9:37 ` [dpdk-dev] [PATCH v3 2/3] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Ruifeng Wang @ 2019-06-27  9:37 UTC (permalink / raw)
  To: bruce.richardson, vladimir.medvedkin
  Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang

Tests showed that function inlining caused a performance drop
on some x86 platforms with the memory ordering patches applied.
By forcing these functions not to be inlined, performance was better
than before on x86, with no impact on arm64 platforms.
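
For reference, __rte_noinline wraps the compiler's noinline attribute
(in DPDK it is defined in rte_common.h). A minimal standalone sketch of
what forcing no-inline means, with a hypothetical toy function:

#include <stdint.h>

/* Sketch of the macro, assuming GCC/clang attribute syntax. */
#define __rte_noinline __attribute__((noinline))

/* The attribute forbids inlining; a real call is always emitted. */
static __rte_noinline int32_t
square(int32_t x)
{
	return x * x;
}

int main(void)
{
	return square(3);
}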

Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com>
Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
v3: use __rte_noinline to force no inline
v2: initial version to remove 'inline' keyword

 lib/librte_lpm/rte_lpm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
index 6b7b28a2e..eb835f052 100644
--- a/lib/librte_lpm/rte_lpm.c
+++ b/lib/librte_lpm/rte_lpm.c
@@ -709,7 +709,7 @@ tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start)
 	tbl8[tbl8_group_start].valid_group = INVALID;
 }
 
-static inline int32_t
+static __rte_noinline int32_t
 add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
 		uint8_t next_hop)
 {
@@ -777,7 +777,7 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
 	return 0;
 }
 
-static inline int32_t
+static __rte_noinline int32_t
 add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
 		uint32_t next_hop)
 {
@@ -846,7 +846,7 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
 	return 0;
 }
 
-static inline int32_t
+static __rte_noinline int32_t
 add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 		uint8_t next_hop)
 {
@@ -971,7 +971,7 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 	return 0;
 }
 
-static inline int32_t
+static __rte_noinline int32_t
 add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
 		uint32_t next_hop)
 {
-- 
2.17.1



* [dpdk-dev] [PATCH v3 2/3] lib/lpm: memory orderings to avoid race conditions for v1604
  2019-06-27  9:37 [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions Ruifeng Wang
@ 2019-06-27  9:37 ` Ruifeng Wang
  2019-06-27  9:37 ` [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang
  2019-06-27 15:24 ` [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions Stephen Hemminger
  2 siblings, 0 replies; 23+ messages in thread
From: Ruifeng Wang @ 2019-06-27  9:37 UTC (permalink / raw)
  To: bruce.richardson, vladimir.medvedkin
  Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang

When a tbl8 group is being attached to a tbl24 entry, a lookup
might fail even though the entry is configured in the table.

For example, consider an LPM table configured with 10.10.10.1/24.
When a new entry 10.10.10.32/28 is being added, a new tbl8
group is allocated and the tbl24 entry is changed to point to
the tbl8 group. If the tbl24 entry is written before the tbl8
group entries are updated, a lookup on 10.10.10.9 will return
failure.

Correct memory orderings are required to ensure that the
store to tbl24 does not happen before the stores to tbl8 group
entries complete.

The orderings have an impact on the LPM performance tests.
On an Arm A72 platform, the delete operation shows 2.7% degradation,
while add/lookup shows no notable performance change.
On an x86 E5 platform, the add operation shows an 8% boost, delete
shows 1% degradation, and lookup shows no performance change.
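
To make the required pairing concrete, here is a minimal standalone
sketch (simplified types and names, not the actual LPM structures):

#include <stdint.h>

struct tbl_entry { uint32_t w; };

static struct tbl_entry tbl24[256];
static struct tbl_entry tbl8[256];

/* Writer: fill the tbl8 group first, then publish it. The release
 * ordering keeps the tbl8 stores from being reordered after the
 * publishing store to the tbl24 entry. */
static void publish_group(uint32_t tbl24_index, struct tbl_entry e)
{
	int i;

	for (i = 0; i < 256; i++)
		tbl8[i].w = 1;
	__atomic_store(&tbl24[tbl24_index], &e, __ATOMIC_RELEASE);
}

int main(void)
{
	struct tbl_entry e = { .w = 42 };

	publish_group(0, e);
	return 0;
}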

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
v3: updated test data
v2: no changes
v1: initial version

 lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++-------
 lib/librte_lpm/rte_lpm.h |  4 ++++
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
index eb835f052..fabd13fb0 100644
--- a/lib/librte_lpm/rte_lpm.c
+++ b/lib/librte_lpm/rte_lpm.c
@@ -806,7 +806,8 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
 			/* Setting tbl24 entry in one go to avoid race
 			 * conditions
 			 */
-			lpm->tbl24[i] = new_tbl24_entry;
+			__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
+					__ATOMIC_RELEASE);
 
 			continue;
 		}
@@ -1017,7 +1018,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
 			.depth = 0,
 		};
 
-		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		/* The tbl24 entry must be written only after the
+		 * tbl8 entries are written.
+		 */
+		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
+				__ATOMIC_RELEASE);
 
 	} /* If valid entry but not extended calculate the index into Table8. */
 	else if (lpm->tbl24[tbl24_index].valid_group == 0) {
@@ -1063,7 +1068,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
 				.depth = 0,
 		};
 
-		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		/* The tbl24 entry must be written only after the
+		 * tbl8 entries are written.
+		 */
+		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
+				__ATOMIC_RELEASE);
 
 	} else { /*
 		* If it is valid, extended entry calculate the index into tbl8.
@@ -1391,6 +1400,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 	/* Calculate the range and index into Table24. */
 	tbl24_range = depth_to_range(depth);
 	tbl24_index = (ip_masked >> 8);
+	struct rte_lpm_tbl_entry zero_tbl24_entry = {0};
 
 	/*
 	 * Firstly check the sub_rule_index. A -1 indicates no replacement rule
@@ -1405,7 +1415,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 
 			if (lpm->tbl24[i].valid_group == 0 &&
 					lpm->tbl24[i].depth <= depth) {
-				lpm->tbl24[i].valid = INVALID;
+				__atomic_store(&lpm->tbl24[i],
+					&zero_tbl24_entry, __ATOMIC_RELEASE);
 			} else if (lpm->tbl24[i].valid_group == 1) {
 				/*
 				 * If TBL24 entry is extended, then there has
@@ -1450,7 +1461,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 
 			if (lpm->tbl24[i].valid_group == 0 &&
 					lpm->tbl24[i].depth <= depth) {
-				lpm->tbl24[i] = new_tbl24_entry;
+				__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
+						__ATOMIC_RELEASE);
 			} else  if (lpm->tbl24[i].valid_group == 1) {
 				/*
 				 * If TBL24 entry is extended, then there has
@@ -1713,8 +1725,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 	tbl8_recycle_index = tbl8_recycle_check_v1604(lpm->tbl8, tbl8_group_start);
 
 	if (tbl8_recycle_index == -EINVAL) {
-		/* Set tbl24 before freeing tbl8 to avoid race condition. */
+		/* Set tbl24 before freeing tbl8 to avoid race condition.
+		 * Prevent the free of the tbl8 group from hoisting.
+		 */
 		lpm->tbl24[tbl24_index].valid = 0;
+		__atomic_thread_fence(__ATOMIC_RELEASE);
 		tbl8_free_v1604(lpm->tbl8, tbl8_group_start);
 	} else if (tbl8_recycle_index > -1) {
 		/* Update tbl24 entry. */
@@ -1725,8 +1740,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 			.depth = lpm->tbl8[tbl8_recycle_index].depth,
 		};
 
-		/* Set tbl24 before freeing tbl8 to avoid race condition. */
+		/* Set tbl24 before freeing tbl8 to avoid race condition.
+		 * Prevent the free of the tbl8 group from hoisting.
+		 */
 		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		__atomic_thread_fence(__ATOMIC_RELEASE);
 		tbl8_free_v1604(lpm->tbl8, tbl8_group_start);
 	}
 #undef group_idx
diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
index b886f54b4..6f5704c5c 100644
--- a/lib/librte_lpm/rte_lpm.h
+++ b/lib/librte_lpm/rte_lpm.h
@@ -354,6 +354,10 @@ rte_lpm_lookup(struct rte_lpm *lpm, uint32_t ip, uint32_t *next_hop)
 	ptbl = (const uint32_t *)(&lpm->tbl24[tbl24_index]);
 	tbl_entry = *ptbl;
 
+	/* Memory ordering is not required in lookup. Because dataflow
+	 * dependency exists, compiler or HW won't be able to re-order
+	 * the operations.
+	 */
 	/* Copy tbl8 entry (only if needed) */
 	if (unlikely((tbl_entry & RTE_LPM_VALID_EXT_ENTRY_BITMASK) ==
 			RTE_LPM_VALID_EXT_ENTRY_BITMASK)) {
-- 
2.17.1



* [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20
  2019-06-27  9:37 [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions Ruifeng Wang
  2019-06-27  9:37 ` [dpdk-dev] [PATCH v3 2/3] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang
@ 2019-06-27  9:37 ` Ruifeng Wang
  2019-06-28 13:33   ` Medvedkin, Vladimir
  2019-06-27 15:24 ` [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions Stephen Hemminger
  2 siblings, 1 reply; 23+ messages in thread
From: Ruifeng Wang @ 2019-06-27  9:37 UTC (permalink / raw)
  To: bruce.richardson, vladimir.medvedkin
  Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang

When a tbl8 group is being attached to a tbl24 entry, a lookup
might fail even though the entry is configured in the table.

For example, consider an LPM table configured with 10.10.10.1/24.
When a new entry 10.10.10.32/28 is being added, a new tbl8
group is allocated and the tbl24 entry is changed to point to
the tbl8 group. If the tbl24 entry is written before the tbl8
group entries are updated, a lookup on 10.10.10.9 will return
failure.

Correct memory orderings are required to ensure that the
store to tbl24 does not happen before the stores to tbl8 group
entries complete.

Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
v3: no changes
v2: fixed a clang build issue by supplying an alignment attribute.
v1: initial version

 lib/librte_lpm/rte_lpm.c | 31 ++++++++++++++++++++++++-------
 lib/librte_lpm/rte_lpm.h |  4 ++--
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
index fabd13fb0..5f8d494ae 100644
--- a/lib/librte_lpm/rte_lpm.c
+++ b/lib/librte_lpm/rte_lpm.c
@@ -737,7 +737,8 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
 			/* Setting tbl24 entry in one go to avoid race
 			 * conditions
 			 */
-			lpm->tbl24[i] = new_tbl24_entry;
+			__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
+					__ATOMIC_RELEASE);
 
 			continue;
 		}
@@ -892,7 +893,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 			.depth = 0,
 		};
 
-		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
+				__ATOMIC_RELEASE);
 
 	} /* If valid entry but not extended calculate the index into Table8. */
 	else if (lpm->tbl24[tbl24_index].valid_group == 0) {
@@ -938,7 +940,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 				.depth = 0,
 		};
 
-		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
+				__ATOMIC_RELEASE);
 
 	} else { /*
 		* If it is valid, extended entry calculate the index into tbl8.
@@ -1320,7 +1323,14 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 
 			if (lpm->tbl24[i].valid_group == 0 &&
 					lpm->tbl24[i].depth <= depth) {
-				lpm->tbl24[i].valid = INVALID;
+				struct rte_lpm_tbl_entry_v20 zero_tbl_entry = {
+						.valid = INVALID,
+						.depth = 0,
+						.valid_group = 0,
+					};
+					zero_tbl_entry.next_hop = 0;
+				__atomic_store(&lpm->tbl24[i], &zero_tbl_entry,
+						__ATOMIC_RELEASE);
 			} else if (lpm->tbl24[i].valid_group == 1) {
 				/*
 				 * If TBL24 entry is extended, then there has
@@ -1365,7 +1375,8 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 
 			if (lpm->tbl24[i].valid_group == 0 &&
 					lpm->tbl24[i].depth <= depth) {
-				lpm->tbl24[i] = new_tbl24_entry;
+				__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
+						__ATOMIC_RELEASE);
 			} else  if (lpm->tbl24[i].valid_group == 1) {
 				/*
 				 * If TBL24 entry is extended, then there has
@@ -1647,8 +1658,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 	tbl8_recycle_index = tbl8_recycle_check_v20(lpm->tbl8, tbl8_group_start);
 
 	if (tbl8_recycle_index == -EINVAL) {
-		/* Set tbl24 before freeing tbl8 to avoid race condition. */
+		/* Set tbl24 before freeing tbl8 to avoid race condition.
+		 * Prevent the free of the tbl8 group from hoisting.
+		 */
 		lpm->tbl24[tbl24_index].valid = 0;
+		__atomic_thread_fence(__ATOMIC_RELEASE);
 		tbl8_free_v20(lpm->tbl8, tbl8_group_start);
 	} else if (tbl8_recycle_index > -1) {
 		/* Update tbl24 entry. */
@@ -1659,8 +1673,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 			.depth = lpm->tbl8[tbl8_recycle_index].depth,
 		};
 
-		/* Set tbl24 before freeing tbl8 to avoid race condition. */
+		/* Set tbl24 before freeing tbl8 to avoid race condition.
+		 * Prevent the free of the tbl8 group from hoisting.
+		 */
 		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		__atomic_thread_fence(__ATOMIC_RELEASE);
 		tbl8_free_v20(lpm->tbl8, tbl8_group_start);
 	}
 
diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
index 6f5704c5c..98c70ecbe 100644
--- a/lib/librte_lpm/rte_lpm.h
+++ b/lib/librte_lpm/rte_lpm.h
@@ -88,7 +88,7 @@ struct rte_lpm_tbl_entry_v20 {
 	 */
 	uint8_t valid_group :1;
 	uint8_t depth       :6; /**< Rule depth. */
-};
+} __rte_aligned(2);
 
 __extension__
 struct rte_lpm_tbl_entry {
@@ -121,7 +121,7 @@ struct rte_lpm_tbl_entry_v20 {
 		uint8_t group_idx;
 		uint8_t next_hop;
 	};
-};
+} __rte_aligned(2);
 
 __extension__
 struct rte_lpm_tbl_entry {
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-27  9:37 [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions Ruifeng Wang
  2019-06-27  9:37 ` [dpdk-dev] [PATCH v3 2/3] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang
  2019-06-27  9:37 ` [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang
@ 2019-06-27 15:24 ` Stephen Hemminger
  2019-06-28  2:44   ` Ruifeng Wang (Arm Technology China)
  2019-06-28 13:38   ` Medvedkin, Vladimir
  2 siblings, 2 replies; 23+ messages in thread
From: Stephen Hemminger @ 2019-06-27 15:24 UTC (permalink / raw)
  To: Ruifeng Wang
  Cc: bruce.richardson, vladimir.medvedkin, dev, honnappa.nagarahalli,
	gavin.hu, nd

On Thu, 27 Jun 2019 17:37:49 +0800
Ruifeng Wang <ruifeng.wang@arm.com> wrote:

> Tests showed that function inlining caused a performance drop
> on some x86 platforms with the memory ordering patches applied.
> By forcing these functions not to be inlined, performance was better
> than before on x86, with no impact on arm64 platforms.
> 
> Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>

Do you actually need to force noinline, or is just dropping the inline enough?
In general, letting the compiler decide is often best practice.


* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-27 15:24 ` [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions Stephen Hemminger
@ 2019-06-28  2:44   ` Ruifeng Wang (Arm Technology China)
  2019-06-28  4:34     ` Stephen Hemminger
  2019-06-28 13:38   ` Medvedkin, Vladimir
  1 sibling, 1 reply; 23+ messages in thread
From: Ruifeng Wang (Arm Technology China) @ 2019-06-28  2:44 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: bruce.richardson, vladimir.medvedkin, dev, Honnappa Nagarahalli,
	Gavin Hu (Arm Technology China),
	nd, nd

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, June 27, 2019 23:25
> To: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com>
> Cc: bruce.richardson@intel.com; vladimir.medvedkin@intel.com;
> dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary
> functions
> 
> On Thu, 27 Jun 2019 17:37:49 +0800
> Ruifeng Wang <ruifeng.wang@arm.com> wrote:
> 
> > Tests showed that function inlining caused a performance drop on
> > some x86 platforms with the memory ordering patches applied.
> > By forcing these functions not to be inlined, performance was better
> > than before on x86, with no impact on arm64 platforms.
> >
> > Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com>
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> 
> Do you actually need to force noinline, or is just dropping the inline enough?
> In general, letting the compiler decide is often best practice.

Forcing noinline is an optimization for x86 platforms to keep the rte_lpm_add() API
performance with the memory ordering changes applied.



* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-28  2:44   ` Ruifeng Wang (Arm Technology China)
@ 2019-06-28  4:34     ` Stephen Hemminger
  2019-06-28  5:48       ` Ruifeng Wang (Arm Technology China)
  2019-06-28 13:47       ` Medvedkin, Vladimir
  0 siblings, 2 replies; 23+ messages in thread
From: Stephen Hemminger @ 2019-06-28  4:34 UTC (permalink / raw)
  To: Ruifeng Wang (Arm Technology China)
  Cc: bruce.richardson, vladimir.medvedkin, dev, Honnappa Nagarahalli,
	Gavin Hu (Arm Technology China),
	nd

On Fri, 28 Jun 2019 02:44:54 +0000
"Ruifeng Wang (Arm Technology China)" <Ruifeng.Wang@arm.com> wrote:

> >   
> > > Tests showed that function inlining caused a performance drop on
> > > some x86 platforms with the memory ordering patches applied.
> > > By forcing these functions not to be inlined, performance was better
> > > than before on x86, with no impact on arm64 platforms.
> > >
> > > Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com>
> > > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>  
> > 
> > Do you actually need to force noinline, or is just dropping the inline enough?
> > In general, letting the compiler decide is often best practice.
> 
> Forcing noinline is an optimization for x86 platforms to keep the rte_lpm_add() API
> performance with the memory ordering changes applied.

I don't think you answered my question. What does a recent version of GCC do
if you drop the inline?

Actually, all the functions in rte_lpm should drop inline.

diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
index 6b7b28a2e431..ffe07e980864 100644
--- a/lib/librte_lpm/rte_lpm.c
+++ b/lib/librte_lpm/rte_lpm.c
@@ -399,7 +399,7 @@ MAP_STATIC_SYMBOL(void rte_lpm_free(struct rte_lpm *lpm),
  * are stored in the rule table from 0 - 31.
  * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
  */
-static inline int32_t
+static int32_t
 rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 	uint8_t next_hop)
 {
@@ -471,7 +471,7 @@ rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 	return rule_index;
 }
 
-static inline int32_t
+static int32_t
 rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
 	uint32_t next_hop)
 {
@@ -547,7 +547,7 @@ rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
  * Delete a rule from the rule table.
  * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
  */
-static inline void
+static void
 rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t depth)
 {
 	int i;
@@ -570,7 +570,7 @@ rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t depth)
 	lpm->rule_info[depth - 1].used_rules--;
 }
 
-static inline void
+static void
 rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth)
 {
 	int i;
@@ -597,7 +597,7 @@ rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth)
  * Finds a rule in rule table.
  * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
  */
-static inline int32_t
+static int32_t
 rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth)
 {
 	uint32_t rule_gindex, last_rule, rule_index;
@@ -618,7 +618,7 @@ rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth)
 	return -EINVAL;
 }
 
-static inline int32_t
+static int32_t
 rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)
 {
 	uint32_t rule_gindex, last_rule, rule_index;
@@ -642,7 +642,7 @@ rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)
 /*
  * Find, clean and allocate a tbl8.
  */
-static inline int32_t
+static int32_t
 tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
 {
 	uint32_t group_idx; /* tbl8 group index. */
@@ -669,7 +669,7 @@ tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
 	return -ENOSPC;
 }
 
-static inline int32_t
+static int32_t
 tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s)
 {
 	uint32_t group_idx; /* tbl8 group index. */
@@ -709,7 +709,7 @@ tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start)
 	tbl8[tbl8_group_start].valid_group = INVALID;
 }
 
-static inline int32_t
+static int32_t
 add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
 		uint8_t next_hop)
 {
@@ -777,7 +777,7 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
 	return 0;
 }
 
-static inline int32_t
+static int32_t
 add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
 		uint32_t next_hop)
 {
@@ -846,7 +846,7 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
 	return 0;
 }
 
-static inline int32_t
+static int32_t
 add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 		uint8_t next_hop)
 {
@@ -971,7 +971,7 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 	return 0;
 }
 
-static inline int32_t
+static int32_t
 add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
 		uint32_t next_hop)
 {
@@ -1244,7 +1244,7 @@ BIND_DEFAULT_SYMBOL(rte_lpm_is_rule_present, _v1604, 16.04);
 MAP_STATIC_SYMBOL(int rte_lpm_is_rule_present(struct rte_lpm *lpm, uint32_t ip,
 		uint8_t depth, uint32_t *next_hop), rte_lpm_is_rule_present_v1604);
 
-static inline int32_t
+static int32_t
 find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
 		uint8_t *sub_rule_depth)
 {
@@ -1266,7 +1266,7 @@ find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
 	return -1;
 }
 
-static inline int32_t
+static int32_t
 find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
 		uint8_t *sub_rule_depth)
 {
@@ -1288,7 +1288,7 @@ find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
 	return -1;
 }
 
-static inline int32_t
+static int32_t
 delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
 {
@@ -1381,7 +1381,7 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 	return 0;
 }
 
-static inline int32_t
+static int32_t
 delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
 {
@@ -1483,7 +1483,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
  * Return of value > -1 means tbl8 is in use but has all the same values and
  * thus can be recycled
  */
-static inline int32_t
+static int32_t
 tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8,
 		uint32_t tbl8_group_start)
 {
@@ -1530,7 +1530,7 @@ tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8,
 	return -EINVAL;
 }
 
-static inline int32_t
+static int32_t
 tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8,
 		uint32_t tbl8_group_start)
 {
@@ -1577,7 +1577,7 @@ tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8,
 	return -EINVAL;
 }
 
-static inline int32_t
+static int32_t
 delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
 {
@@ -1655,7 +1655,7 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 	return 0;
 }
 
-static inline int32_t
+static int32_t
 delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
 {


* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-28  4:34     ` Stephen Hemminger
@ 2019-06-28  5:48       ` Ruifeng Wang (Arm Technology China)
  2019-06-28 13:47       ` Medvedkin, Vladimir
  1 sibling, 0 replies; 23+ messages in thread
From: Ruifeng Wang (Arm Technology China) @ 2019-06-28  5:48 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: bruce.richardson, vladimir.medvedkin, dev, Honnappa Nagarahalli,
	Gavin Hu (Arm Technology China),
	nd, nd

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Friday, June 28, 2019 12:35
> To: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com>
> Cc: bruce.richardson@intel.com; vladimir.medvedkin@intel.com;
> dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary
> functions
> 
> On Fri, 28 Jun 2019 02:44:54 +0000
> "Ruifeng Wang (Arm Technology China)" <Ruifeng.Wang@arm.com> wrote:
> 
> > >
> > > > Tests showed that function inlining caused a performance drop on
> > > > some x86 platforms with the memory ordering patches applied.
> > > > By forcing these functions not to be inlined, performance was better
> > > > than before on x86, with no impact on arm64 platforms.
> > > >
> > > > Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com>
> > > > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > >
> > > Do you actually need to force noinline, or is just dropping the inline enough?
> > > In general, letting the compiler decide is often best practice.
> >
> > Forcing noinline is an optimization for x86 platforms to keep the
> > rte_lpm_add() API performance with the memory ordering changes applied.
> 
> I don't think you answered my question. What does a recent version of GCC
> do if you drop the inline?
> 
GCC still inlines these functions when 'inline' is dropped, so we need to force
noinline. See Vladimir's comment in http://patches.dpdk.org/patch/54936/.

> Actually, all the functions in rte_lpm should drop inline.
> 
I don't know if we have a guideline on the use of 'inline'.
In general, should 'inline' not be used, leaving it to the compiler to decide?
Is it reasonable to use force inline or force noinline for performance tuning?
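
As a side note, here is a minimal sketch of the three options being
discussed, using a hypothetical toy function (GCC semantics assumed):

#include <stdint.h>

/* 'inline' is only a hint; the compiler may still ignore it. */
static inline int32_t add_hint(int32_t a, int32_t b) { return a + b; }

/* No qualifier: the compiler's heuristics decide, and small static
 * functions are often inlined anyway, as observed above. */
static int32_t add_plain(int32_t a, int32_t b) { return a + b; }

/* noinline attribute (what __rte_noinline expands to): a real call
 * is always emitted. */
static __attribute__((noinline)) int32_t add_forced(int32_t a, int32_t b)
{
	return a + b;
}

int main(void)
{
	return add_hint(1, 2) + add_plain(3, 4) + add_forced(5, 6);
}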

> diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index
> 6b7b28a2e431..ffe07e980864 100644
> --- a/lib/librte_lpm/rte_lpm.c
> +++ b/lib/librte_lpm/rte_lpm.c
> @@ -399,7 +399,7 @@ MAP_STATIC_SYMBOL(void rte_lpm_free(struct
> rte_lpm *lpm),
>   * are stored in the rule table from 0 - 31.
>   * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
>   */
> -static inline int32_t
> +static int32_t
>  rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
>  	uint8_t next_hop)
>  {
> @@ -471,7 +471,7 @@ rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t
> ip_masked, uint8_t depth,
>  	return rule_index;
>  }
> 
> -static inline int32_t
> +static int32_t
>  rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
>  	uint32_t next_hop)
>  {
> @@ -547,7 +547,7 @@ rule_add_v1604(struct rte_lpm *lpm, uint32_t
> ip_masked, uint8_t depth,
>   * Delete a rule from the rule table.
>   * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
>   */
> -static inline void
> +static void
>  rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t depth)
> {
>  	int i;
> @@ -570,7 +570,7 @@ rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t
> rule_index, uint8_t depth)
>  	lpm->rule_info[depth - 1].used_rules--;  }
> 
> -static inline void
> +static void
>  rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth)  {
>  	int i;
> @@ -597,7 +597,7 @@ rule_delete_v1604(struct rte_lpm *lpm, int32_t
> rule_index, uint8_t depth)
>   * Finds a rule in rule table.
>   * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
>   */
> -static inline int32_t
> +static int32_t
>  rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth)
> {
>  	uint32_t rule_gindex, last_rule, rule_index; @@ -618,7 +618,7 @@
> rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth)
>  	return -EINVAL;
>  }
> 
> -static inline int32_t
> +static int32_t
>  rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)  {
>  	uint32_t rule_gindex, last_rule, rule_index; @@ -642,7 +642,7 @@
> rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)
>  /*
>   * Find, clean and allocate a tbl8.
>   */
> -static inline int32_t
> +static int32_t
>  tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)  {
>  	uint32_t group_idx; /* tbl8 group index. */ @@ -669,7 +669,7 @@
> tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
>  	return -ENOSPC;
>  }
> 
> -static inline int32_t
> +static int32_t
>  tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s)  {
>  	uint32_t group_idx; /* tbl8 group index. */ @@ -709,7 +709,7 @@
> tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start)
>  	tbl8[tbl8_group_start].valid_group = INVALID;  }
> 
> -static inline int32_t
> +static int32_t
>  add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
>  		uint8_t next_hop)
>  {
> @@ -777,7 +777,7 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip, uint8_t depth,
>  	return 0;
>  }
> 
> -static inline int32_t
> +static int32_t
>  add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
>  		uint32_t next_hop)
>  {
> @@ -846,7 +846,7 @@ add_depth_small_v1604(struct rte_lpm *lpm,
> uint32_t ip, uint8_t depth,
>  	return 0;
>  }
> 
> -static inline int32_t
> +static int32_t
>  add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t
> depth,
>  		uint8_t next_hop)
>  {
> @@ -971,7 +971,7 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip_masked, uint8_t depth,
>  	return 0;
>  }
> 
> -static inline int32_t
> +static int32_t
>  add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t
> depth,
>  		uint32_t next_hop)
>  {
> @@ -1244,7 +1244,7 @@ BIND_DEFAULT_SYMBOL(rte_lpm_is_rule_present,
> _v1604, 16.04);  MAP_STATIC_SYMBOL(int rte_lpm_is_rule_present(struct
> rte_lpm *lpm, uint32_t ip,
>  		uint8_t depth, uint32_t *next_hop),
> rte_lpm_is_rule_present_v1604);
> 
> -static inline int32_t
> +static int32_t
>  find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
>  		uint8_t *sub_rule_depth)
>  {
> @@ -1266,7 +1266,7 @@ find_previous_rule_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip, uint8_t depth,
>  	return -1;
>  }
> 
> -static inline int32_t
> +static int32_t
>  find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
>  		uint8_t *sub_rule_depth)
>  {
> @@ -1288,7 +1288,7 @@ find_previous_rule_v1604(struct rte_lpm *lpm,
> uint32_t ip, uint8_t depth,
>  	return -1;
>  }
> 
> -static inline int32_t
> +static int32_t
>  delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>  	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
> { @@ -1381,7 +1381,7 @@ delete_depth_small_v20(struct rte_lpm_v20
> *lpm, uint32_t ip_masked,
>  	return 0;
>  }
> 
> -static inline int32_t
> +static int32_t
>  delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>  	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
> { @@ -1483,7 +1483,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm,
> uint32_t ip_masked,
>   * Return of value > -1 means tbl8 is in use but has all the same values and
>   * thus can be recycled
>   */
> -static inline int32_t
> +static int32_t
>  tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8,
>  		uint32_t tbl8_group_start)
>  {
> @@ -1530,7 +1530,7 @@ tbl8_recycle_check_v20(struct
> rte_lpm_tbl_entry_v20 *tbl8,
>  	return -EINVAL;
>  }
> 
> -static inline int32_t
> +static int32_t
>  tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8,
>  		uint32_t tbl8_group_start)
>  {
> @@ -1577,7 +1577,7 @@ tbl8_recycle_check_v1604(struct
> rte_lpm_tbl_entry *tbl8,
>  	return -EINVAL;
>  }
> 
> -static inline int32_t
> +static int32_t
>  delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>  	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
> { @@ -1655,7 +1655,7 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip_masked,
>  	return 0;
>  }
> 
> -static inline int32_t
> +static int32_t
>  delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>  	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)  {


* Re: [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20
  2019-06-27  9:37 ` [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang
@ 2019-06-28 13:33   ` Medvedkin, Vladimir
  2019-06-29 17:35     ` Honnappa Nagarahalli
  2019-07-01  7:08     ` Ruifeng Wang (Arm Technology China)
  0 siblings, 2 replies; 23+ messages in thread
From: Medvedkin, Vladimir @ 2019-06-28 13:33 UTC (permalink / raw)
  To: Ruifeng Wang, bruce.richardson; +Cc: dev, honnappa.nagarahalli, gavin.hu, nd

Hi Wang,

On 27/06/2019 10:37, Ruifeng Wang wrote:
> When a tbl8 group is being attached to a tbl24 entry, a lookup
> might fail even though the entry is configured in the table.
>
> For example, consider an LPM table configured with 10.10.10.1/24.
> When a new entry 10.10.10.32/28 is being added, a new tbl8
> group is allocated and the tbl24 entry is changed to point to
> the tbl8 group. If the tbl24 entry is written before the tbl8
> group entries are updated, a lookup on 10.10.10.9 will return
> failure.
>
> Correct memory orderings are required to ensure that the
> store to tbl24 does not happen before the stores to tbl8 group
> entries complete.
>
> Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> ---
> v3: no changes
> v2: fixed a clang build issue by supplying an alignment attribute.
> v1: initial version
>
>   lib/librte_lpm/rte_lpm.c | 31 ++++++++++++++++++++++++-------
>   lib/librte_lpm/rte_lpm.h |  4 ++--
>   2 files changed, 26 insertions(+), 9 deletions(-)
>
> diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
> index fabd13fb0..5f8d494ae 100644
> --- a/lib/librte_lpm/rte_lpm.c
> +++ b/lib/librte_lpm/rte_lpm.c
> @@ -737,7 +737,8 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
>   			/* Setting tbl24 entry in one go to avoid race
>   			 * conditions
>   			 */
> -			lpm->tbl24[i] = new_tbl24_entry;
> +			__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
> +					__ATOMIC_RELEASE);
>   
>   			continue;
>   		}
> @@ -892,7 +893,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
>   			.depth = 0,
>   		};
>   
> -		lpm->tbl24[tbl24_index] = new_tbl24_entry;
> +		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
> +				__ATOMIC_RELEASE);
>   
>   	} /* If valid entry but not extended calculate the index into Table8. */
>   	else if (lpm->tbl24[tbl24_index].valid_group == 0) {
> @@ -938,7 +940,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
>   				.depth = 0,
>   		};
>   
> -		lpm->tbl24[tbl24_index] = new_tbl24_entry;
> +		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
> +				__ATOMIC_RELEASE);
>   
>   	} else { /*
>   		* If it is valid, extended entry calculate the index into tbl8.
> @@ -1320,7 +1323,14 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>   
>   			if (lpm->tbl24[i].valid_group == 0 &&
>   					lpm->tbl24[i].depth <= depth) {
> -				lpm->tbl24[i].valid = INVALID;
> +				struct rte_lpm_tbl_entry_v20 zero_tbl_entry = {
> +						.valid = INVALID,
> +						.depth = 0,
> +						.valid_group = 0,
> +					};
> +					zero_tbl_entry.next_hop = 0;

Please use the same variable name in both v20 and v1604 (zero_tbl24_entry).
The same applies to the struct initialization. In v1604 you use:

struct rte_lpm_tbl_entry zero_tbl24_entry = {0};

> +				__atomic_store(&lpm->tbl24[i], &zero_tbl_entry,
> +						__ATOMIC_RELEASE);
>   			} else if (lpm->tbl24[i].valid_group == 1) {
>   				/*
>   				 * If TBL24 entry is extended, then there has
> @@ -1365,7 +1375,8 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>   
>   			if (lpm->tbl24[i].valid_group == 0 &&
>   					lpm->tbl24[i].depth <= depth) {
> -				lpm->tbl24[i] = new_tbl24_entry;
> +				__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
> +						__ATOMIC_RELEASE);
>   			} else  if (lpm->tbl24[i].valid_group == 1) {
>   				/*
>   				 * If TBL24 entry is extended, then there has
> @@ -1647,8 +1658,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>   	tbl8_recycle_index = tbl8_recycle_check_v20(lpm->tbl8, tbl8_group_start);
>   
>   	if (tbl8_recycle_index == -EINVAL) {
> -		/* Set tbl24 before freeing tbl8 to avoid race condition. */
> +		/* Set tbl24 before freeing tbl8 to avoid race condition.
> +		 * Prevent the free of the tbl8 group from hoisting.
> +		 */
>   		lpm->tbl24[tbl24_index].valid = 0;
> +		__atomic_thread_fence(__ATOMIC_RELEASE);
>   		tbl8_free_v20(lpm->tbl8, tbl8_group_start);
>   	} else if (tbl8_recycle_index > -1) {
>   		/* Update tbl24 entry. */
> @@ -1659,8 +1673,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>   			.depth = lpm->tbl8[tbl8_recycle_index].depth,
>   		};
>   
> -		/* Set tbl24 before freeing tbl8 to avoid race condition. */
> +		/* Set tbl24 before freeing tbl8 to avoid race condition.
> +		 * Prevent the free of the tbl8 group from hoisting.
> +		 */
>   		lpm->tbl24[tbl24_index] = new_tbl24_entry;
> +		__atomic_thread_fence(__ATOMIC_RELEASE);
>   		tbl8_free_v20(lpm->tbl8, tbl8_group_start);
>   	}
>   
> diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
> index 6f5704c5c..98c70ecbe 100644
> --- a/lib/librte_lpm/rte_lpm.h
> +++ b/lib/librte_lpm/rte_lpm.h
> @@ -88,7 +88,7 @@ struct rte_lpm_tbl_entry_v20 {
>   	 */
>   	uint8_t valid_group :1;
>   	uint8_t depth       :6; /**< Rule depth. */
> -};
> +} __rte_aligned(2);

I think it would be better to use __rte_aligned(sizeof(uint16_t)).

>   
>   __extension__
>   struct rte_lpm_tbl_entry {
> @@ -121,7 +121,7 @@ struct rte_lpm_tbl_entry_v20 {
>   		uint8_t group_idx;
>   		uint8_t next_hop;
>   	};
> -};
> +} __rte_aligned(2);
>   
>   __extension__
>   struct rte_lpm_tbl_entry {

As a general remark, consider writing all of the tbl entries, including
tbl8, with __atomic_store. Right now "lpm->tbl8[j] = new_tbl8_entry;" looks like:

      1e9:	44 88 9c 47 40 01 00 02	mov    %r11b,0x2000140(%rdi,%rax,2)	<- write first byte
      1f1:	48 83 c0 01            	add    $0x1,%rax
      1f5:	42 88 8c 47 41 01 00 02	mov    %cl,0x2000141(%rdi,%r8,2)	<- write second byte

This may cause an incorrect next hop to be returned. If the byte with the
valid flag is updated first, the old (and maybe invalid) next hop could
be returned.

Please evaluate the performance drop after making this change.
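
A minimal sketch of what such a change could look like (hypothetical
helper; the entry layout is reduced to the two bytes that matter):

#include <stdint.h>

struct tbl8_entry {
	uint8_t next_hop;
	uint8_t flags; /* valid/valid_group/depth bits, simplified */
} __attribute__((aligned(2)));

/* Store the whole 16-bit entry with a single atomic access instead of
 * a plain struct assignment, which the compiler may split into two
 * one-byte stores. RELAXED is sufficient for entries inside a group;
 * the publishing release store on the tbl24 entry orders the group as
 * a whole. */
static void tbl8_entry_store(struct tbl8_entry *dst, struct tbl8_entry e)
{
	__atomic_store(dst, &e, __ATOMIC_RELAXED);
}

int main(void)
{
	static struct tbl8_entry tbl8[256];
	struct tbl8_entry e = { .next_hop = 1, .flags = 1 };

	tbl8_entry_store(&tbl8[0], e);
	return 0;
}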

-- 
Regards,
Vladimir



* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-27 15:24 ` [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions Stephen Hemminger
  2019-06-28  2:44   ` Ruifeng Wang (Arm Technology China)
@ 2019-06-28 13:38   ` Medvedkin, Vladimir
  1 sibling, 0 replies; 23+ messages in thread
From: Medvedkin, Vladimir @ 2019-06-28 13:38 UTC (permalink / raw)
  To: Stephen Hemminger, Ruifeng Wang
  Cc: bruce.richardson, dev, honnappa.nagarahalli, gavin.hu, nd

Hi Stephen,

On 27/06/2019 16:24, Stephen Hemminger wrote:
> On Thu, 27 Jun 2019 17:37:49 +0800
> Ruifeng Wang <ruifeng.wang@arm.com> wrote:
>
>> Tests showed that function inlining caused a performance drop
>> on some x86 platforms with the memory ordering patches applied.
>> By forcing these functions not to be inlined, performance was better
>> than before on x86, with no impact on arm64 platforms.
>>
>> Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com>
>> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
>> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
>
> Do you actually need to force noinline, or is just dropping the inline enough?
> In general, letting the compiler decide is often best practice.
In some cases the compiler inlines these functions even if they don't have an
inline qualifier. On some processors this leads to a performance drop (maybe
because of icache thrashing).

-- 
Regards,
Vladimir



* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-28  4:34     ` Stephen Hemminger
  2019-06-28  5:48       ` Ruifeng Wang (Arm Technology China)
@ 2019-06-28 13:47       ` Medvedkin, Vladimir
  2019-06-28 13:57         ` Honnappa Nagarahalli
  1 sibling, 1 reply; 23+ messages in thread
From: Medvedkin, Vladimir @ 2019-06-28 13:47 UTC (permalink / raw)
  To: Stephen Hemminger, Ruifeng Wang (Arm Technology China)
  Cc: bruce.richardson, dev, Honnappa Nagarahalli,
	Gavin Hu (Arm Technology China),
	nd

Hi all,

On 28/06/2019 05:34, Stephen Hemminger wrote:
> On Fri, 28 Jun 2019 02:44:54 +0000
> "Ruifeng Wang (Arm Technology China)"<Ruifeng.Wang@arm.com>  wrote:
>
>>>    
>>>> Tests showed that function inlining caused a performance drop on
>>>> some x86 platforms with the memory ordering patches applied.
>>>> By forcing these functions not to be inlined, performance was better
>>>> than before on x86, with no impact on arm64 platforms.
>>>>
>>>> Suggested-by: Medvedkin Vladimir<vladimir.medvedkin@intel.com>
>>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
>>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>   
>>>
>>> Do you actually need to force noinline, or is just dropping the inline enough?
>>> In general, letting the compiler decide is often best practice.
>> Forcing noinline is an optimization for x86 platforms to keep the rte_lpm_add() API
>> performance with the memory ordering changes applied.
> I don't think you answered my question. What does a recent version of GCC do
> if you drop the inline?
>
> Actually, all the functions in rte_lpm should drop inline.
I agree with Stephen. If it is not on the fast path and the function is
not minimal in size, it is good to remove the inline qualifier from the other
control plane functions such as rule_add/delete/find/etc. and let the compiler
decide whether to inline them (unless it affects performance).
> diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
> index 6b7b28a2e431..ffe07e980864 100644
> --- a/lib/librte_lpm/rte_lpm.c
> +++ b/lib/librte_lpm/rte_lpm.c
> @@ -399,7 +399,7 @@ MAP_STATIC_SYMBOL(void rte_lpm_free(struct rte_lpm *lpm),
>    * are stored in the rule table from 0 - 31.
>    * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
>    */
> -static inline int32_t
> +static int32_t
>   rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
>   	uint8_t next_hop)
>   {
> @@ -471,7 +471,7 @@ rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
>   	return rule_index;
>   }
>   
> -static inline int32_t
> +static int32_t
>   rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
>   	uint32_t next_hop)
>   {
> @@ -547,7 +547,7 @@ rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
>    * Delete a rule from the rule table.
>    * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
>    */
> -static inline void
> +static void
>   rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t depth)
>   {
>   	int i;
> @@ -570,7 +570,7 @@ rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t depth)
>   	lpm->rule_info[depth - 1].used_rules--;
>   }
>   
> -static inline void
> +static void
>   rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth)
>   {
>   	int i;
> @@ -597,7 +597,7 @@ rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth)
>    * Finds a rule in rule table.
>    * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
>    */
> -static inline int32_t
> +static int32_t
>   rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth)
>   {
>   	uint32_t rule_gindex, last_rule, rule_index;
> @@ -618,7 +618,7 @@ rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth)
>   	return -EINVAL;
>   }
>   
> -static inline int32_t
> +static int32_t
>   rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)
>   {
>   	uint32_t rule_gindex, last_rule, rule_index;
> @@ -642,7 +642,7 @@ rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)
>   /*
>    * Find, clean and allocate a tbl8.
>    */
> -static inline int32_t
> +static int32_t
>   tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
>   {
>   	uint32_t group_idx; /* tbl8 group index. */
> @@ -669,7 +669,7 @@ tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
>   	return -ENOSPC;
>   }
>   
> -static inline int32_t
> +static int32_t
>   tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s)
>   {
>   	uint32_t group_idx; /* tbl8 group index. */
> @@ -709,7 +709,7 @@ tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start)
>   	tbl8[tbl8_group_start].valid_group = INVALID;
>   }
>   
> -static inline int32_t
> +static int32_t
>   add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
>   		uint8_t next_hop)
>   {
> @@ -777,7 +777,7 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
>   	return 0;
>   }
>   
> -static inline int32_t
> +static int32_t
>   add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
>   		uint32_t next_hop)
>   {
> @@ -846,7 +846,7 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
>   	return 0;
>   }
>   
> -static inline int32_t
> +static int32_t
>   add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
>   		uint8_t next_hop)
>   {
> @@ -971,7 +971,7 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
>   	return 0;
>   }
>   
> -static inline int32_t
> +static int32_t
>   add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
>   		uint32_t next_hop)
>   {
> @@ -1244,7 +1244,7 @@ BIND_DEFAULT_SYMBOL(rte_lpm_is_rule_present, _v1604, 16.04);
>   MAP_STATIC_SYMBOL(int rte_lpm_is_rule_present(struct rte_lpm *lpm, uint32_t ip,
>   		uint8_t depth, uint32_t *next_hop), rte_lpm_is_rule_present_v1604);
>   
> -static inline int32_t
> +static int32_t
>   find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
>   		uint8_t *sub_rule_depth)
>   {
> @@ -1266,7 +1266,7 @@ find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
>   	return -1;
>   }
>   
> -static inline int32_t
> +static int32_t
>   find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
>   		uint8_t *sub_rule_depth)
>   {
> @@ -1288,7 +1288,7 @@ find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
>   	return -1;
>   }
>   
> -static inline int32_t
> +static int32_t
>   delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>   	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
>   {
> @@ -1381,7 +1381,7 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>   	return 0;
>   }
>   
> -static inline int32_t
> +static int32_t
>   delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>   	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
>   {
> @@ -1483,7 +1483,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>    * Return of value > -1 means tbl8 is in use but has all the same values and
>    * thus can be recycled
>    */
> -static inline int32_t
> +static int32_t
>   tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8,
>   		uint32_t tbl8_group_start)
>   {
> @@ -1530,7 +1530,7 @@ tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8,
>   	return -EINVAL;
>   }
>   
> -static inline int32_t
> +static int32_t
>   tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8,
>   		uint32_t tbl8_group_start)
>   {
> @@ -1577,7 +1577,7 @@ tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8,
>   	return -EINVAL;
>   }
>   
> -static inline int32_t
> +static int32_t
>   delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>   	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
>   {
> @@ -1655,7 +1655,7 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>   	return 0;
>   }
>   
> -static inline int32_t
> +static int32_t
>   delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>   	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
>   {

-- 
Regards,
Vladimir



* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-28 13:47       ` Medvedkin, Vladimir
@ 2019-06-28 13:57         ` Honnappa Nagarahalli
  2019-06-28 14:16           ` Medvedkin, Vladimir
  0 siblings, 1 reply; 23+ messages in thread
From: Honnappa Nagarahalli @ 2019-06-28 13:57 UTC (permalink / raw)
  To: Medvedkin, Vladimir, Stephen Hemminger,
	Ruifeng Wang (Arm Technology China)
  Cc: bruce.richardson, dev, Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd

> Hi all,
> 
> On 28/06/2019 05:34, Stephen Hemminger wrote:
> > On Fri, 28 Jun 2019 02:44:54 +0000
> > "Ruifeng Wang (Arm Technology China)"<Ruifeng.Wang@arm.com>  wrote:
> >
> >>>
> >>>> Tests showed that function inlining caused a performance drop on
> >>>> some x86 platforms with the memory ordering patches applied.
> >>>> By forcing these functions not to be inlined, performance was better
> >>>> than before on x86, with no impact on arm64 platforms.
> >>>>
> >>>> Suggested-by: Medvedkin Vladimir<vladimir.medvedkin@intel.com>
> >>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
> >>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>
> >>>
> >>> Do you actually need to force noinline, or is just dropping the inline enough?
> >>> In general, letting the compiler decide is often best practice.
> >> Forcing noinline is an optimization for x86 platforms to keep the
> >> rte_lpm_add() API performance with the memory ordering changes applied.
> > I don't think you answered my question. What does a recent version of
> > GCC do if you drop the inline?
> >
> > Actually, all the functions in rte_lpm should drop inline.
> I agree with Stephen. If it is not on the fast path and the function is not
> minimal in size, it is good to remove the inline qualifier from the other
> control plane functions such as rule_add/delete/find/etc. and let the compiler
> decide whether to inline them (unless it affects performance).
IMO, the rule needs to be simple. If it is a control plane function, we should leave it to the compiler to decide. I do not think we need to worry too much about the performance of control plane functions.

> > diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index
> > 6b7b28a2e431..ffe07e980864 100644
> > --- a/lib/librte_lpm/rte_lpm.c
> > +++ b/lib/librte_lpm/rte_lpm.c
> > @@ -399,7 +399,7 @@ MAP_STATIC_SYMBOL(void rte_lpm_free(struct
> rte_lpm *lpm),
> >    * are stored in the rule table from 0 - 31.
> >    * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
> >    */
> > -static inline int32_t
> > +static int32_t
> >   rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t
> depth,
> >   	uint8_t next_hop)
> >   {
> > @@ -471,7 +471,7 @@ rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t
> ip_masked, uint8_t depth,
> >   	return rule_index;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
> >   	uint32_t next_hop)
> >   {
> > @@ -547,7 +547,7 @@ rule_add_v1604(struct rte_lpm *lpm, uint32_t
> ip_masked, uint8_t depth,
> >    * Delete a rule from the rule table.
> >    * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
> >    */
> > -static inline void
> > +static void
> >   rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t
> depth)
> >   {
> >   	int i;
> > @@ -570,7 +570,7 @@ rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t
> rule_index, uint8_t depth)
> >   	lpm->rule_info[depth - 1].used_rules--;
> >   }
> >
> > -static inline void
> > +static void
> >   rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth)
> >   {
> >   	int i;
> > @@ -597,7 +597,7 @@ rule_delete_v1604(struct rte_lpm *lpm, int32_t
> rule_index, uint8_t depth)
> >    * Finds a rule in rule table.
> >    * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
> >    */
> > -static inline int32_t
> > +static int32_t
> >   rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t
> depth)
> >   {
> >   	uint32_t rule_gindex, last_rule, rule_index; @@ -618,7 +618,7 @@
> > rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth)
> >   	return -EINVAL;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)
> >   {
> >   	uint32_t rule_gindex, last_rule, rule_index; @@ -642,7 +642,7 @@
> > rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)
> >   /*
> >    * Find, clean and allocate a tbl8.
> >    */
> > -static inline int32_t
> > +static int32_t
> >   tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
> >   {
> >   	uint32_t group_idx; /* tbl8 group index. */ @@ -669,7 +669,7 @@
> > tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
> >   	return -ENOSPC;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s)
> >   {
> >   	uint32_t group_idx; /* tbl8 group index. */ @@ -709,7 +709,7 @@
> > tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start)
> >   	tbl8[tbl8_group_start].valid_group = INVALID;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
> >   		uint8_t next_hop)
> >   {
> > @@ -777,7 +777,7 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip, uint8_t depth,
> >   	return 0;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
> >   		uint32_t next_hop)
> >   {
> > @@ -846,7 +846,7 @@ add_depth_small_v1604(struct rte_lpm *lpm,
> uint32_t ip, uint8_t depth,
> >   	return 0;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
> uint8_t depth,
> >   		uint8_t next_hop)
> >   {
> > @@ -971,7 +971,7 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip_masked, uint8_t depth,
> >   	return 0;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t
> depth,
> >   		uint32_t next_hop)
> >   {
> > @@ -1244,7 +1244,7 @@
> BIND_DEFAULT_SYMBOL(rte_lpm_is_rule_present, _v1604, 16.04);
> >   MAP_STATIC_SYMBOL(int rte_lpm_is_rule_present(struct rte_lpm *lpm,
> uint32_t ip,
> >   		uint8_t depth, uint32_t *next_hop),
> > rte_lpm_is_rule_present_v1604);
> >
> > -static inline int32_t
> > +static int32_t
> >   find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t
> depth,
> >   		uint8_t *sub_rule_depth)
> >   {
> > @@ -1266,7 +1266,7 @@ find_previous_rule_v20(struct rte_lpm_v20
> *lpm, uint32_t ip, uint8_t depth,
> >   	return -1;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
> >   		uint8_t *sub_rule_depth)
> >   {
> > @@ -1288,7 +1288,7 @@ find_previous_rule_v1604(struct rte_lpm *lpm,
> uint32_t ip, uint8_t depth,
> >   	return -1;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
> >   	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
> >   {
> > @@ -1381,7 +1381,7 @@ delete_depth_small_v20(struct rte_lpm_v20
> *lpm, uint32_t ip_masked,
> >   	return 0;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
> >   	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
> >   {
> > @@ -1483,7 +1483,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm,
> uint32_t ip_masked,
> >    * Return of value > -1 means tbl8 is in use but has all the same values and
> >    * thus can be recycled
> >    */
> > -static inline int32_t
> > +static int32_t
> >   tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8,
> >   		uint32_t tbl8_group_start)
> >   {
> > @@ -1530,7 +1530,7 @@ tbl8_recycle_check_v20(struct
> rte_lpm_tbl_entry_v20 *tbl8,
> >   	return -EINVAL;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8,
> >   		uint32_t tbl8_group_start)
> >   {
> > @@ -1577,7 +1577,7 @@ tbl8_recycle_check_v1604(struct
> rte_lpm_tbl_entry *tbl8,
> >   	return -EINVAL;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
> >   	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
> >   {
> > @@ -1655,7 +1655,7 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip_masked,
> >   	return 0;
> >   }
> >
> > -static inline int32_t
> > +static int32_t
> >   delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
> >   	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
> >   {
> 
> --
> Regards,
> Vladimir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-28 13:57         ` Honnappa Nagarahalli
@ 2019-06-28 14:16           ` Medvedkin, Vladimir
  2019-06-28 15:35             ` Stephen Hemminger
  0 siblings, 1 reply; 23+ messages in thread
From: Medvedkin, Vladimir @ 2019-06-28 14:16 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Stephen Hemminger,
	Ruifeng Wang (Arm Technology China)
  Cc: bruce.richardson, dev, Gavin Hu (Arm Technology China), nd

Hi Honnappa,

On 28/06/2019 14:57, Honnappa Nagarahalli wrote:
>> Hi all,
>>
>> On 28/06/2019 05:34, Stephen Hemminger wrote:
>>> On Fri, 28 Jun 2019 02:44:54 +0000
>>> "Ruifeng Wang (Arm Technology China)"<Ruifeng.Wang@arm.com>  wrote:
>>>
>>>>>> Tests showed that the function inlining caused performance drop on
>>>>>> some x86 platforms with the memory ordering patches applied.
>>>>>> By force no-inline functions, the performance was better than
>>>>>> before on x86 and no impact to arm64 platforms.
>>>>>>
>>>>>> Suggested-by: Medvedkin Vladimir<vladimir.medvedkin@intel.com>
>>>>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
>>>>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>
>>>>>    {
>>>>>
>>>>> Do you actually need to force noinline or is just taking off inline enough?
>>>>> In general, letting compiler decide is often best practice.
>>>> The force noinline is an optimization for x86 platforms to keep
>>>> rte_lpm_add() API performance with memory ordering applied.
>>> I don't think you answered my question. What does a recent version of
>>> GCC do if you drop the inline.
>>>
>>> Actually all the functions in rte_lpm should drop inline.
>> I agree with Stephen. If it is not a fastpath and size of function is not
>> minimal it is good to remove inline qualifier for other control plane functions
>> such as rule_add/delete/find/etc and let the compiler decide to inline it
>> (unless it affects performance).
> IMO, the rule needs to be simple. If it is control plane function, we should leave it to the compiler to decide. I do not think we need to worry too much about performance for control plane functions.
Control plane speed is not as important as data plane speed, but it still 
matters. For LPM we are talking not about initialization, but about the 
runtime route add/delete functions. If they are very slow, the library will 
be totally unusable: after it receives a route update it will be blocked for 
a long time, and the route update queue will overflow.
>>> diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index
>>> 6b7b28a2e431..ffe07e980864 100644
>>> --- a/lib/librte_lpm/rte_lpm.c
>>> +++ b/lib/librte_lpm/rte_lpm.c
>>> @@ -399,7 +399,7 @@ MAP_STATIC_SYMBOL(void rte_lpm_free(struct
>> rte_lpm *lpm),
>>>     * are stored in the rule table from 0 - 31.
>>>     * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
>>>     */
>>> -static inline int32_t
>>> +static int32_t
>>>    rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t
>> depth,
>>>    	uint8_t next_hop)
>>>    {
>>> @@ -471,7 +471,7 @@ rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t
>> ip_masked, uint8_t depth,
>>>    	return rule_index;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
>>>    	uint32_t next_hop)
>>>    {
>>> @@ -547,7 +547,7 @@ rule_add_v1604(struct rte_lpm *lpm, uint32_t
>> ip_masked, uint8_t depth,
>>>     * Delete a rule from the rule table.
>>>     * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
>>>     */
>>> -static inline void
>>> +static void
>>>    rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t
>> depth)
>>>    {
>>>    	int i;
>>> @@ -570,7 +570,7 @@ rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t
>> rule_index, uint8_t depth)
>>>    	lpm->rule_info[depth - 1].used_rules--;
>>>    }
>>>
>>> -static inline void
>>> +static void
>>>    rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth)
>>>    {
>>>    	int i;
>>> @@ -597,7 +597,7 @@ rule_delete_v1604(struct rte_lpm *lpm, int32_t
>> rule_index, uint8_t depth)
>>>     * Finds a rule in rule table.
>>>     * NOTE: Valid range for depth parameter is 1 .. 32 inclusive.
>>>     */
>>> -static inline int32_t
>>> +static int32_t
>>>    rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t
>> depth)
>>>    {
>>>    	uint32_t rule_gindex, last_rule, rule_index; @@ -618,7 +618,7 @@
>>> rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth)
>>>    	return -EINVAL;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)
>>>    {
>>>    	uint32_t rule_gindex, last_rule, rule_index; @@ -642,7 +642,7 @@
>>> rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth)
>>>    /*
>>>     * Find, clean and allocate a tbl8.
>>>     */
>>> -static inline int32_t
>>> +static int32_t
>>>    tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
>>>    {
>>>    	uint32_t group_idx; /* tbl8 group index. */ @@ -669,7 +669,7 @@
>>> tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8)
>>>    	return -ENOSPC;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s)
>>>    {
>>>    	uint32_t group_idx; /* tbl8 group index. */ @@ -709,7 +709,7 @@
>>> tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start)
>>>    	tbl8[tbl8_group_start].valid_group = INVALID;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
>>>    		uint8_t next_hop)
>>>    {
>>> @@ -777,7 +777,7 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm,
>> uint32_t ip, uint8_t depth,
>>>    	return 0;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
>>>    		uint32_t next_hop)
>>>    {
>>> @@ -846,7 +846,7 @@ add_depth_small_v1604(struct rte_lpm *lpm,
>> uint32_t ip, uint8_t depth,
>>>    	return 0;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>> uint8_t depth,
>>>    		uint8_t next_hop)
>>>    {
>>> @@ -971,7 +971,7 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm,
>> uint32_t ip_masked, uint8_t depth,
>>>    	return 0;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t
>> depth,
>>>    		uint32_t next_hop)
>>>    {
>>> @@ -1244,7 +1244,7 @@
>> BIND_DEFAULT_SYMBOL(rte_lpm_is_rule_present, _v1604, 16.04);
>>>    MAP_STATIC_SYMBOL(int rte_lpm_is_rule_present(struct rte_lpm *lpm,
>> uint32_t ip,
>>>    		uint8_t depth, uint32_t *next_hop),
>>> rte_lpm_is_rule_present_v1604);
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t
>> depth,
>>>    		uint8_t *sub_rule_depth)
>>>    {
>>> @@ -1266,7 +1266,7 @@ find_previous_rule_v20(struct rte_lpm_v20
>> *lpm, uint32_t ip, uint8_t depth,
>>>    	return -1;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
>>>    		uint8_t *sub_rule_depth)
>>>    {
>>> @@ -1288,7 +1288,7 @@ find_previous_rule_v1604(struct rte_lpm *lpm,
>> uint32_t ip, uint8_t depth,
>>>    	return -1;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>>>    	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
>>>    {
>>> @@ -1381,7 +1381,7 @@ delete_depth_small_v20(struct rte_lpm_v20
>> *lpm, uint32_t ip_masked,
>>>    	return 0;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>>>    	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
>>>    {
>>> @@ -1483,7 +1483,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm,
>> uint32_t ip_masked,
>>>     * Return of value > -1 means tbl8 is in use but has all the same values and
>>>     * thus can be recycled
>>>     */
>>> -static inline int32_t
>>> +static int32_t
>>>    tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8,
>>>    		uint32_t tbl8_group_start)
>>>    {
>>> @@ -1530,7 +1530,7 @@ tbl8_recycle_check_v20(struct
>> rte_lpm_tbl_entry_v20 *tbl8,
>>>    	return -EINVAL;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8,
>>>    		uint32_t tbl8_group_start)
>>>    {
>>> @@ -1577,7 +1577,7 @@ tbl8_recycle_check_v1604(struct
>> rte_lpm_tbl_entry *tbl8,
>>>    	return -EINVAL;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
>>>    	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
>>>    {
>>> @@ -1655,7 +1655,7 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm,
>> uint32_t ip_masked,
>>>    	return 0;
>>>    }
>>>
>>> -static inline int32_t
>>> +static int32_t
>>>    delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
>>>    	uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth)
>>>    {
>> --
>> Regards,
>> Vladimir

-- 
Regards,
Vladimir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-28 14:16           ` Medvedkin, Vladimir
@ 2019-06-28 15:35             ` Stephen Hemminger
  2019-07-01  6:44               ` Ruifeng Wang (Arm Technology China)
  2019-07-05 10:31               ` Medvedkin, Vladimir
  0 siblings, 2 replies; 23+ messages in thread
From: Stephen Hemminger @ 2019-06-28 15:35 UTC (permalink / raw)
  To: Medvedkin, Vladimir
  Cc: Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	bruce.richardson, dev, Gavin Hu (Arm Technology China),
	nd

On Fri, 28 Jun 2019 15:16:30 +0100
"Medvedkin, Vladimir" <vladimir.medvedkin@intel.com> wrote:

> Hi Honnappa,
> 
> On 28/06/2019 14:57, Honnappa Nagarahalli wrote:
> >> Hi all,
> >>
> >> On 28/06/2019 05:34, Stephen Hemminger wrote:  
> >>> On Fri, 28 Jun 2019 02:44:54 +0000
> >>> "Ruifeng Wang (Arm Technology China)"<Ruifeng.Wang@arm.com>  wrote:
> >>>  
> >>>>>> Tests showed that the function inlining caused performance drop on
> >>>>>> some x86 platforms with the memory ordering patches applied.
> >>>>>> By force no-inline functions, the performance was better than
> >>>>>> before on x86 and no impact to arm64 platforms.
> >>>>>>
> >>>>>> Suggested-by: Medvedkin Vladimir<vladimir.medvedkin@intel.com>
> >>>>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
> >>>>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>  
> >>>>>    {
> >>>>>
> >>>>> Do you actually need to force noinline or is just taking off inline enough?
> >>>>> In general, letting compiler decide is often best practice.  
> >>>> The force noinline is an optimization for x86 platforms to keep
> >>>> rte_lpm_add() API performance with memory ordering applied.  
> >>> I don't think you answered my question. What does a recent version of
> >>> GCC do if you drop the inline.
> >>>
> >>> Actually all the functions in rte_lpm should drop inline.  
> >> I agree with Stephen. If it is not a fastpath and size of function is not
> >> minimal it is good to remove inline qualifier for other control plane functions
> >> such as rule_add/delete/find/etc and let the compiler decide to inline it
> >> (unless it affects performance).  
> > IMO, the rule needs to be simple. If it is control plane function, we should leave it to the compiler to decide. I do not think we need to worry too much about performance for control plane functions.  
> Control plane is not as important as data plane speed but it is still 
> important. For lpm we are talking not about initialization, but runtime 
> routes add/del related functions. If it is very slow the library will be 
> totally unusable because after it receives a route update it will be 
> blocked for a long time and route update queue would overflow.

Control plane performance is more impacted by algorithmic choice.
The original LPM had a terrible (n^2?) control path; the current code is better.
I had a patch using an RB tree, but it was rejected because it used
/usr/include/bsd/sys/tree.h, which added an external dependency.
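
For illustration only, a minimal sketch of what such a tree-based rules
table could look like (hypothetical names, not the rejected patch itself;
bsd/sys/tree.h is exactly the dependency in question):

    #include <bsd/sys/tree.h>
    #include <stdint.h>

    struct lpm_rule_node {
            RB_ENTRY(lpm_rule_node) entry;
            uint32_t ip_masked;
            uint8_t  depth;
            uint32_t next_hop;
    };

    /* Order rules by masked address, then by depth, so a rule lookup
     * becomes an O(log n) descent instead of a linear scan of a flat
     * array.
     */
    static int
    lpm_rule_cmp(struct lpm_rule_node *a, struct lpm_rule_node *b)
    {
            if (a->ip_masked != b->ip_masked)
                    return a->ip_masked < b->ip_masked ? -1 : 1;
            return (int)a->depth - (int)b->depth;
    }

    RB_HEAD(lpm_rule_tree, lpm_rule_node);
    RB_GENERATE_STATIC(lpm_rule_tree, lpm_rule_node, entry, lpm_rule_cmp)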

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20
  2019-06-28 13:33   ` Medvedkin, Vladimir
@ 2019-06-29 17:35     ` Honnappa Nagarahalli
  2019-07-05 13:45       ` Alex Kiselev
  2019-07-01  7:08     ` Ruifeng Wang (Arm Technology China)
  1 sibling, 1 reply; 23+ messages in thread
From: Honnappa Nagarahalli @ 2019-06-29 17:35 UTC (permalink / raw)
  To: Medvedkin, Vladimir, Ruifeng Wang (Arm Technology China),
	bruce.richardson
  Cc: dev, Gavin Hu (Arm Technology China), Honnappa Nagarahalli, nd, nd

<snip>

> 
> As a general remark consider writing all of the tbl entries including
> tbl8 with atomic_store. Now "lpm->tbl8[j] = new_tbl8_entry;" looks like
> 
>       1e9:       44 88 9c 47 40 01 00    mov
> %r11b,0x2000140(%rdi,%rax,2) <-write first byte
>       1f0:       02
>       1f1:       48 83 c0 01             add    $0x1,%rax
>       1f5:       42 88 8c 47 41 01 00    mov %cl,0x2000141(%rdi,%r8,2) <-write
> second byte
>       1fc:       02
> 
> This may cause an incorrect nexthop to be returned. If the byte with valid flag
> is updated first, the old (and maybe invalid) next hop could be returned.
+1

It is surprising that the compiler is not generating a single 32b store. As you mentioned, a 'relaxed' __atomic_store_n should be good.
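
For illustration, a hedged sketch of such a store for the 4-byte v1604
entry (the diffs in this series use the generic __atomic_store on the
struct; the union here just makes the single 32-bit store explicit, and
the variable names are taken from the snippet above):

    union {
            struct rte_lpm_tbl_entry e;
            uint32_t raw;
    } u = { .e = new_tbl8_entry };

    /* Publish the whole entry at once so the valid flag and the next
     * hop can never be observed half-written.
     */
    __atomic_store_n((uint32_t *)&lpm->tbl8[j], u.raw, __ATOMIC_RELAXED);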

> 
> Please evaluate the performance drop after this change.
> 
> --
> Regards,
> Vladimir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-28 15:35             ` Stephen Hemminger
@ 2019-07-01  6:44               ` Ruifeng Wang (Arm Technology China)
  2019-07-05 10:40                 ` Medvedkin, Vladimir
  2019-07-05 10:31               ` Medvedkin, Vladimir
  1 sibling, 1 reply; 23+ messages in thread
From: Ruifeng Wang (Arm Technology China) @ 2019-07-01  6:44 UTC (permalink / raw)
  To: Medvedkin, Vladimir
  Cc: Honnappa Nagarahalli, Stephen Hemminger, bruce.richardson, dev,
	Gavin Hu (Arm Technology China),
	nd, nd

Hi Medvedkin,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Friday, June 28, 2019 23:35
> To: Medvedkin, Vladimir <vladimir.medvedkin@intel.com>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> (Arm Technology China) <Ruifeng.Wang@arm.com>;
> bruce.richardson@intel.com; dev@dpdk.org; Gavin Hu (Arm Technology
> China) <Gavin.Hu@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary
> functions
> 
> On Fri, 28 Jun 2019 15:16:30 +0100
> "Medvedkin, Vladimir" <vladimir.medvedkin@intel.com> wrote:
> 
> > Hi Honnappa,
> >
> > On 28/06/2019 14:57, Honnappa Nagarahalli wrote:
> > >> Hi all,
> > >>
> > >> On 28/06/2019 05:34, Stephen Hemminger wrote:
> > >>> On Fri, 28 Jun 2019 02:44:54 +0000 "Ruifeng Wang (Arm Technology
> > >>> China)"<Ruifeng.Wang@arm.com>  wrote:
> > >>>
> > >>>>>> Tests showed that the function inlining caused performance drop
> > >>>>>> on some x86 platforms with the memory ordering patches applied.
> > >>>>>> By force no-inline functions, the performance was better than
> > >>>>>> before on x86 and no impact to arm64 platforms.
> > >>>>>>
> > >>>>>> Suggested-by: Medvedkin
> Vladimir<vladimir.medvedkin@intel.com>
> > >>>>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
> > >>>>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>
> > >>>>>    {
> > >>>>>
> > >>>>> Do you actually need to force noinline or is just taking off inline
> enough?
> > >>>>> In general, letting compiler decide is often best practice.
> > >>>> The force noinline is an optimization for x86 platforms to keep
> > >>>> rte_lpm_add() API performance with memory ordering applied.
> > >>> I don't think you answered my question. What does a recent version
> > >>> of GCC do if you drop the inline.
> > >>>
> > >>> Actually all the functions in rte_lpm should drop inline.
> > >> I agree with Stephen. If it is not a fastpath and size of
> > >> function is not minimal it is good to remove inline qualifier for
> > >> other control plane functions such as rule_add/delete/find/etc and
> > >> let the compiler decide to inline it (unless it affects performance).
> > > IMO, the rule needs to be simple. If it is control plane function, we should
> leave it to the compiler to decide. I do not think we need to worry too much
> about performance for control plane functions.
> > Control plane is not as important as data plane speed but it is still
> > important. For lpm we are talking not about initialization, but
> > runtime routes add/del related functions. If it is very slow the
> > library will be totally unusable because after it receives a route
> > update it will be blocked for a long time and route update queue would
> overflow.
> 
> Control plane performance is more impacted by algorithmic choice.
> The original LPM had terrible (n^2?) control path. Current code is better.
> I had a patch using RB tree, but it was rejected because it used the
> /usr/include/bsd/sys/tree.h which added a dependency.

Based on the current discussion, I'd like to drop this single patch from the
patch set, since it is not directly related to the memory ordering changes in
this library. We can remove the inlines in a follow-up patch.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20
  2019-06-28 13:33   ` Medvedkin, Vladimir
  2019-06-29 17:35     ` Honnappa Nagarahalli
@ 2019-07-01  7:08     ` Ruifeng Wang (Arm Technology China)
  1 sibling, 0 replies; 23+ messages in thread
From: Ruifeng Wang (Arm Technology China) @ 2019-07-01  7:08 UTC (permalink / raw)
  To: Medvedkin, Vladimir, bruce.richardson
  Cc: dev, Honnappa Nagarahalli, Gavin Hu (Arm Technology China), nd, nd

Hi Vladimir,

> -----Original Message-----
> From: Medvedkin, Vladimir <vladimir.medvedkin@intel.com>
> Sent: Friday, June 28, 2019 21:34
> To: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com>;
> bruce.richardson@intel.com
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; nd <nd@arm.com>
> Subject: Re: [PATCH v3 3/3] lib/lpm: memory orderings to avoid race
> conditions for v20
> 
> Hi Wang,
> 
> On 27/06/2019 10:37, Ruifeng Wang wrote:
> > When a tbl8 group is getting attached to a tbl24 entry, lookup might
> > fail even though the entry is configured in the table.
> >
> > For ex: consider a LPM table configured with 10.10.10.1/24.
> > When a new entry 10.10.10.32/28 is being added, a new tbl8 group is
> > allocated and tbl24 entry is changed to point to the tbl8 group. If
> > the tbl24 entry is written without the tbl8 group entries updated, a
> > lookup on 10.10.10.9 will return failure.
> >
> > Correct memory orderings are required to ensure that the store to
> > tbl24 does not happen before the stores to tbl8 group entries
> > complete.
> >
> > Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > ---
> > v3: no changes
> > v2: fixed clang building issue by supplying alignment attribute.
> > v1: initial version
> >
> >   lib/librte_lpm/rte_lpm.c | 31 ++++++++++++++++++++++++-------
> >   lib/librte_lpm/rte_lpm.h |  4 ++--
> >   2 files changed, 26 insertions(+), 9 deletions(-)
> >
> > diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index
> > fabd13fb0..5f8d494ae 100644
> > --- a/lib/librte_lpm/rte_lpm.c
> > +++ b/lib/librte_lpm/rte_lpm.c
> > @@ -737,7 +737,8 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip, uint8_t depth,
> >   			/* Setting tbl24 entry in one go to avoid race
> >   			 * conditions
> >   			 */
> > -			lpm->tbl24[i] = new_tbl24_entry;
> > +			__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
> > +					__ATOMIC_RELEASE);
> >
> >   			continue;
> >   		}
> > @@ -892,7 +893,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip_masked, uint8_t depth,
> >   			.depth = 0,
> >   		};
> >
> > -		lpm->tbl24[tbl24_index] = new_tbl24_entry;
> > +		__atomic_store(&lpm->tbl24[tbl24_index],
> &new_tbl24_entry,
> > +				__ATOMIC_RELEASE);
> >
> >   	} /* If valid entry but not extended calculate the index into Table8. */
> >   	else if (lpm->tbl24[tbl24_index].valid_group == 0) { @@ -938,7
> > +940,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t
> ip_masked, uint8_t depth,
> >   				.depth = 0,
> >   		};
> >
> > -		lpm->tbl24[tbl24_index] = new_tbl24_entry;
> > +		__atomic_store(&lpm->tbl24[tbl24_index],
> &new_tbl24_entry,
> > +				__ATOMIC_RELEASE);
> >
> >   	} else { /*
> >   		* If it is valid, extended entry calculate the index into tbl8.
> > @@ -1320,7 +1323,14 @@ delete_depth_small_v20(struct rte_lpm_v20
> *lpm,
> > uint32_t ip_masked,
> >
> >   			if (lpm->tbl24[i].valid_group == 0 &&
> >   					lpm->tbl24[i].depth <= depth) {
> > -				lpm->tbl24[i].valid = INVALID;
> > +				struct rte_lpm_tbl_entry_v20 zero_tbl_entry
> = {
> > +						.valid = INVALID,
> > +						.depth = 0,
> > +						.valid_group = 0,
> > +					};
> > +					zero_tbl_entry.next_hop = 0;
> 
> Please use the same var name in both v20 and v1604 (zero_tbl24_entry).
> The same goes for the struct initialization. In v1604 you use:
> 
> struct rte_lpm_tbl_entry zero_tbl24_entry = {0};
> 
Accepted on the var name.
However, the same struct initialization as in v1604 causes a meson build failure.
This appears to be related to structs with union members being used for both
big-endian and little-endian layouts. See commit e13676bfe7.
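
For reference, a hedged sketch of the difference (the failing form mirrors
the v1604 line quoted above, the working form mirrors this patch; whether
the braced form builds depends on the endian-specific union layout):

    /* v1604-style zero initialization; this form reportedly breaks the
     * meson build for the v20 entry because of the endian-specific
     * union/bit-field layout (see the commit referenced above):
     *
     *	struct rte_lpm_tbl_entry_v20 zero_tbl_entry = {0};
     */

    /* Working form from the patch: name the bit-fields explicitly and
     * set the union member separately.
     */
    struct rte_lpm_tbl_entry_v20 zero_tbl_entry = {
            .valid = INVALID,
            .depth = 0,
            .valid_group = 0,
    };
    zero_tbl_entry.next_hop = 0;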

> > +				__atomic_store(&lpm->tbl24[i],
> &zero_tbl_entry,
> > +						__ATOMIC_RELEASE);
> >   			} else if (lpm->tbl24[i].valid_group == 1) {
> >   				/*
> >   				 * If TBL24 entry is extended, then there has
> @@ -1365,7 +1375,8
> > @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t
> ip_masked,
> >
> >   			if (lpm->tbl24[i].valid_group == 0 &&
> >   					lpm->tbl24[i].depth <= depth) {
> > -				lpm->tbl24[i] = new_tbl24_entry;
> > +				__atomic_store(&lpm->tbl24[i],
> &new_tbl24_entry,
> > +						__ATOMIC_RELEASE);
> >   			} else  if (lpm->tbl24[i].valid_group == 1) {
> >   				/*
> >   				 * If TBL24 entry is extended, then there has
> @@ -1647,8
> > +1658,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t
> ip_masked,
> >   	tbl8_recycle_index = tbl8_recycle_check_v20(lpm->tbl8,
> > tbl8_group_start);
> >
> >   	if (tbl8_recycle_index == -EINVAL) {
> > -		/* Set tbl24 before freeing tbl8 to avoid race condition. */
> > +		/* Set tbl24 before freeing tbl8 to avoid race condition.
> > +		 * Prevent the free of the tbl8 group from hoisting.
> > +		 */
> >   		lpm->tbl24[tbl24_index].valid = 0;
> > +		__atomic_thread_fence(__ATOMIC_RELEASE);
> >   		tbl8_free_v20(lpm->tbl8, tbl8_group_start);
> >   	} else if (tbl8_recycle_index > -1) {
> >   		/* Update tbl24 entry. */
> > @@ -1659,8 +1673,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm,
> uint32_t ip_masked,
> >   			.depth = lpm->tbl8[tbl8_recycle_index].depth,
> >   		};
> >
> > -		/* Set tbl24 before freeing tbl8 to avoid race condition. */
> > +		/* Set tbl24 before freeing tbl8 to avoid race condition.
> > +		 * Prevent the free of the tbl8 group from hoisting.
> > +		 */
> >   		lpm->tbl24[tbl24_index] = new_tbl24_entry;
> > +		__atomic_thread_fence(__ATOMIC_RELEASE);
> >   		tbl8_free_v20(lpm->tbl8, tbl8_group_start);
> >   	}
> >
> > diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h index
> > 6f5704c5c..98c70ecbe 100644
> > --- a/lib/librte_lpm/rte_lpm.h
> > +++ b/lib/librte_lpm/rte_lpm.h
> > @@ -88,7 +88,7 @@ struct rte_lpm_tbl_entry_v20 {
> >   	 */
> >   	uint8_t valid_group :1;
> >   	uint8_t depth       :6; /**< Rule depth. */
> > -};
> > +} __rte_aligned(2);
> 
> I think it is better to use __rte_aligned(sizeof(uint16_t)).
> 
Will include this change in the next version.

> >
> >   __extension__
> >   struct rte_lpm_tbl_entry {
> > @@ -121,7 +121,7 @@ struct rte_lpm_tbl_entry_v20 {
> >   		uint8_t group_idx;
> >   		uint8_t next_hop;
> >   	};
> > -};
> > +} __rte_aligned(2);
> >
> >   __extension__
> >   struct rte_lpm_tbl_entry {
> 
> As a general remark consider writing all of the tbl entries including
> tbl8 with atomic_store. Now "lpm->tbl8[j] = new_tbl8_entry;" looks like
> 
>       1e9:       44 88 9c 47 40 01 00    mov
> %r11b,0x2000140(%rdi,%rax,2) <-write first byte
>       1f0:       02
>       1f1:       48 83 c0 01             add    $0x1,%rax
>       1f5:       42 88 8c 47 41 01 00    mov %cl,0x2000141(%rdi,%r8,2) <-write
> second byte
>       1fc:       02
> 
> This may cause an incorrect nexthop to be returned. If the byte with valid
> flag is updated first, the old (and maybe invalid) next hop could be returned.
> 
> Please evaluate the performance drop after this change.
> 
Thanks for looking into this. It is a surprise that the compiler doesn't
generate a single atomic store for this data type.
Will include the change in the next version.

> --
> Regards,
> Vladimir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-06-28 15:35             ` Stephen Hemminger
  2019-07-01  6:44               ` Ruifeng Wang (Arm Technology China)
@ 2019-07-05 10:31               ` Medvedkin, Vladimir
  2019-07-05 13:37                 ` Alex Kiselev
  1 sibling, 1 reply; 23+ messages in thread
From: Medvedkin, Vladimir @ 2019-07-05 10:31 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	bruce.richardson, dev, Gavin Hu (Arm Technology China),
	nd

Hi Stephen,

On 28/06/2019 16:35, Stephen Hemminger wrote:
> On Fri, 28 Jun 2019 15:16:30 +0100
> "Medvedkin, Vladimir" <vladimir.medvedkin@intel.com> wrote:
>
>> Hi Honnappa,
>>
>> On 28/06/2019 14:57, Honnappa Nagarahalli wrote:
>>>> Hi all,
>>>>
>>>> On 28/06/2019 05:34, Stephen Hemminger wrote:
>>>>> On Fri, 28 Jun 2019 02:44:54 +0000
>>>>> "Ruifeng Wang (Arm Technology China)"<Ruifeng.Wang@arm.com>  wrote:
>>>>>   
>>>>>>>> Tests showed that the function inlining caused performance drop on
>>>>>>>> some x86 platforms with the memory ordering patches applied.
>>>>>>>> By force no-inline functions, the performance was better than
>>>>>>>> before on x86 and no impact to arm64 platforms.
>>>>>>>>
>>>>>>>> Suggested-by: Medvedkin Vladimir<vladimir.medvedkin@intel.com>
>>>>>>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
>>>>>>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>
>>>>>>>     {
>>>>>>>
>>>>>>> Do you actually need to force noinline or is just taking off inline enough?
>>>>>>> In general, letting compiler decide is often best practice.
>>>>>> The force noinline is an optimization for x86 platforms to keep
>>>>>> rte_lpm_add() API performance with memory ordering applied.
>>>>> I don't think you answered my question. What does a recent version of
>>>>> GCC do if you drop the inline.
>>>>>
>>>>> Actually all the functions in rte_lpm should drop inline.
>>>> I agree with Stephen. If it is not a fastpath and size of function is not
>>>> minimal it is good to remove inline qualifier for other control plane functions
>>>> such as rule_add/delete/find/etc and let the compiler decide to inline it
>>>> (unless it affects performance).
>>> IMO, the rule needs to be simple. If it is control plane function, we should leave it to the compiler to decide. I do not think we need to worry too much about performance for control plane functions.
>> Control plane is not as important as data plane speed but it is still
>> important. For lpm we are talking not about initialization, but runtime
>> routes add/del related functions. If it is very slow the library will be
>> totally unusable because after it receives a route update it will be
>> blocked for a long time and route update queue would overflow.
> Control plane performance is more impacted by algorithmic choice.
> The original LPM had terrible (n^2?) control path. Current code is better.
> I had a patch using RB tree, but it was rejected because it used the
> /usr/include/bsd/sys/tree.h which added a dependency.

You're absolutely right, control plane performance mostly depends on the 
algorithm. The current LPM implementation has a number of problems there. One 
problem is rules_tbl[], a flat array containing routes for control plane 
purposes. Replacing it with an rb-tree solves this problem, but there are 
other problems. For example, when you try to add a route 10.0.0.0/8 while a 
number of subroutes exist in the table (for example 10.20.0.0/16), the 
current implementation will load each tbl_entry -> do some checks (depth, ext 
entry) -> conditionally store the new entry. Under several circumstances this 
can take a long time. But in fact it needs to unconditionally rewrite only 
two ranges - from 10.0.0.0 to 10.19.255.255 and from 10.21.0.0 to 
10.255.255.255. And the control plane could help us to get these two ranges. 
The best structure to do so is an lc-trie, because it is relatively easy to 
traverse the subtree (described by 10.0.0.0/8) and collect its subroutes. We 
are working on a new implementation; hopefully it will be ready soon.
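
To make the range argument concrete, a small sketch (plain shifts,
addresses in host byte order; an illustration, not code from the upcoming
implementation):

    #include <stdint.h>

    /* Adding 10.0.0.0/8 while the subroute 10.20.0.0/16 exists. */
    uint32_t net   = 10u << 24;                     /* 10.0.0.0  */
    uint32_t nmask = ~0u << (32 - 8);               /* /8 mask   */
    uint32_t sub   = (10u << 24) | (20u << 16);     /* 10.20.0.0 */
    uint32_t smask = ~0u << (32 - 16);              /* /16 mask  */

    /* First range: start of the /8 up to just before the subroute. */
    uint32_t r1_lo = net;                   /* 10.0.0.0       */
    uint32_t r1_hi = sub - 1;               /* 10.19.255.255  */

    /* Second range: just past the subroute to the end of the /8. */
    uint32_t r2_lo = (sub | ~smask) + 1;    /* 10.21.0.0      */
    uint32_t r2_hi = net | ~nmask;          /* 10.255.255.255 */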

-- 
Regards,
Vladimir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-07-01  6:44               ` Ruifeng Wang (Arm Technology China)
@ 2019-07-05 10:40                 ` Medvedkin, Vladimir
  2019-07-05 10:58                   ` Ruifeng Wang (Arm Technology China)
  0 siblings, 1 reply; 23+ messages in thread
From: Medvedkin, Vladimir @ 2019-07-05 10:40 UTC (permalink / raw)
  To: Ruifeng Wang (Arm Technology China)
  Cc: Honnappa Nagarahalli, Stephen Hemminger, bruce.richardson, dev,
	Gavin Hu (Arm Technology China),
	nd


On 01/07/2019 07:44, Ruifeng Wang (Arm Technology China) wrote:
> Hi Medvedkin,
>
>> -----Original Message-----
>> From: Stephen Hemminger <stephen@networkplumber.org>
>> Sent: Friday, June 28, 2019 23:35
>> To: Medvedkin, Vladimir <vladimir.medvedkin@intel.com>
>> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
>> (Arm Technology China) <Ruifeng.Wang@arm.com>;
>> bruce.richardson@intel.com; dev@dpdk.org; Gavin Hu (Arm Technology
>> China) <Gavin.Hu@arm.com>; nd <nd@arm.com>
>> Subject: Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary
>> functions
>>
>> On Fri, 28 Jun 2019 15:16:30 +0100
>> "Medvedkin, Vladimir" <vladimir.medvedkin@intel.com> wrote:
>>
>>> Hi Honnappa,
>>>
>>> On 28/06/2019 14:57, Honnappa Nagarahalli wrote:
>>>>> Hi all,
>>>>>
>>>>> On 28/06/2019 05:34, Stephen Hemminger wrote:
>>>>>> On Fri, 28 Jun 2019 02:44:54 +0000 "Ruifeng Wang (Arm Technology
>>>>>> China)"<Ruifeng.Wang@arm.com>  wrote:
>>>>>>
>>>>>>>>> Tests showed that the function inlining caused performance drop
>>>>>>>>> on some x86 platforms with the memory ordering patches applied.
>>>>>>>>> By force no-inline functions, the performance was better than
>>>>>>>>> before on x86 and no impact to arm64 platforms.
>>>>>>>>>
>>>>>>>>> Suggested-by: Medvedkin
>> Vladimir<vladimir.medvedkin@intel.com>
>>>>>>>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
>>>>>>>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>
>>>>>>>>     {
>>>>>>>>
>>>>>>>> Do you actually need to force noinline or is just taking off inline
>> enough?
>>>>>>>> In general, letting compiler decide is often best practice.
>>>>>>> The force noinline is an optimization for x86 platforms to keep
>>>>>>> rte_lpm_add() API performance with memory ordering applied.
>>>>>> I don't think you answered my question. What does a recent version
>>>>>> of GCC do if you drop the inline.
>>>>>>
>>>>>> Actually all the functions in rte_lpm should drop inline.
>>>>> I agree with Stephen. If it is not a fastpath and size of
>>>>> function is not minimal it is good to remove inline qualifier for
>>>>> other control plane functions such as rule_add/delete/find/etc and
>>>>> let the compiler decide to inline it (unless it affects performance).
>>>> IMO, the rule needs to be simple. If it is control plane function, we should
>> leave it to the compiler to decide. I do not think we need to worry too much
>> about performance for control plane functions.
>>> Control plane is not as important as data plane speed but it is still
>>> important. For lpm we are talking not about initialization, but
>>> runtime routes add/del related functions. If it is very slow the
>>> library will be totally unusable because after it receives a route
>>> update it will be blocked for a long time and route update queue would
>> overflow.
>>
>> Control plane performance is more impacted by algorithmic choice.
>> The original LPM had terrible (n^2?) control path. Current code is better.
>> I had a patch using RB tree, but it was rejected because it used the
>> /usr/include/bsd/sys/tree.h which added a dependency.
> Based on current discussion, I'd like to drop this single patch from the patch set.
> Since it is not directly related to memory ordering changes in this library.
> We can remove inlines in a follow up patch.
I think this patch is indirectly related to the changes. I can't accept the 
memory ordering patch series _before_ this patch, because a repository state 
would appear in which the performance of LPM add/delete has dropped. So if 
that can be avoided, it has to be avoided.

-- 
Regards,
Vladimir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-07-05 10:40                 ` Medvedkin, Vladimir
@ 2019-07-05 10:58                   ` Ruifeng Wang (Arm Technology China)
  0 siblings, 0 replies; 23+ messages in thread
From: Ruifeng Wang (Arm Technology China) @ 2019-07-05 10:58 UTC (permalink / raw)
  To: Medvedkin, Vladimir
  Cc: Honnappa Nagarahalli, Stephen Hemminger, bruce.richardson, dev,
	Gavin Hu (Arm Technology China),
	nd, nd


> -----Original Message-----
> From: Medvedkin, Vladimir <vladimir.medvedkin@intel.com>
> Sent: Friday, July 5, 2019 18:41
> To: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Stephen
> Hemminger <stephen@networkplumber.org>; bruce.richardson@intel.com;
> dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; nd
> <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary
> functions
> 
> 
> On 01/07/2019 07:44, Ruifeng Wang (Arm Technology China) wrote:
> > Hi Medvedkin,
> >
> >> -----Original Message-----
> >> From: Stephen Hemminger <stephen@networkplumber.org>
> >> Sent: Friday, June 28, 2019 23:35
> >> To: Medvedkin, Vladimir <vladimir.medvedkin@intel.com>
> >> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng
> Wang
> >> (Arm Technology China) <Ruifeng.Wang@arm.com>;
> >> bruce.richardson@intel.com; dev@dpdk.org; Gavin Hu (Arm Technology
> >> China) <Gavin.Hu@arm.com>; nd <nd@arm.com>
> >> Subject: Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline
> >> unnecessary functions
> >>
> >> On Fri, 28 Jun 2019 15:16:30 +0100
> >> "Medvedkin, Vladimir" <vladimir.medvedkin@intel.com> wrote:
> >>
> >>> Hi Honnappa,
> >>>
> >>> On 28/06/2019 14:57, Honnappa Nagarahalli wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> On 28/06/2019 05:34, Stephen Hemminger wrote:
> >>>>>> On Fri, 28 Jun 2019 02:44:54 +0000 "Ruifeng Wang (Arm Technology
> >>>>>> China)"<Ruifeng.Wang@arm.com>  wrote:
> >>>>>>
> >>>>>>>>> Tests showed that the function inlining caused performance
> >>>>>>>>> drop on some x86 platforms with the memory ordering patches
> applied.
> >>>>>>>>> By force no-inline functions, the performance was better than
> >>>>>>>>> before on x86 and no impact to arm64 platforms.
> >>>>>>>>>
> >>>>>>>>> Suggested-by: Medvedkin
> >> Vladimir<vladimir.medvedkin@intel.com>
> >>>>>>>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
> >>>>>>>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>
> >>>>>>>>     {
> >>>>>>>>
> >>>>>>>> Do you actually need to force noinline or is just taking off
> >>>>>>>> inline
> >> enough?
> >>>>>>>> In general, letting compiler decide is often best practice.
> >>>>>>> The force noinline is an optimization for x86 platforms to keep
> >>>>>>> rte_lpm_add() API performance with memory ordering applied.
> >>>>>> I don't think you answered my question. What does a recent
> >>>>>> version of GCC do if you drop the inline.
> >>>>>>
> >>>>>> Actually all the functions in rte_lpm should drop inline.
> >>>>> I agree with Stephen. If it is not a fastpath and size of
> >>>>> function is not minimal it is good to remove inline qualifier for
> >>>>> other control plane functions such as rule_add/delete/find/etc and
> >>>>> let the compiler decide to inline it (unless it affects performance).
> >>>> IMO, the rule needs to be simple. If it is control plane function,
> >>>> we should
> >> leave it to the compiler to decide. I do not think we need to worry
> >> too much about performance for control plane functions.
> >>> Control plane is not as important as data plane speed but it is
> >>> still important. For lpm we are talking not about initialization,
> >>> but runtime routes add/del related functions. If it is very slow the
> >>> library will be totally unusable because after it receives a route
> >>> update it will be blocked for a long time and route update queue
> >>> would
> >> overflow.
> >>
> >> Control plane performance is more impacted by algorithmic choice.
> >> The original LPM had terrible (n^2?) control path. Current code is better.
> >> I had a patch using RB tree, but it was rejected because it used the
> >> /usr/include/bsd/sys/tree.h which added a dependency.
> > Based on current discussion, I'd like to drop this single patch from the patch
> set.
> > Since it is not directly related to memory ordering changes in this library.
> > We can remove inlines in a follow up patch.
> I think this patch is indirectly related to changes. I can't accept a memory
> ordering patch series _before_ this patch because a repository state will
> appear in which the performance of LPM add/delete has dropped. So if it
> could be avoided it have to be avoided.
> 
In patch set v4, I dropped this patch and added relaxed atomic stores for the
other table entries. Test results showed no performance drop with the whole
patch set. The test data was updated in the commit message of the first patch.
Please have a look. Thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-07-05 10:31               ` Medvedkin, Vladimir
@ 2019-07-05 13:37                 ` Alex Kiselev
  2019-07-05 16:53                   ` Medvedkin, Vladimir
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Kiselev @ 2019-07-05 13:37 UTC (permalink / raw)
  To: Medvedkin, Vladimir
  Cc: Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	bruce.richardson, dev, Gavin Hu (Arm Technology China),
	nd

On Fri, 5 Jul 2019 at 13:31, Medvedkin, Vladimir <vladimir.medvedkin@intel.com> wrote:
>
> Hi Stephen,
>
> On 28/06/2019 16:35, Stephen Hemminger wrote:
> > On Fri, 28 Jun 2019 15:16:30 +0100
> > "Medvedkin, Vladimir" <vladimir.medvedkin@intel.com> wrote:
> >
> >> Hi Honnappa,
> >>
> >> On 28/06/2019 14:57, Honnappa Nagarahalli wrote:
> >>>> Hi all,
> >>>>
> >>>> On 28/06/2019 05:34, Stephen Hemminger wrote:
> >>>>> On Fri, 28 Jun 2019 02:44:54 +0000
> >>>>> "Ruifeng Wang (Arm Technology China)"<Ruifeng.Wang@arm.com>  wrote:
> >>>>>
> >>>>>>>> Tests showed that the function inlining caused performance drop on
> >>>>>>>> some x86 platforms with the memory ordering patches applied.
> >>>>>>>> By force no-inline functions, the performance was better than
> >>>>>>>> before on x86 and no impact to arm64 platforms.
> >>>>>>>>
> >>>>>>>> Suggested-by: Medvedkin Vladimir<vladimir.medvedkin@intel.com>
> >>>>>>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
> >>>>>>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>
> >>>>>>>     {
> >>>>>>>
> >>>>>>> Do you actually need to force noinline or is just taking off inline enough?
> >>>>>>> In general, letting compiler decide is often best practice.
> >>>>>> The force noinline is an optimization for x86 platforms to keep
> >>>>>> rte_lpm_add() API performance with memory ordering applied.
> >>>>> I don't think you answered my question. What does a recent version of
> >>>>> GCC do if you drop the inline.
> >>>>>
> >>>>> Actually all the functions in rte_lpm should drop inline.
> >>>> I agree with Stephen. If it is not a fastpath and size of function is not
> >>>> minimal it is good to remove inline qualifier for other control plane functions
> >>>> such as rule_add/delete/find/etc and let the compiler decide to inline it
> >>>> (unless it affects performance).
> >>> IMO, the rule needs to be simple. If it is control plane function, we should leave it to the compiler to decide. I do not think we need to worry too much about performance for control plane functions.
> >> Control plane is not as important as data plane speed but it is still
> >> important. For lpm we are talking not about initialization, but runtime
> >> routes add/del related functions. If it is very slow the library will be
> >> totally unusable because after it receives a route update it will be
> >> blocked for a long time and route update queue would overflow.
> > Control plane performance is more impacted by algorithmic choice.
> > The original LPM had terrible (n^2?) control path. Current code is better.
> > I had a patch using RB tree, but it was rejected because it used the
> > /usr/include/bsd/sys/tree.h which added a dependency.
>
> You're absolutely right,  control plane performance is mostly depends on
> algorithm. Current LPM implementation has number of problems there. One
> problem is rules_tbl[] that is a flat array containing routes for
> control plane purposes. Replacing it with a rb-tree solves this problem,
> but there are another problems. For example, when you try to add a route
> 10.0.0.0/8 while a number of subroutes are exist in the table (for
> example 10.20.0.0/16), current implementation will load tbl_entry -> do
> some checks (depth, ext entry) -> conditionally store new entry. Under
> several circumstances it would take a lot time.  But in fact it needs to
> unconditionally rewrite only two ranges - from 10.0.0.0 to 10.19.255.255
> and from 10.21.0.0 to 10.255.255.255. And control plane could help us to
> get this two ranges. The best struct to do so is lc-tree because it is
> relatively easy to traverse subtree (described by 10.0.0.0/8) and get
> subroutes. We are working on a new implementation, hopefully it will be
> ready soon.

Have you considered switching to this algorithm?
http://www.nxlab.fer.hr/dxr/

-- 
Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20
  2019-06-29 17:35     ` Honnappa Nagarahalli
@ 2019-07-05 13:45       ` Alex Kiselev
  2019-07-05 16:56         ` Medvedkin, Vladimir
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Kiselev @ 2019-07-05 13:45 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Medvedkin, Vladimir, Ruifeng Wang (Arm Technology China),
	bruce.richardson, dev, Gavin Hu (Arm Technology China),
	nd

Hi,

> <snip>
>
> >
> > As a general remark consider writing all of the tbl entries including
> > tbl8 with atomic_store. Now "lpm->tbl8[j] = new_tbl8_entry;" looks like
> >
> >       1e9:       44 88 9c 47 40 01 00    mov
> > %r11b,0x2000140(%rdi,%rax,2) <-write first byte
> >       1f0:       02
> >       1f1:       48 83 c0 01             add    $0x1,%rax
> >       1f5:       42 88 8c 47 41 01 00    mov %cl,0x2000141(%rdi,%r8,2) <-write
> > second byte
> >       1fc:       02
> >
> > This may cause an incorrect nexthop to be returned. If the byte with valid flag
> > is updated first, the old (and maybe invalid) next hop could be returned.
> +1
>
> It is surprising that the compiler is not generating a single 32b store. As you mentioned 'relaxed' __atomic_store_n should be good.

Am I right that the x86 platform is not affected by the bug, since
store-store could not be reordered on x86?

-- 
Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions
  2019-07-05 13:37                 ` Alex Kiselev
@ 2019-07-05 16:53                   ` Medvedkin, Vladimir
  0 siblings, 0 replies; 23+ messages in thread
From: Medvedkin, Vladimir @ 2019-07-05 16:53 UTC (permalink / raw)
  To: Alex Kiselev
  Cc: Stephen Hemminger, Honnappa Nagarahalli,
	Ruifeng Wang (Arm Technology China),
	bruce.richardson, dev, Gavin Hu (Arm Technology China),
	nd

Hi Alex,

On 05/07/2019 14:37, Alex Kiselev wrote:
> On Fri, 5 Jul 2019 at 13:31, Medvedkin, Vladimir <vladimir.medvedkin@intel.com> wrote:
>> Hi Stephen,
>>
>> On 28/06/2019 16:35, Stephen Hemminger wrote:
>>> On Fri, 28 Jun 2019 15:16:30 +0100
>>> "Medvedkin, Vladimir" <vladimir.medvedkin@intel.com> wrote:
>>>
>>>> Hi Honnappa,
>>>>
>>>> On 28/06/2019 14:57, Honnappa Nagarahalli wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> On 28/06/2019 05:34, Stephen Hemminger wrote:
>>>>>>> On Fri, 28 Jun 2019 02:44:54 +0000
>>>>>>> "Ruifeng Wang (Arm Technology China)"<Ruifeng.Wang@arm.com>  wrote:
>>>>>>>
>>>>>>>>>> Tests showed that the function inlining caused performance drop on
>>>>>>>>>> some x86 platforms with the memory ordering patches applied.
>>>>>>>>>> By force no-inline functions, the performance was better than
>>>>>>>>>> before on x86 and no impact to arm64 platforms.
>>>>>>>>>>
>>>>>>>>>> Suggested-by: Medvedkin Vladimir<vladimir.medvedkin@intel.com>
>>>>>>>>>> Signed-off-by: Ruifeng Wang<ruifeng.wang@arm.com>
>>>>>>>>>> Reviewed-by: Gavin Hu<gavin.hu@arm.com>
>>>>>>>>>      {
>>>>>>>>>
>>>>>>>>> Do you actually need to force noinline or is just taking off inline enough?
>>>>>>>>> In general, letting compiler decide is often best practice.
>>>>>>>> The force noinline is an optimization for x86 platforms to keep
>>>>>>>> rte_lpm_add() API performance with memory ordering applied.
>>>>>>> I don't think you answered my question. What does a recent version of
>>>>>>> GCC do if you drop the inline.
>>>>>>>
>>>>>>> Actually all the functions in rte_lpm should drop inline.
>>>>>> I agree with Stephen. If it is not a fastpath and size of function is not
>>>>>> minimal it is good to remove inline qualifier for other control plane functions
>>>>>> such as rule_add/delete/find/etc and let the compiler decide to inline it
>>>>>> (unless it affects performance).
>>>>> IMO, the rule needs to be simple. If it is control plane function, we should leave it to the compiler to decide. I do not think we need to worry too much about performance for control plane functions.
>>>> Control plane is not as important as data plane speed but it is still
>>>> important. For lpm we are talking not about initialization, but runtime
>>>> routes add/del related functions. If it is very slow the library will be
>>>> totally unusable because after it receives a route update it will be
>>>> blocked for a long time and route update queue would overflow.
>>> Control plane performance is more impacted by algorithmic choice.
>>> The original LPM had terrible (n^2?) control path. Current code is better.
>>> I had a patch using RB tree, but it was rejected because it used the
>>> /usr/include/bsd/sys/tree.h which added a dependency.
>> You're absolutely right,  control plane performance is mostly depends on
>> algorithm. Current LPM implementation has number of problems there. One
>> problem is rules_tbl[] that is a flat array containing routes for
>> control plane purposes. Replacing it with a rb-tree solves this problem,
>> but there are another problems. For example, when you try to add a route
>> 10.0.0.0/8 while a number of subroutes are exist in the table (for
>> example 10.20.0.0/16), current implementation will load tbl_entry -> do
>> some checks (depth, ext entry) -> conditionally store new entry. Under
>> several circumstances it would take a lot time.  But in fact it needs to
>> unconditionally rewrite only two ranges - from 10.0.0.0 to 10.19.255.255
>> and from 10.21.0.0 to 10.255.255.255. And control plane could help us to
>> get this two ranges. The best struct to do so is lc-tree because it is
>> relatively easy to traverse subtree (described by 10.0.0.0/8) and get
>> subroutes. We are working on a new implementation, hopefully it will be
>> ready soon.
> Have you considered switching to this algorithm?
> http://www.nxlab.fer.hr/dxr/
I considered DXR (and not only DXR - poptrie, for example). There are a 
number of pros and cons compared to DIR24-8. In my opinion it would be great 
to provide an option to choose an algorithm for your routing table.
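
Purely as a hypothetical shape (not an existing DPDK API), such a per-table
choice could look like:

    /* Hypothetical per-table configuration letting the application
     * pick the dataplane algorithm.
     */
    enum lpm_algo {
            LPM_ALGO_DIR24_8,       /* current rte_lpm layout */
            LPM_ALGO_DXR,
            LPM_ALGO_POPTRIE,
    };

    struct lpm_config {
            enum lpm_algo algo;
            uint32_t max_rules;
            uint32_t number_tbl8s;  /* meaningful for DIR24-8 only */
    };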
>
-- 
Regards,
Vladimir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20
  2019-07-05 13:45       ` Alex Kiselev
@ 2019-07-05 16:56         ` Medvedkin, Vladimir
  0 siblings, 0 replies; 23+ messages in thread
From: Medvedkin, Vladimir @ 2019-07-05 16:56 UTC (permalink / raw)
  To: Alex Kiselev, Honnappa Nagarahalli
  Cc: Ruifeng Wang (Arm Technology China),
	bruce.richardson, dev, Gavin Hu (Arm Technology China),
	nd

Hi Alex,

On 05/07/2019 14:45, Alex Kiselev wrote:
> Hi,
>
>> <snip>
>>
>>> As a general remark consider writing all of the tbl entries including
>>> tbl8 with atomic_store. Now "lpm->tbl8[j] = new_tbl8_entry;" looks like
>>>
>>>        1e9:       44 88 9c 47 40 01 00    mov
>>> %r11b,0x2000140(%rdi,%rax,2) <-write first byte
>>>        1f0:       02
>>>        1f1:       48 83 c0 01             add    $0x1,%rax
>>>        1f5:       42 88 8c 47 41 01 00    mov %cl,0x2000141(%rdi,%r8,2) <-write
>>> second byte
>>>        1fc:       02
>>>
>>> This may cause an incorrect nexthop to be returned. If the byte with valid flag
>>> is updated first, the old (and maybe invalid) next hop could be returned.
>> +1
>>
>> It is surprising that the compiler is not generating a single 32b store. As you mentioned 'relaxed' __atomic_store_n should be good.
> Am I right that the x86 platform is not affected by the bug, since
> store-store could not be reordered on x86?
You're right, x86 shouldn't be affected. If I understand the asm correctly, 
the next hop is written first for both _v20 and _v1604, and memory stores 
are retired in order.
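
One caveat worth adding (an assumption about codegen rather than something
verified here): the hardware only preserves whatever order the compiler
emitted, so the portable fix is still to take the choice away from the
compiler. Reusing the names from the quoted snippet:

    /* Unsafe in principle on any platform: a plain struct assignment
     * lets the compiler write the valid-flag byte first.
     */
    lpm->tbl8[j] = new_tbl8_entry;

    /* Portable: one relaxed atomic store of the whole entry. */
    __atomic_store(&lpm->tbl8[j], &new_tbl8_entry, __ATOMIC_RELAXED);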

-- 
Regards,
Vladimir


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread

Thread overview: 23+ messages
2019-06-27  9:37 [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions Ruifeng Wang
2019-06-27  9:37 ` [dpdk-dev] [PATCH v3 2/3] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang
2019-06-27  9:37 ` [dpdk-dev] [PATCH v3 3/3] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang
2019-06-28 13:33   ` Medvedkin, Vladimir
2019-06-29 17:35     ` Honnappa Nagarahalli
2019-07-05 13:45       ` Alex Kiselev
2019-07-05 16:56         ` Medvedkin, Vladimir
2019-07-01  7:08     ` Ruifeng Wang (Arm Technology China)
2019-06-27 15:24 ` [dpdk-dev] [PATCH v3 1/3] lib/lpm: not inline unnecessary functions Stephen Hemminger
2019-06-28  2:44   ` Ruifeng Wang (Arm Technology China)
2019-06-28  4:34     ` Stephen Hemminger
2019-06-28  5:48       ` Ruifeng Wang (Arm Technology China)
2019-06-28 13:47       ` Medvedkin, Vladimir
2019-06-28 13:57         ` Honnappa Nagarahalli
2019-06-28 14:16           ` Medvedkin, Vladimir
2019-06-28 15:35             ` Stephen Hemminger
2019-07-01  6:44               ` Ruifeng Wang (Arm Technology China)
2019-07-05 10:40                 ` Medvedkin, Vladimir
2019-07-05 10:58                   ` Ruifeng Wang (Arm Technology China)
2019-07-05 10:31               ` Medvedkin, Vladimir
2019-07-05 13:37                 ` Alex Kiselev
2019-07-05 16:53                   ` Medvedkin, Vladimir
2019-06-28 13:38   ` Medvedkin, Vladimir
