* [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604

From: Ruifeng Wang @ 2019-06-05  5:54 UTC
To: bruce.richardson, vladimir.medvedkin
Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang
24+ messages in thread

When a tbl8 group is getting attached to a tbl24 entry, a lookup
might fail even though the entry is configured in the table.

For example, consider an LPM table configured with 10.10.10.1/24.
When a new entry 10.10.10.32/28 is added, a new tbl8 group is
allocated and the tbl24 entry is changed to point to the tbl8 group.
If the tbl24 entry is written before the tbl8 group entries are
updated, a lookup on 10.10.10.9 will return failure.

Correct memory orderings are required to ensure that the store to
tbl24 does not happen before the stores to the tbl8 group entries
complete.

The orderings have an impact on the LPM performance test.
On an Arm A72 platform, the delete operation shows 2.7% degradation,
while add / lookup shows no notable performance change.
On an x86 E5 platform, the add operation shows 4.3% degradation, the
delete operation shows 2.2% - 10.2% degradation, and lookup shows no
performance change.
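To see why 10.10.10.9 is affected by the 10.10.10.32/28 add, note that every address in 10.10.10.0/24 shares one tbl24 entry, so repointing that entry at a half-initialized tbl8 group breaks lookups for the whole /24. A rough sketch of the index split (helper and macro names are illustrative, not the DPDK API):

```c
#include <stdint.h>

/* tbl24 is indexed by the top 24 bits of the IPv4 address; once a
 * tbl24 entry points to a tbl8 group, the low 8 bits select the
 * entry inside that group. */
static inline uint32_t tbl24_index(uint32_t ip) { return ip >> 8; }
static inline uint32_t tbl8_offset(uint32_t ip) { return ip & 0xff; }

/* Build a host-order IPv4 address from dotted-quad components. */
#define IPV4(a, b, c, d) \
	((uint32_t)(a) << 24 | (uint32_t)(b) << 16 | \
	 (uint32_t)(c) << 8 | (uint32_t)(d))
```

With these helpers, 10.10.10.9 and 10.10.10.32 map to the same tbl24 slot, which is exactly the entry the add path rewrites.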
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++-------
 lib/librte_lpm/rte_lpm.h |  4 ++++
 2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
index 6b7b28a2e..6ec450a08 100644
--- a/lib/librte_lpm/rte_lpm.c
+++ b/lib/librte_lpm/rte_lpm.c
@@ -806,7 +806,8 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth,
 			/* Setting tbl24 entry in one go to avoid race
 			 * conditions
 			 */
-			lpm->tbl24[i] = new_tbl24_entry;
+			__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
+					__ATOMIC_RELEASE);
 
 			continue;
 		}
@@ -1017,7 +1018,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
 			.depth = 0,
 		};
 
-		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		/* The tbl24 entry must be written only after the
+		 * tbl8 entries are written.
+		 */
+		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
+				__ATOMIC_RELEASE);
 
 	} /* If valid entry but not extended calculate the index into Table8. */
 	else if (lpm->tbl24[tbl24_index].valid_group == 0) {
@@ -1063,7 +1068,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth,
 			.depth = 0,
 		};
 
-		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		/* The tbl24 entry must be written only after the
+		 * tbl8 entries are written.
+		 */
+		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
+				__ATOMIC_RELEASE);
 
 	} else { /*
 		* If it is valid, extended entry calculate the index into tbl8.
@@ -1391,6 +1400,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 	/* Calculate the range and index into Table24. */
 	tbl24_range = depth_to_range(depth);
 	tbl24_index = (ip_masked >> 8);
+	struct rte_lpm_tbl_entry zero_tbl24_entry = {0};
 
 	/*
 	 * Firstly check the sub_rule_index. A -1 indicates no replacement rule
@@ -1405,7 +1415,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 
 		if (lpm->tbl24[i].valid_group == 0 &&
 				lpm->tbl24[i].depth <= depth) {
-			lpm->tbl24[i].valid = INVALID;
+			__atomic_store(&lpm->tbl24[i],
+				&zero_tbl24_entry, __ATOMIC_RELEASE);
 		} else if (lpm->tbl24[i].valid_group == 1) {
 			/*
 			 * If TBL24 entry is extended, then there has
@@ -1450,7 +1461,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 
 		if (lpm->tbl24[i].valid_group == 0 &&
 				lpm->tbl24[i].depth <= depth) {
-			lpm->tbl24[i] = new_tbl24_entry;
+			__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
+					__ATOMIC_RELEASE);
 		} else if (lpm->tbl24[i].valid_group == 1) {
 			/*
 			 * If TBL24 entry is extended, then there has
@@ -1713,8 +1725,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 	tbl8_recycle_index = tbl8_recycle_check_v1604(lpm->tbl8, tbl8_group_start);
 
 	if (tbl8_recycle_index == -EINVAL) {
-		/* Set tbl24 before freeing tbl8 to avoid race condition. */
+		/* Set tbl24 before freeing tbl8 to avoid race condition.
+		 * Prevent the free of the tbl8 group from hoisting.
+		 */
 		lpm->tbl24[tbl24_index].valid = 0;
+		__atomic_thread_fence(__ATOMIC_RELEASE);
 		tbl8_free_v1604(lpm->tbl8, tbl8_group_start);
 	} else if (tbl8_recycle_index > -1) {
 		/* Update tbl24 entry. */
@@ -1725,8 +1740,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked,
 			.depth = lpm->tbl8[tbl8_recycle_index].depth,
 		};
 
-		/* Set tbl24 before freeing tbl8 to avoid race condition. */
+		/* Set tbl24 before freeing tbl8 to avoid race condition.
+		 * Prevent the free of the tbl8 group from hoisting.
+		 */
 		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		__atomic_thread_fence(__ATOMIC_RELEASE);
 		tbl8_free_v1604(lpm->tbl8, tbl8_group_start);
 	}
 #undef group_idx
diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
index b886f54b4..6f5704c5c 100644
--- a/lib/librte_lpm/rte_lpm.h
+++ b/lib/librte_lpm/rte_lpm.h
@@ -354,6 +354,10 @@ rte_lpm_lookup(struct rte_lpm *lpm, uint32_t ip, uint32_t *next_hop)
 	ptbl = (const uint32_t *)(&lpm->tbl24[tbl24_index]);
 	tbl_entry = *ptbl;
 
+	/* Memory ordering is not required in lookup. Because dataflow
+	 * dependency exists, compiler or HW won't be able to re-order
+	 * the operations.
+	 */
 	/* Copy tbl8 entry (only if needed) */
 	if (unlikely((tbl_entry & RTE_LPM_VALID_EXT_ENTRY_BITMASK) ==
 			RTE_LPM_VALID_EXT_ENTRY_BITMASK)) {
-- 
2.17.1
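The publish ordering the patch enforces can be sketched outside of DPDK as follows. The structure names and sizes are invented for illustration; the real code stores `struct rte_lpm_tbl_entry` values, not plain `uint32_t`:

```c
#include <stdint.h>

#define GROUP_SIZE 4

static uint32_t tbl8_group[GROUP_SIZE];
static uint32_t tbl24_entry;	/* 0 means "no tbl8 group attached" */

static void attach_tbl8_group(uint32_t next_hop)
{
	/* 1. Fill in the tbl8 group entries first. */
	for (int i = 0; i < GROUP_SIZE; i++)
		tbl8_group[i] = next_hop;

	/* 2. Publish the group with a release store: the stores above
	 * cannot be reordered past this one. On x86 this costs only a
	 * compiler barrier; on Arm it emits a store-release. */
	__atomic_store_n(&tbl24_entry, 1, __ATOMIC_RELEASE);
}

static uint32_t lookup(uint32_t offset)
{
	uint32_t e = __atomic_load_n(&tbl24_entry, __ATOMIC_ACQUIRE);

	/* In the real lookup, the data dependency through the loaded
	 * entry orders the tbl8 read; the acquire load here is the
	 * conservative textbook pairing. */
	if (e != 0)
		return tbl8_group[offset];
	return UINT32_MAX;	/* lookup miss */
}
```

Without the release store, a reader that observes the new `tbl24_entry` could still read a stale, uninitialized `tbl8_group[]` slot, which is exactly the failed 10.10.10.9 lookup from the commit message.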
* [dpdk-dev] [PATCH v1 2/2] lib/lpm: memory orderings to avoid race conditions for v20

From: Ruifeng Wang @ 2019-06-05  5:54 UTC
To: bruce.richardson, vladimir.medvedkin
Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang

When a tbl8 group is getting attached to a tbl24 entry, a lookup
might fail even though the entry is configured in the table.

For example, consider an LPM table configured with 10.10.10.1/24.
When a new entry 10.10.10.32/28 is added, a new tbl8 group is
allocated and the tbl24 entry is changed to point to the tbl8 group.
If the tbl24 entry is written before the tbl8 group entries are
updated, a lookup on 10.10.10.9 will return failure.

Correct memory orderings are required to ensure that the store to
tbl24 does not happen before the stores to the tbl8 group entries
complete.
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_lpm/rte_lpm.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c
index 6ec450a08..0addff5d4 100644
--- a/lib/librte_lpm/rte_lpm.c
+++ b/lib/librte_lpm/rte_lpm.c
@@ -737,7 +737,8 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth,
 			/* Setting tbl24 entry in one go to avoid race
 			 * conditions
 			 */
-			lpm->tbl24[i] = new_tbl24_entry;
+			__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
+					__ATOMIC_RELEASE);
 
 			continue;
 		}
@@ -892,7 +893,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 			.depth = 0,
 		};
 
-		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
+				__ATOMIC_RELEASE);
 
 	} /* If valid entry but not extended calculate the index into Table8. */
 	else if (lpm->tbl24[tbl24_index].valid_group == 0) {
@@ -938,7 +940,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth,
 			.depth = 0,
 		};
 
-		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		__atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry,
+				__ATOMIC_RELEASE);
 
 	} else { /*
 		* If it is valid, extended entry calculate the index into tbl8.
@@ -1320,7 +1323,14 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 
 		if (lpm->tbl24[i].valid_group == 0 &&
 				lpm->tbl24[i].depth <= depth) {
-			lpm->tbl24[i].valid = INVALID;
+			struct rte_lpm_tbl_entry_v20 zero_tbl_entry = {
+				.valid = INVALID,
+				.depth = 0,
+				.valid_group = 0,
+			};
+			zero_tbl_entry.next_hop = 0;
+			__atomic_store(&lpm->tbl24[i], &zero_tbl_entry,
+					__ATOMIC_RELEASE);
 		} else if (lpm->tbl24[i].valid_group == 1) {
 			/*
 			 * If TBL24 entry is extended, then there has
@@ -1365,7 +1375,8 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 
 		if (lpm->tbl24[i].valid_group == 0 &&
 				lpm->tbl24[i].depth <= depth) {
-			lpm->tbl24[i] = new_tbl24_entry;
+			__atomic_store(&lpm->tbl24[i], &new_tbl24_entry,
+					__ATOMIC_RELEASE);
 		} else if (lpm->tbl24[i].valid_group == 1) {
 			/*
 			 * If TBL24 entry is extended, then there has
@@ -1647,8 +1658,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 	tbl8_recycle_index = tbl8_recycle_check_v20(lpm->tbl8, tbl8_group_start);
 
 	if (tbl8_recycle_index == -EINVAL) {
-		/* Set tbl24 before freeing tbl8 to avoid race condition. */
+		/* Set tbl24 before freeing tbl8 to avoid race condition.
+		 * Prevent the free of the tbl8 group from hoisting.
+		 */
 		lpm->tbl24[tbl24_index].valid = 0;
+		__atomic_thread_fence(__ATOMIC_RELEASE);
 		tbl8_free_v20(lpm->tbl8, tbl8_group_start);
 	} else if (tbl8_recycle_index > -1) {
 		/* Update tbl24 entry. */
@@ -1659,8 +1673,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked,
 			.depth = lpm->tbl8[tbl8_recycle_index].depth,
 		};
 
-		/* Set tbl24 before freeing tbl8 to avoid race condition. */
+		/* Set tbl24 before freeing tbl8 to avoid race condition.
+		 * Prevent the free of the tbl8 group from hoisting.
+		 */
 		lpm->tbl24[tbl24_index] = new_tbl24_entry;
+		__atomic_thread_fence(__ATOMIC_RELEASE);
 		tbl8_free_v20(lpm->tbl8, tbl8_group_start);
 	}
-- 
2.17.1
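The delete path above replaces a single-bitfield write (`lpm->tbl24[i].valid = INVALID`) with a whole-entry store, so readers never see a half-updated entry. The pattern, on a sketched 16-bit entry in the spirit of `rte_lpm_tbl_entry_v20` (the field layout here is illustrative, not the exact DPDK definition):

```c
#include <stdint.h>

struct tbl_entry_v20 {
	uint16_t next_hop    : 8;
	uint16_t valid       : 1;
	uint16_t valid_group : 1;
	uint16_t depth       : 6;
};

static struct tbl_entry_v20 tbl24[16];

static void invalidate_entry(int i)
{
	/* Build the fully-zeroed entry first, then store the whole
	 * 16-bit value with one release store, instead of a
	 * read-modify-write on the valid bitfield. */
	struct tbl_entry_v20 zero_entry = {
		.valid = 0,
		.depth = 0,
		.valid_group = 0,
	};
	zero_entry.next_hop = 0;
	__atomic_store(&tbl24[i], &zero_entry, __ATOMIC_RELEASE);
}
```

A plain `tbl24[i].valid = 0` would compile to a load, bit-clear, and store of the containing word, which a concurrent lookup could interleave with.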
* Re: [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604

From: Medvedkin, Vladimir @ 2019-06-05 10:50 UTC
To: Ruifeng Wang, bruce.richardson
Cc: dev, honnappa.nagarahalli, gavin.hu, nd

Hi Wang,

On 05/06/2019 06:54, Ruifeng Wang wrote:
> When a tbl8 group is getting attached to a tbl24 entry, lookup
> might fail even though the entry is configured in the table.
>
> For ex: consider a LPM table configured with 10.10.10.1/24.
> When a new entry 10.10.10.32/28 is being added, a new tbl8
> group is allocated and tbl24 entry is changed to point to
> the tbl8 group. If the tbl24 entry is written without the tbl8
> group entries updated, a lookup on 10.10.10.9 will return
> failure.
>
> Correct memory orderings are required to ensure that the
> store to tbl24 does not happen before the stores to tbl8 group
> entries complete.
>
> The orderings have impact on LPM performance test.
> On Arm A72 platform, delete operation has 2.7% degradation, while
> add / lookup has no notable performance change.
> On x86 E5 platform, add operation has 4.3% degradation, delete
> operation has 2.2% - 10.2% degradation, lookup has no performance
> change.

I think it is possible to avoid the add/del performance degradation:

1. Explicitly mark struct rte_lpm_tbl_entry 4-byte aligned.

2. Cast the value to uint32_t (uint16_t for the 2.0 version) on the
memory write.

3. Use rte_wmb() after the memory write.

> [...]

-- 
Regards,
Vladimir
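The alternative suggested in the reply above can be sketched as follows. The entry layout, helper names, and the inline-asm compiler barrier standing in for `rte_wmb()`/`rte_smp_wmb()` are all illustrative assumptions, not the DPDK implementation:

```c
#include <stdint.h>

/* A 4-byte-aligned 32-bit entry: one aligned 32-bit store updates it
 * atomically with respect to readers. */
struct tbl_entry {
	uint32_t next_hop : 24;
	uint32_t valid    : 1;
	uint32_t ext      : 1;
	uint32_t depth    : 6;
} __attribute__((aligned(4)));

static struct tbl_entry tbl24[16];

static void write_entry(int i, struct tbl_entry e)
{
	uint32_t raw;

	/* Reinterpret the entry as uint32_t so the compiler emits a
	 * single aligned 32-bit store; a concurrent reader can never
	 * observe a torn, half-written entry. */
	__builtin_memcpy(&raw, &e, sizeof(raw));
	*(volatile uint32_t *)&tbl24[i] = raw;

	/* rte_wmb()/rte_smp_wmb() would follow here in DPDK; on x86
	 * that reduces to a compiler barrier, sketched below. On Arm
	 * a real barrier instruction (dmb) would be needed. */
	__asm__ volatile ("" ::: "memory");
}
```

The trade-off, raised later in the thread, is that this hard-codes per-architecture barrier knowledge that `__atomic_store(__ATOMIC_RELEASE)` expresses portably.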
* Re: [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604

From: Ruifeng Wang (Arm Technology China) @ 2019-06-05 14:12 UTC
To: Medvedkin, Vladimir; bruce.richardson
Cc: dev; Honnappa Nagarahalli; Gavin Hu (Arm Technology China); nd

Hi Vladimir,

> -----Original Message-----
> From: Medvedkin, Vladimir <vladimir.medvedkin@intel.com>
> Sent: Wednesday, June 5, 2019 18:50
> To: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com>;
> bruce.richardson@intel.com
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>
> Subject: Re: [PATCH v1 1/2] lib/lpm: memory orderings to avoid race
> conditions for v1604
>
> Hi Wang,
>
> [...]
>
> I think it is possible to avoid add/del performance degradation
>
> 1. Explicitly mark struct rte_lpm_tbl_entry 4-byte aligned
>
> 2. Cast value to uint32_t (uint16_t for 2.0 version) on memory write
>
> 3. Use rte_wmb() after memory write

Thanks for your suggestions.
Points 1 & 2 make sense.

For point 3, are you suggesting using rte_wmb() instead of
__atomic_store()? rte_wmb() belongs to DPDK's own memory model. Maybe we
can use __atomic_store() with 'RTE_USE_C11_MEM_MODEL=y', and use
rte_wmb() otherwise?

> [...]

Regards,
/Ruifeng
* Re: [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604

From: Honnappa Nagarahalli @ 2019-06-05 19:23 UTC
To: Ruifeng Wang (Arm Technology China), Medvedkin, Vladimir, bruce.richardson
Cc: dev, Gavin Hu (Arm Technology China), nd

> > I think it is possible to avoid add/del performance degradation
My understanding was that the degradation on x86 is happening because of
the additional compiler barriers this patch introduces. For the Arm
platform, the degradation is caused by the store-release memory barriers.

> > 1. Explicitly mark struct rte_lpm_tbl_entry 4-byte aligned
The 'rte_lpm_tbl_entry' is already 32b; shouldn't it be aligned on a
4-byte boundary already?

> > 2. Cast value to uint32_t (uint16_t for 2.0 version) on memory write
> >
> > 3. Use rte_wmb() after memory write
(It would be good to point out the locations in the patch.) I assume you
are referring to __atomic_store(__ATOMIC_RELEASE). I am wondering if
rte_wmb() is required? My understanding is that x86 would require just a
compiler barrier. So, should it be rte_smp_wmb()?
__atomic_store(__ATOMIC_RELEASE) just adds a compiler barrier for x86.

> Thanks for your suggestions.
> Point 1 & 2 make sense.
>
> For point 3, are you suggesting using rte_wmb() instead of
> __atomic_store()? rte_wmb() is DPDK made memory model. Maybe we can use
> __atomic_store() with 'RTE_USE_C11_MEM_MODEL=y', and use rte_wmb()
> otherwise?
IMO, code becomes difficult to manage.

> [...]
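The other ordering discussed in the thread is on the delete path: the tbl24 entry must be detached before the tbl8 group is freed (recycled), and the release fence keeps the free from being hoisted above the detach. A minimal sketch, with all names invented for illustration rather than taken from the DPDK implementation:

```c
#include <stdint.h>
#include <string.h>

#define GROUP_SIZE 4

static uint32_t tbl24_entry;		/* nonzero: points at the group */
static uint32_t tbl8_group[GROUP_SIZE];
static int tbl8_group_free;		/* 1: group is back in the pool */

/* Stand-in for tbl8_free(): scrub the group and return it to the pool,
 * after which its entries may be reused for an unrelated route. */
static void tbl8_free(void)
{
	memset(tbl8_group, 0, sizeof(tbl8_group));
	tbl8_group_free = 1;
}

static void delete_route(void)
{
	/* Detach tbl24 first... */
	tbl24_entry = 0;

	/* ...and fence so the stores inside tbl8_free() cannot be
	 * reordered before the detach. On x86 this is only a compiler
	 * barrier; on Arm it emits a dmb. */
	__atomic_thread_fence(__ATOMIC_RELEASE);

	tbl8_free();
}
```

Without the fence, a lookup racing with delete could observe the still-attached tbl24 entry while the group underneath it was already scrubbed or reused.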
* Re: [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 2019-06-05 19:23 ` Honnappa Nagarahalli @ 2019-06-10 15:22 ` Medvedkin, Vladimir 2019-06-17 15:27 ` Ruifeng Wang (Arm Technology China) 0 siblings, 1 reply; 24+ messages in thread From: Medvedkin, Vladimir @ 2019-06-10 15:22 UTC (permalink / raw) To: Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China), bruce.richardson Cc: dev, Gavin Hu (Arm Technology China), nd Hi Honnappa, Wang, On 05/06/2019 20:23, Honnappa Nagarahalli wrote: >>> Hi Wang, >>> >>> On 05/06/2019 06:54, Ruifeng Wang wrote: >>>> When a tbl8 group is getting attached to a tbl24 entry, lookup might >>>> fail even though the entry is configured in the table. >>>> >>>> For ex: consider a LPM table configured with 10.10.10.1/24. >>>> When a new entry 10.10.10.32/28 is being added, a new tbl8 group is >>>> allocated and tbl24 entry is changed to point to the tbl8 group. If >>>> the tbl24 entry is written without the tbl8 group entries updated, a >>>> lookup on 10.10.10.9 will return failure. >>>> >>>> Correct memory orderings are required to ensure that the store to >>>> tbl24 does not happen before the stores to tbl8 group entries >>>> complete. >>>> >>>> The orderings have impact on LPM performance test. >>>> On Arm A72 platform, delete operation has 2.7% degradation, while >>>> add / lookup has no notable performance change. >>>> On x86 E5 platform, add operation has 4.3% degradation, delete >>>> operation has 2.2% - 10.2% degradation, lookup has no performance >>>> change. >>> I think it is possible to avoid add/del performance degradation > My understanding was that the degradation on x86, is happening because of the additional compiler barriers this patch introduces. For Arm platform the degradation is caused by the store-release memory barriers. Just made some tests on skylake and sandy bridge. On the Skylake there is no performance degradation after applying this patchset. 
On the Sandybridge there is a performance drop for rte_lpm_add() (from 460k cycles to 530k cycles in the lpm_performance unit test). This is caused by 1 chunk of this patchset (add_depth_small_v1604()). And it looks like after uninlining of this function, performance gets back to the original 460k cycles it was before the patch. > >>> 1. Explicitly mark struct rte_lpm_tbl_entry 4-byte aligned > The ' rte_lpm_tbl_entry' is already 32b, shouldn't it be aligned on 4-byte boundary already? > >>> 2. Cast value to uint32_t (uint16_t for 2.0 version) on memory write >>> >>> 3. Use rte_wmb() after memory write > (It would be good to point the locations in the patch). I assume you are referring to __atomic_store(__ATOMIC_RELEASE). I am wondering if rte_wmb() is required? My understanding is that x86 would require just a compiler barrier. So, should it be rte_smp_wmb()? __atomic_store(__ATOMIC_RELEASE) just adds a compiler barrier for x86. You're right, it needs just a compiler barrier for x86 and a memory barrier instruction (dmb ?) for arm, so rte_smp_wmb() looks appropriate here as well as __atomic_store(__ATOMIC_RELEASE). > >> Thanks for your suggestions. >> Point 1 & 2 make sense. >> >> For point 3, are you suggesting using rte_wmb() instead of __atomic_store()? >> rte_wmb() is a DPDK-made memory model. Maybe we can use __atomic_store() >> with 'RTE_USE_C11_MEM_MODEL=y', and use rte_wmb() otherwise? > IMO, code becomes difficult to manage. 
> >>>> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> >>>> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> >>>> --- >>>> lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++------- >>>> lib/librte_lpm/rte_lpm.h | 4 ++++ >>>> 2 files changed, 29 insertions(+), 7 deletions(-) >>>> >>>> diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c >>>> index >>>> 6b7b28a2e..6ec450a08 100644 >>>> --- a/lib/librte_lpm/rte_lpm.c >>>> +++ b/lib/librte_lpm/rte_lpm.c >>>> @@ -806,7 +806,8 @@ add_depth_small_v1604(struct rte_lpm *lpm, >>> uint32_t ip, uint8_t depth, >>>> /* Setting tbl24 entry in one go to avoid race >>>> * conditions >>>> */ >>>> - lpm->tbl24[i] = new_tbl24_entry; >>>> + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, >>>> + __ATOMIC_RELEASE); I don't see reordering issue here in this patch chunk. However direct assignment was translated to 2 MOV ops mov (%rdi,%rcx,4),%edx <-- get lpm->tbl24[i] and $0xff000000,%edx <-- clean .next_hop or %r9d,%edx <-- save new next_hop mov %edx,(%rdi,%rcx,4) <-- save an entry with new next_hop but old depth and valid bitfields mov %r11b,0x3(%rdi,%rcx,4) <-- save new depth and valid bitfields so agree with __atomic_store() here. >>>> >>>> continue; >>>> } >>>> @@ -1017,7 +1018,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, >>> uint32_t ip_masked, uint8_t depth, >>>> .depth = 0, >>>> }; >>>> >>>> - lpm->tbl24[tbl24_index] = new_tbl24_entry; >>>> + /* The tbl24 entry must be written only after the >>>> + * tbl8 entries are written. >>>> + */ >>>> + __atomic_store(&lpm->tbl24[tbl24_index], >>> &new_tbl24_entry, >>>> + __ATOMIC_RELEASE); >>>> >>>> } /* If valid entry but not extended calculate the index into Table8. 
*/ >>>> else if (lpm->tbl24[tbl24_index].valid_group == 0) { @@ -1063,7 >>>> +1068,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t >>> ip_masked, uint8_t depth, >>>> .depth = 0, >>>> }; >>>> >>>> - lpm->tbl24[tbl24_index] = new_tbl24_entry; >>>> + /* The tbl24 entry must be written only after the >>>> + * tbl8 entries are written. >>>> + */ >>>> + __atomic_store(&lpm->tbl24[tbl24_index], >>> &new_tbl24_entry, >>>> + __ATOMIC_RELEASE); >>>> >>>> } else { /* >>>> * If it is valid, extended entry calculate the index into tbl8. >>>> @@ -1391,6 +1400,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, >>> uint32_t ip_masked, >>>> /* Calculate the range and index into Table24. */ >>>> tbl24_range = depth_to_range(depth); >>>> tbl24_index = (ip_masked >> 8); >>>> + struct rte_lpm_tbl_entry zero_tbl24_entry = {0}; >>>> >>>> /* >>>> * Firstly check the sub_rule_index. A -1 indicates no >>>> replacement rule @@ -1405,7 +1415,8 @@ >>>> delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, >>>> >>>> if (lpm->tbl24[i].valid_group == 0 && >>>> lpm->tbl24[i].depth <= depth) { >>>> - lpm->tbl24[i].valid = INVALID; >>>> + __atomic_store(&lpm->tbl24[i], >>>> + &zero_tbl24_entry, >>> __ATOMIC_RELEASE); >>>> } else if (lpm->tbl24[i].valid_group == 1) { >>>> /* >>>> * If TBL24 entry is extended, then there has >>> @@ -1450,7 +1461,8 >>>> @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, >>>> >>>> if (lpm->tbl24[i].valid_group == 0 && >>>> lpm->tbl24[i].depth <= depth) { >>>> - lpm->tbl24[i] = new_tbl24_entry; >>>> + __atomic_store(&lpm->tbl24[i], >>> &new_tbl24_entry, >>>> + __ATOMIC_RELEASE); >>>> } else if (lpm->tbl24[i].valid_group == 1) { >>>> /* >>>> * If TBL24 entry is extended, then there has >>> @@ -1713,8 >>>> +1725,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t >>> ip_masked, >>>> tbl8_recycle_index = tbl8_recycle_check_v1604(lpm->tbl8, >>>> tbl8_group_start); >>>> >>>> if (tbl8_recycle_index == -EINVAL) { >>>> - /* Set tbl24 
before freeing tbl8 to avoid race condition. */ >>>> + /* Set tbl24 before freeing tbl8 to avoid race condition. >>>> + * Prevent the free of the tbl8 group from hoisting. >>>> + */ >>>> lpm->tbl24[tbl24_index].valid = 0; >>>> + __atomic_thread_fence(__ATOMIC_RELEASE); >>>> tbl8_free_v1604(lpm->tbl8, tbl8_group_start); >>>> } else if (tbl8_recycle_index > -1) { >>>> /* Update tbl24 entry. */ >>>> @@ -1725,8 +1740,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, >>> uint32_t ip_masked, >>>> .depth = lpm->tbl8[tbl8_recycle_index].depth, >>>> }; >>>> >>>> - /* Set tbl24 before freeing tbl8 to avoid race condition. */ >>>> + /* Set tbl24 before freeing tbl8 to avoid race condition. >>>> + * Prevent the free of the tbl8 group from hoisting. >>>> + */ >>>> lpm->tbl24[tbl24_index] = new_tbl24_entry; >>>> + __atomic_thread_fence(__ATOMIC_RELEASE); >>>> tbl8_free_v1604(lpm->tbl8, tbl8_group_start); >>>> } >>>> #undef group_idx >>>> diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h >>>> index b886f54b4..6f5704c5c 100644 >>>> --- a/lib/librte_lpm/rte_lpm.h >>>> +++ b/lib/librte_lpm/rte_lpm.h >>>> @@ -354,6 +354,10 @@ rte_lpm_lookup(struct rte_lpm *lpm, uint32_t >>>> ip, >>> uint32_t *next_hop) >>>> ptbl = (const uint32_t *)(&lpm->tbl24[tbl24_index]); >>>> tbl_entry = *ptbl; >>>> >>>> + /* Memory ordering is not required in lookup. Because dataflow >>>> + * dependency exists, compiler or HW won't be able to re-order >>>> + * the operations. >>>> + */ >>>> /* Copy tbl8 entry (only if needed) */ >>>> if (unlikely((tbl_entry & RTE_LPM_VALID_EXT_ENTRY_BITMASK) == >>>> RTE_LPM_VALID_EXT_ENTRY_BITMASK)) { >>> -- >>> Regards, >>> Vladimir >> Regards, >> /Ruifeng -- Regards, Vladimir ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 2019-06-10 15:22 ` Medvedkin, Vladimir @ 2019-06-17 15:27 ` Ruifeng Wang (Arm Technology China) 2019-06-17 15:33 ` Medvedkin, Vladimir 0 siblings, 1 reply; 24+ messages in thread From: Ruifeng Wang (Arm Technology China) @ 2019-06-17 15:27 UTC (permalink / raw) To: Medvedkin, Vladimir, Honnappa Nagarahalli, bruce.richardson Cc: dev, Gavin Hu (Arm Technology China), nd, nd Hi Vladimir, From: Medvedkin, Vladimir <vladimir.medvedkin@intel.com> Sent: Monday, June 10, 2019 23:23 To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com>; bruce.richardson@intel.com Cc: dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com> Subject: Re: [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 Hi Honnappa, Wang, On 05/06/2019 20:23, Honnappa Nagarahalli wrote: Hi Wang, On 05/06/2019 06:54, Ruifeng Wang wrote: When a tbl8 group is getting attached to a tbl24 entry, lookup might fail even though the entry is configured in the table. For ex: consider a LPM table configured with 10.10.10.1/24. When a new entry 10.10.10.32/28 is being added, a new tbl8 group is allocated and tbl24 entry is changed to point to the tbl8 group. If the tbl24 entry is written without the tbl8 group entries updated, a lookup on 10.10.10.9 will return failure. Correct memory orderings are required to ensure that the store to tbl24 does not happen before the stores to tbl8 group entries complete. The orderings have impact on LPM performance test. On Arm A72 platform, delete operation has 2.7% degradation, while add / lookup has no notable performance change. On x86 E5 platform, add operation has 4.3% degradation, delete operation has 2.2% - 10.2% degradation, lookup has no performance change. 
I think it is possible to avoid add/del performance degradation My understanding was that the degradation on x86, is happening because of the additional compiler barriers this patch introduces. For Arm platform the degradation is caused by the store-release memory barriers. Just made some tests on skylake and sandy bridge. On the Skylake there is no performance degradation after applying this patchset. On the Sandybridge there is performance drop for rte_lpm_add() (from 460k cycles to 530k cycles in lpm_performance unit test). This is caused by 1 chunk of this patchset (add_depth_small_v1604() ). And it looks like after uninlining of this function performance get back to original 460k cycles it was before patch. [Ruifeng] Are you suggesting to un-inline add_depth_small_v1604()? I'm OK with such change since the function is too big and is not necessary to be inlined. 1. Explicitly mark struct rte_lpm_tbl_entry 4-byte aligned The ' rte_lpm_tbl_entry' is already 32b, shouldn't it be aligned on 4-byte boundary already? 2. Cast value to uint32_t (uint16_t for 2.0 version) on memory write 3. Use rte_wmb() after memory write (It would be good to point the locations in the patch). I assume you are referring to __atomic_store(__ATOMIC_RELEASE). I am wondering if rte_wmb() is required? My understanding is that x86 would require just a compiler barrier. So, should it be rte_smp_wmb()? __atomic_store(__ATOMIC_RELEASE) just adds a compiler barrier for x86. You right, it needs just a compiller barrier for x86 and a memory barrier instruction (dmb ?) for arm, so rte_smp_wmb() looks appropriate here as well as __atomic_store(__ATOMIC_RELEASE). Thanks for your suggestions. Point 1 & 2 make sense. For point 3, are you suggesting using rte_wmb() instead of __atomic_store()? rte_wmb() is DPDK made memory model. Maybe we can use __atomic_store() with 'RTE_USE_C11_MEM_MODEL=y', and use rte_wmb() otherwise? IMO, code becomes difficult to manage. 
Signed-off-by: Honnappa Nagarahalli mailto:honnappa.nagarahalli@arm.com Signed-off-by: Ruifeng Wang mailto:ruifeng.wang@arm.com --- lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++------- lib/librte_lpm/rte_lpm.h | 4 ++++ 2 files changed, 29 insertions(+), 7 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index 6b7b28a2e..6ec450a08 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -806,7 +806,8 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, /* Setting tbl24 entry in one go to avoid race * conditions */ - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); I don't see reordering issue here in this patch chunk. However direct assignment was translated to 2 MOV ops mov (%rdi,%rcx,4),%edx <-- get lpm->tbl24[i] and $0xff000000,%edx <-- clean .next_hop or %r9d,%edx <-- save new next_hop mov %edx,(%rdi,%rcx,4) <-- save an entry with new next_hop but old depth and valid bitfields mov %r11b,0x3(%rdi,%rcx,4) <-- save new depth and valid bitfields so agree with __atomic_store() here. continue; } @@ -1017,7 +1018,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + /* The tbl24 entry must be written only after the + * tbl8 entries are written. + */ + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } /* If valid entry but not extended calculate the index into Table8. */ else if (lpm->tbl24[tbl24_index].valid_group == 0) { @@ -1063,7 +1068,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + /* The tbl24 entry must be written only after the + * tbl8 entries are written. + */ + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } else { /* * If it is valid, extended entry calculate the index into tbl8. 
@@ -1391,6 +1400,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, /* Calculate the range and index into Table24. */ tbl24_range = depth_to_range(depth); tbl24_index = (ip_masked >> 8); + struct rte_lpm_tbl_entry zero_tbl24_entry = {0}; /* * Firstly check the sub_rule_index. A -1 indicates no replacement rule @@ -1405,7 +1415,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i].valid = INVALID; + __atomic_store(&lpm->tbl24[i], + &zero_tbl24_entry, __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1450,7 +1461,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1713,8 +1725,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, tbl8_recycle_index = tbl8_recycle_check_v1604(lpm->tbl8, tbl8_group_start); if (tbl8_recycle_index == -EINVAL) { - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. + */ lpm->tbl24[tbl24_index].valid = 0; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v1604(lpm->tbl8, tbl8_group_start); } else if (tbl8_recycle_index > -1) { /* Update tbl24 entry. */ @@ -1725,8 +1740,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, .depth = lpm->tbl8[tbl8_recycle_index].depth, }; - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. 
+ */ lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v1604(lpm->tbl8, tbl8_group_start); } #undef group_idx diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h index b886f54b4..6f5704c5c 100644 --- a/lib/librte_lpm/rte_lpm.h +++ b/lib/librte_lpm/rte_lpm.h @@ -354,6 +354,10 @@ rte_lpm_lookup(struct rte_lpm *lpm, uint32_t ip, uint32_t *next_hop) ptbl = (const uint32_t *)(&lpm->tbl24[tbl24_index]); tbl_entry = *ptbl; + /* Memory ordering is not required in lookup. Because dataflow + * dependency exists, compiler or HW won't be able to re-order + * the operations. + */ /* Copy tbl8 entry (only if needed) */ if (unlikely((tbl_entry & RTE_LPM_VALID_EXT_ENTRY_BITMASK) == RTE_LPM_VALID_EXT_ENTRY_BITMASK)) { -- Regards, Vladimir Regards, /Ruifeng -- Regards, Vladimir ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 2019-06-17 15:27 ` Ruifeng Wang (Arm Technology China) @ 2019-06-17 15:33 ` Medvedkin, Vladimir 0 siblings, 0 replies; 24+ messages in thread From: Medvedkin, Vladimir @ 2019-06-17 15:33 UTC (permalink / raw) To: Ruifeng Wang (Arm Technology China), Honnappa Nagarahalli, bruce.richardson Cc: dev, Gavin Hu (Arm Technology China), nd Hi Wang, On 17/06/2019 16:27, Ruifeng Wang (Arm Technology China) wrote: > Hi Vladimir, > > > From: Medvedkin, Vladimir <vladimir.medvedkin@intel.com> > Sent: Monday, June 10, 2019 23:23 > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com>; bruce.richardson@intel.com > Cc: dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com> > Subject: Re: [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 > > Hi Honnappa, Wang, > > On 05/06/2019 20:23, Honnappa Nagarahalli wrote: > > Hi Wang, > > On 05/06/2019 06:54, Ruifeng Wang wrote: > When a tbl8 group is getting attached to a tbl24 entry, lookup might > fail even though the entry is configured in the table. > > For ex: consider a LPM table configured with 10.10.10.1/24. > When a new entry 10.10.10.32/28 is being added, a new tbl8 group is > allocated and tbl24 entry is changed to point to the tbl8 group. If > the tbl24 entry is written without the tbl8 group entries updated, a > lookup on 10.10.10.9 will return failure. > > Correct memory orderings are required to ensure that the store to > tbl24 does not happen before the stores to tbl8 group entries > complete. > > The orderings have impact on LPM performance test. > On Arm A72 platform, delete operation has 2.7% degradation, while > add / lookup has no notable performance change. > On x86 E5 platform, add operation has 4.3% degradation, delete > operation has 2.2% - 10.2% degradation, lookup has no performance > change. 
> > I think it is possible to avoid add/del performance degradation > My understanding was that the degradation on x86, is happening because of the additional compiler barriers this patch introduces. For Arm platform the degradation is caused by the store-release memory barriers. > Just made some tests on skylake and sandy bridge. On the Skylake there is no performance degradation after applying this patchset. On the Sandybridge there is performance drop for rte_lpm_add() (from 460k cycles to 530k cycles in lpm_performance unit test). This is caused by 1 chunk of this patchset (add_depth_small_v1604() ). And it looks like after uninlining of this function performance get back to original 460k cycles it was before patch. > > [Ruifeng] Are you suggesting to un-inline add_depth_small_v1604()? I'm OK with such change since the function is too big and is not necessary to be inlined. That's right. Try to uninline it (and maybe add_depth_big(), it depends on your set of prefixes) and run your performance tests. > > > 1. Explicitly mark struct rte_lpm_tbl_entry 4-byte aligned > The ' rte_lpm_tbl_entry' is already 32b, shouldn't it be aligned on 4-byte boundary already? > > > 2. Cast value to uint32_t (uint16_t for 2.0 version) on memory write > > 3. Use rte_wmb() after memory write > (It would be good to point the locations in the patch). I assume you are referring to __atomic_store(__ATOMIC_RELEASE). I am wondering if rte_wmb() is required? My understanding is that x86 would require just a compiler barrier. So, should it be rte_smp_wmb()? __atomic_store(__ATOMIC_RELEASE) just adds a compiler barrier for x86. > You right, it needs just a compiller barrier for x86 and a memory barrier instruction (dmb ?) for arm, so rte_smp_wmb() looks appropriate here as well as __atomic_store(__ATOMIC_RELEASE). > > > > Thanks for your suggestions. > Point 1 & 2 make sense. > > For point 3, are you suggesting using rte_wmb() instead of __atomic_store()? 
> rte_wmb() is DPDK made memory model. Maybe we can use __atomic_store() > with 'RTE_USE_C11_MEM_MODEL=y', and use rte_wmb() otherwise? > IMO, code becomes difficult to manage. > > > > Signed-off-by: Honnappa Nagarahalli mailto:honnappa.nagarahalli@arm.com > Signed-off-by: Ruifeng Wang mailto:ruifeng.wang@arm.com > --- > lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++------- > lib/librte_lpm/rte_lpm.h | 4 ++++ > 2 files changed, 29 insertions(+), 7 deletions(-) > > diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c > index > 6b7b28a2e..6ec450a08 100644 > --- a/lib/librte_lpm/rte_lpm.c > +++ b/lib/librte_lpm/rte_lpm.c > @@ -806,7 +806,8 @@ add_depth_small_v1604(struct rte_lpm *lpm, > uint32_t ip, uint8_t depth, > /* Setting tbl24 entry in one go to avoid race > * conditions > */ > - lpm->tbl24[i] = new_tbl24_entry; > + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, > + __ATOMIC_RELEASE); > I don't see reordering issue here in this patch chunk. However direct assignment was translated to 2 MOV ops > mov (%rdi,%rcx,4),%edx <-- get lpm->tbl24[i] > and $0xff000000,%edx <-- clean .next_hop > or %r9d,%edx <-- save new next_hop > mov %edx,(%rdi,%rcx,4) <-- save an entry with new next_hop but old depth and valid bitfields > mov %r11b,0x3(%rdi,%rcx,4) <-- save new depth and valid bitfields > so agree with __atomic_store() here. > > > continue; > } > @@ -1017,7 +1018,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, > uint32_t ip_masked, uint8_t depth, > .depth = 0, > }; > > - lpm->tbl24[tbl24_index] = new_tbl24_entry; > + /* The tbl24 entry must be written only after the > + * tbl8 entries are written. > + */ > + __atomic_store(&lpm->tbl24[tbl24_index], > &new_tbl24_entry, > + __ATOMIC_RELEASE); > > } /* If valid entry but not extended calculate the index into Table8. 
*/ > else if (lpm->tbl24[tbl24_index].valid_group == 0) { @@ -1063,7 > +1068,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t > ip_masked, uint8_t depth, > .depth = 0, > }; > > - lpm->tbl24[tbl24_index] = new_tbl24_entry; > + /* The tbl24 entry must be written only after the > + * tbl8 entries are written. > + */ > + __atomic_store(&lpm->tbl24[tbl24_index], > &new_tbl24_entry, > + __ATOMIC_RELEASE); > > } else { /* > * If it is valid, extended entry calculate the index into tbl8. > @@ -1391,6 +1400,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, > uint32_t ip_masked, > /* Calculate the range and index into Table24. */ > tbl24_range = depth_to_range(depth); > tbl24_index = (ip_masked >> 8); > + struct rte_lpm_tbl_entry zero_tbl24_entry = {0}; > > /* > * Firstly check the sub_rule_index. A -1 indicates no > replacement rule @@ -1405,7 +1415,8 @@ > delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, > > if (lpm->tbl24[i].valid_group == 0 && > lpm->tbl24[i].depth <= depth) { > - lpm->tbl24[i].valid = INVALID; > + __atomic_store(&lpm->tbl24[i], > + &zero_tbl24_entry, > __ATOMIC_RELEASE); > } else if (lpm->tbl24[i].valid_group == 1) { > /* > * If TBL24 entry is extended, then there has > @@ -1450,7 +1461,8 > @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, > > if (lpm->tbl24[i].valid_group == 0 && > lpm->tbl24[i].depth <= depth) { > - lpm->tbl24[i] = new_tbl24_entry; > + __atomic_store(&lpm->tbl24[i], > &new_tbl24_entry, > + __ATOMIC_RELEASE); > } else if (lpm->tbl24[i].valid_group == 1) { > /* > * If TBL24 entry is extended, then there has > @@ -1713,8 > +1725,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t > ip_masked, > tbl8_recycle_index = tbl8_recycle_check_v1604(lpm->tbl8, > tbl8_group_start); > > if (tbl8_recycle_index == -EINVAL) { > - /* Set tbl24 before freeing tbl8 to avoid race condition. */ > + /* Set tbl24 before freeing tbl8 to avoid race condition. 
> + * Prevent the free of the tbl8 group from hoisting. > + */ > lpm->tbl24[tbl24_index].valid = 0; > + __atomic_thread_fence(__ATOMIC_RELEASE); > tbl8_free_v1604(lpm->tbl8, tbl8_group_start); > } else if (tbl8_recycle_index > -1) { > /* Update tbl24 entry. */ > @@ -1725,8 +1740,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, > uint32_t ip_masked, > .depth = lpm->tbl8[tbl8_recycle_index].depth, > }; > > - /* Set tbl24 before freeing tbl8 to avoid race condition. */ > + /* Set tbl24 before freeing tbl8 to avoid race condition. > + * Prevent the free of the tbl8 group from hoisting. > + */ > lpm->tbl24[tbl24_index] = new_tbl24_entry; > + __atomic_thread_fence(__ATOMIC_RELEASE); > tbl8_free_v1604(lpm->tbl8, tbl8_group_start); > } > #undef group_idx > diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h > index b886f54b4..6f5704c5c 100644 > --- a/lib/librte_lpm/rte_lpm.h > +++ b/lib/librte_lpm/rte_lpm.h > @@ -354,6 +354,10 @@ rte_lpm_lookup(struct rte_lpm *lpm, uint32_t > ip, > uint32_t *next_hop) > ptbl = (const uint32_t *)(&lpm->tbl24[tbl24_index]); > tbl_entry = *ptbl; > > + /* Memory ordering is not required in lookup. Because dataflow > + * dependency exists, compiler or HW won't be able to re-order > + * the operations. > + */ > /* Copy tbl8 entry (only if needed) */ > if (unlikely((tbl_entry & RTE_LPM_VALID_EXT_ENTRY_BITMASK) == > RTE_LPM_VALID_EXT_ENTRY_BITMASK)) { > > -- > Regards, > Vladimir > > Regards, > /Ruifeng -- Regards, Vladimir ^ permalink raw reply [flat|nested] 24+ messages in thread
* [dpdk-dev] [PATCH v5 0/6] LPM4 memory ordering changes 2019-06-05 5:54 [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang 2019-06-05 5:54 ` [dpdk-dev] [PATCH v1 2/2] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang 2019-06-05 10:50 ` [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 Medvedkin, Vladimir @ 2019-07-12 3:09 ` Ruifeng Wang 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 1/6] lib/lpm: not inline unnecessary functions Ruifeng Wang ` (5 more replies) 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Ruifeng Wang 3 siblings, 6 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-12 3:09 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang LPM4 uses the DIR24-8 method of routing info data organization. A routing rule with a prefix longer than 24 bits is stored in a tbl24 entry together with an associated tbl8 group. When a tbl8 group is getting attached to a tbl24 entry, lookup might fail even though the entry is configured in the table. For ex: consider an LPM table configured with 10.10.10.1/24. When a new entry 10.10.10.32/28 is being added, a new tbl8 group is allocated and the tbl24 entry is changed to point to the tbl8 group. If the tbl24 entry is written without the tbl8 group entries updated, a lookup on 10.10.10.9 will return failure. Correct memory orderings are required to ensure that the store to tbl24 does not happen before the stores to tbl8 group entries complete. When memory orderings are imposed, API performance could drop. However, a patch to de-inline unnecessary functions is added, and this helps to keep API performance. On the Arm A72 platform, the whole patch series in general has no notable performance impact. On the x86 E5-2640 platform, the whole patch series showed 2.6% performance improvement on add, and no impact on lookup or delete. 
v5: Fixed typo in commit message. Remove all other inlines. Change to use atomic_store in tbl8_alloc / tbl8_free. Merge multiple single field updates into one atomic store. v4: Changed alignment attribute parameter. Use atomic store to avoid partial update. v3: Use __rte_noinline to force no inline. v2: Fixed clang building issue by supplying alignment attribute. Ruifeng Wang (6): lib/lpm: not inline unnecessary functions lib/lpm: memory orderings to avoid race conditions for v1604 lib/lpm: memory orderings to avoid race conditions for v20 lib/lpm: use atomic store to avoid partial update lib/lpm: data update optimization for v1604 lib/lpm: data update optimization for v20 lib/librte_lpm/rte_lpm.c | 248 ++++++++++++++++++++++++++++----------- lib/librte_lpm/rte_lpm.h | 8 +- 2 files changed, 183 insertions(+), 73 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [dpdk-dev] [PATCH v5 1/6] lib/lpm: not inline unnecessary functions 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 0/6] LPM4 memory ordering changes Ruifeng Wang @ 2019-07-12 3:09 ` Ruifeng Wang 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 2/6] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang ` (4 subsequent siblings) 5 siblings, 0 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-12 3:09 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang Tests showed that the function inlining caused performance drop on some x86 platforms with the memory ordering patches applied. By force no-inline functions, the performance was better than before on x86 and no impact to arm64 platforms. Besides inlines of other functions are removed to let compiler to decide whether to inline. Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 46 ++++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index 6b7b28a2e..18efa496c 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -70,7 +70,7 @@ depth_to_mask(uint8_t depth) /* * Converts given depth value to its corresponding range value. */ -static inline uint32_t __attribute__((pure)) +static uint32_t __attribute__((pure)) depth_to_range(uint8_t depth) { VERIFY_DEPTH(depth); @@ -399,7 +399,7 @@ MAP_STATIC_SYMBOL(void rte_lpm_free(struct rte_lpm *lpm), * are stored in the rule table from 0 - 31. * NOTE: Valid range for depth parameter is 1 .. 32 inclusive. 
*/ -static inline int32_t +static int32_t rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, uint8_t next_hop) { @@ -471,7 +471,7 @@ rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, return rule_index; } -static inline int32_t +static int32_t rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, uint32_t next_hop) { @@ -547,7 +547,7 @@ rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, * Delete a rule from the rule table. * NOTE: Valid range for depth parameter is 1 .. 32 inclusive. */ -static inline void +static void rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t depth) { int i; @@ -570,7 +570,7 @@ rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t depth) lpm->rule_info[depth - 1].used_rules--; } -static inline void +static void rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth) { int i; @@ -597,7 +597,7 @@ rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth) * Finds a rule in rule table. * NOTE: Valid range for depth parameter is 1 .. 32 inclusive. */ -static inline int32_t +static int32_t rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth) { uint32_t rule_gindex, last_rule, rule_index; @@ -618,7 +618,7 @@ rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth) return -EINVAL; } -static inline int32_t +static int32_t rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth) { uint32_t rule_gindex, last_rule, rule_index; @@ -642,7 +642,7 @@ rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth) /* * Find, clean and allocate a tbl8. */ -static inline int32_t +static int32_t tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8) { uint32_t group_idx; /* tbl8 group index. 
*/ @@ -669,7 +669,7 @@ tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8) return -ENOSPC; } -static inline int32_t +static int32_t tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s) { uint32_t group_idx; /* tbl8 group index. */ @@ -695,21 +695,21 @@ tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s) return -ENOSPC; } -static inline void +static void tbl8_free_v20(struct rte_lpm_tbl_entry_v20 *tbl8, uint32_t tbl8_group_start) { /* Set tbl8 group invalid*/ tbl8[tbl8_group_start].valid_group = INVALID; } -static inline void +static void tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start) { /* Set tbl8 group invalid*/ tbl8[tbl8_group_start].valid_group = INVALID; } -static inline int32_t +static __rte_noinline int32_t add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, uint8_t next_hop) { @@ -777,7 +777,7 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, return 0; } -static inline int32_t +static __rte_noinline int32_t add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, uint32_t next_hop) { @@ -846,7 +846,7 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, return 0; } -static inline int32_t +static __rte_noinline int32_t add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, uint8_t next_hop) { @@ -971,7 +971,7 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, return 0; } -static inline int32_t +static __rte_noinline int32_t add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, uint32_t next_hop) { @@ -1244,7 +1244,7 @@ BIND_DEFAULT_SYMBOL(rte_lpm_is_rule_present, _v1604, 16.04); MAP_STATIC_SYMBOL(int rte_lpm_is_rule_present(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, uint32_t *next_hop), rte_lpm_is_rule_present_v1604); -static inline int32_t +static int32_t find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, 
uint8_t *sub_rule_depth) { @@ -1266,7 +1266,7 @@ find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, return -1; } -static inline int32_t +static int32_t find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, uint8_t *sub_rule_depth) { @@ -1288,7 +1288,7 @@ find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, return -1; } -static inline int32_t +static int32_t delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth) { @@ -1381,7 +1381,7 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, return 0; } -static inline int32_t +static int32_t delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth) { @@ -1483,7 +1483,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, * Return of value > -1 means tbl8 is in use but has all the same values and * thus can be recycled */ -static inline int32_t +static int32_t tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8, uint32_t tbl8_group_start) { @@ -1530,7 +1530,7 @@ tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8, return -EINVAL; } -static inline int32_t +static int32_t tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start) { @@ -1577,7 +1577,7 @@ tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8, return -EINVAL; } -static inline int32_t +static int32_t delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth) { @@ -1655,7 +1655,7 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, return 0; } -static inline int32_t +static int32_t delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth) { -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [dpdk-dev] [PATCH v5 2/6] lib/lpm: memory orderings to avoid race conditions for v1604 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 0/6] LPM4 memory ordering changes Ruifeng Wang 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 1/6] lib/lpm: not inline unnecessary functions Ruifeng Wang @ 2019-07-12 3:09 ` Ruifeng Wang 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 3/6] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang ` (3 subsequent siblings) 5 siblings, 0 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-12 3:09 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang When a tbl8 group is getting attached to a tbl24 entry, lookup might fail even though the entry is configured in the table. For ex: consider a LPM table configured with 10.10.10.1/24. When a new entry 10.10.10.32/28 is being added, a new tbl8 group is allocated and tbl24 entry is changed to point to the tbl8 group. If the tbl24 entry is written without the tbl8 group entries updated, a lookup on 10.10.10.9 will return failure. Correct memory orderings are required to ensure that the store to tbl24 does not happen before the stores to tbl8 group entries complete. 
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++------- lib/librte_lpm/rte_lpm.h | 4 ++++ 2 files changed, 29 insertions(+), 7 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index 18efa496c..396ad94e2 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -806,7 +806,8 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, /* Setting tbl24 entry in one go to avoid race * conditions */ - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); continue; } @@ -1017,7 +1018,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + /* The tbl24 entry must be written only after the + * tbl8 entries are written. + */ + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } /* If valid entry but not extended calculate the index into Table8. */ else if (lpm->tbl24[tbl24_index].valid_group == 0) { @@ -1063,7 +1068,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + /* The tbl24 entry must be written only after the + * tbl8 entries are written. + */ + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } else { /* * If it is valid, extended entry calculate the index into tbl8. @@ -1391,6 +1400,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, /* Calculate the range and index into Table24. */ tbl24_range = depth_to_range(depth); tbl24_index = (ip_masked >> 8); + struct rte_lpm_tbl_entry zero_tbl24_entry = {0}; /* * Firstly check the sub_rule_index. 
A -1 indicates no replacement rule @@ -1405,7 +1415,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i].valid = INVALID; + __atomic_store(&lpm->tbl24[i], + &zero_tbl24_entry, __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1450,7 +1461,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1713,8 +1725,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, tbl8_recycle_index = tbl8_recycle_check_v1604(lpm->tbl8, tbl8_group_start); if (tbl8_recycle_index == -EINVAL) { - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. + */ lpm->tbl24[tbl24_index].valid = 0; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v1604(lpm->tbl8, tbl8_group_start); } else if (tbl8_recycle_index > -1) { /* Update tbl24 entry. */ @@ -1725,8 +1740,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, .depth = lpm->tbl8[tbl8_recycle_index].depth, }; - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. 
+ */ lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v1604(lpm->tbl8, tbl8_group_start); } #undef group_idx diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h index b886f54b4..6f5704c5c 100644 --- a/lib/librte_lpm/rte_lpm.h +++ b/lib/librte_lpm/rte_lpm.h @@ -354,6 +354,10 @@ rte_lpm_lookup(struct rte_lpm *lpm, uint32_t ip, uint32_t *next_hop) ptbl = (const uint32_t *)(&lpm->tbl24[tbl24_index]); tbl_entry = *ptbl; + /* Memory ordering is not required in lookup. Because dataflow + * dependency exists, compiler or HW won't be able to re-order + * the operations. + */ /* Copy tbl8 entry (only if needed) */ if (unlikely((tbl_entry & RTE_LPM_VALID_EXT_ENTRY_BITMASK) == RTE_LPM_VALID_EXT_ENTRY_BITMASK)) { -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [dpdk-dev] [PATCH v5 3/6] lib/lpm: memory orderings to avoid race conditions for v20 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 0/6] LPM4 memory ordering changes Ruifeng Wang 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 1/6] lib/lpm: not inline unnecessary functions Ruifeng Wang 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 2/6] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang @ 2019-07-12 3:09 ` Ruifeng Wang 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 4/6] lib/lpm: use atomic store to avoid partial update Ruifeng Wang ` (2 subsequent siblings) 5 siblings, 0 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-12 3:09 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang When a tbl8 group is getting attached to a tbl24 entry, lookup might fail even though the entry is configured in the table. For ex: consider a LPM table configured with 10.10.10.1/24. When a new entry 10.10.10.32/28 is being added, a new tbl8 group is allocated and tbl24 entry is changed to point to the tbl8 group. If the tbl24 entry is written without the tbl8 group entries updated, a lookup on 10.10.10.9 will return failure. Correct memory orderings are required to ensure that the store to tbl24 does not happen before the stores to tbl8 group entries complete. Besides, explicit structure alignment is used to address an atomic operation build issue with older versions of clang. 
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++------- lib/librte_lpm/rte_lpm.h | 4 ++-- 2 files changed, 27 insertions(+), 9 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index 396ad94e2..0a94630db 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -737,7 +737,8 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, /* Setting tbl24 entry in one go to avoid race * conditions */ - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); continue; } @@ -892,7 +893,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } /* If valid entry but not extended calculate the index into Table8. */ else if (lpm->tbl24[tbl24_index].valid_group == 0) { @@ -938,7 +940,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } else { /* * If it is valid, extended entry calculate the index into tbl8. 
@@ -1320,7 +1323,15 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i].valid = INVALID; + struct rte_lpm_tbl_entry_v20 + zero_tbl24_entry = { + .valid = INVALID, + .depth = 0, + .valid_group = 0, + }; + zero_tbl24_entry.next_hop = 0; + __atomic_store(&lpm->tbl24[i], + &zero_tbl24_entry, __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1365,7 +1376,8 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1647,8 +1659,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, tbl8_recycle_index = tbl8_recycle_check_v20(lpm->tbl8, tbl8_group_start); if (tbl8_recycle_index == -EINVAL) { - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. + */ lpm->tbl24[tbl24_index].valid = 0; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v20(lpm->tbl8, tbl8_group_start); } else if (tbl8_recycle_index > -1) { /* Update tbl24 entry. */ @@ -1659,8 +1674,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, .depth = lpm->tbl8[tbl8_recycle_index].depth, }; - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. 
+ */ lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v20(lpm->tbl8, tbl8_group_start); } diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h index 6f5704c5c..906ec4483 100644 --- a/lib/librte_lpm/rte_lpm.h +++ b/lib/librte_lpm/rte_lpm.h @@ -88,7 +88,7 @@ struct rte_lpm_tbl_entry_v20 { */ uint8_t valid_group :1; uint8_t depth :6; /**< Rule depth. */ -}; +} __rte_aligned(sizeof(uint16_t)); __extension__ struct rte_lpm_tbl_entry { @@ -121,7 +121,7 @@ struct rte_lpm_tbl_entry_v20 { uint8_t group_idx; uint8_t next_hop; }; -}; +} __rte_aligned(sizeof(uint16_t)); __extension__ struct rte_lpm_tbl_entry { -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [dpdk-dev] [PATCH v5 4/6] lib/lpm: use atomic store to avoid partial update 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 0/6] LPM4 memory ordering changes Ruifeng Wang ` (2 preceding siblings ...) 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 3/6] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang @ 2019-07-12 3:09 ` Ruifeng Wang 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 5/6] lib/lpm: data update optimization for v1604 Ruifeng Wang 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 6/6] lib/lpm: data update optimization for v20 Ruifeng Wang 5 siblings, 0 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-12 3:09 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang The compiler could generate non-atomic stores when updating a whole table entry. This may cause an incorrect next hop to be returned if the byte with the valid flag is updated before the byte with the next hop. Changed to use an atomic store to update the whole table entry. Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 69 ++++++++++++++++++++++++++++++++-------- 1 file changed, 55 insertions(+), 14 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index 0a94630db..d35d64448 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -654,11 +654,19 @@ tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8) tbl8_entry = &tbl8[group_idx * RTE_LPM_TBL8_GROUP_NUM_ENTRIES]; /* If a free tbl8 group is found clean it and set as VALID. 
*/ if (!tbl8_entry->valid_group) { + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { + .valid = INVALID, + .depth = 0, + .valid_group = VALID, + }; + new_tbl8_entry.next_hop = 0; + memset(&tbl8_entry[0], 0, RTE_LPM_TBL8_GROUP_NUM_ENTRIES * sizeof(tbl8_entry[0])); - tbl8_entry->valid_group = VALID; + __atomic_store(tbl8_entry, &new_tbl8_entry, + __ATOMIC_RELAXED); /* Return group index for allocated tbl8 group. */ return group_idx; @@ -680,11 +688,19 @@ tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s) tbl8_entry = &tbl8[group_idx * RTE_LPM_TBL8_GROUP_NUM_ENTRIES]; /* If a free tbl8 group is found clean it and set as VALID. */ if (!tbl8_entry->valid_group) { + struct rte_lpm_tbl_entry new_tbl8_entry = { + .next_hop = 0, + .valid = INVALID, + .depth = 0, + .valid_group = VALID, + }; + memset(&tbl8_entry[0], 0, RTE_LPM_TBL8_GROUP_NUM_ENTRIES * sizeof(tbl8_entry[0])); - tbl8_entry->valid_group = VALID; + __atomic_store(tbl8_entry, &new_tbl8_entry, + __ATOMIC_RELAXED); /* Return group index for allocated tbl8 group. 
*/ return group_idx; @@ -699,14 +715,25 @@ static void tbl8_free_v20(struct rte_lpm_tbl_entry_v20 *tbl8, uint32_t tbl8_group_start) { /* Set tbl8 group invalid*/ - tbl8[tbl8_group_start].valid_group = INVALID; + struct rte_lpm_tbl_entry_v20 zero_tbl8_entry = { + .valid = INVALID, + .depth = 0, + .valid_group = INVALID, + }; + zero_tbl8_entry.next_hop = 0; + + __atomic_store(&tbl8[tbl8_group_start], &zero_tbl8_entry, + __ATOMIC_RELAXED); } static void tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start) { /* Set tbl8 group invalid*/ - tbl8[tbl8_group_start].valid_group = INVALID; + struct rte_lpm_tbl_entry zero_tbl8_entry = {0}; + + __atomic_store(&tbl8[tbl8_group_start], &zero_tbl8_entry, + __ATOMIC_RELAXED); } static __rte_noinline int32_t @@ -767,7 +794,9 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, * Setting tbl8 entry in one go to avoid * race conditions */ - lpm->tbl8[j] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[j], + &new_tbl8_entry, + __ATOMIC_RELAXED); continue; } @@ -837,7 +866,9 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, * Setting tbl8 entry in one go to avoid * race conditions */ - lpm->tbl8[j] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[j], + &new_tbl8_entry, + __ATOMIC_RELAXED); continue; } @@ -965,7 +996,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, * Setting tbl8 entry in one go to avoid race * condition */ - lpm->tbl8[i] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); continue; } @@ -1100,7 +1132,8 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, * Setting tbl8 entry in one go to avoid race * condition */ - lpm->tbl8[i] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); continue; } @@ -1393,7 +1426,9 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, RTE_LPM_TBL8_GROUP_NUM_ENTRIES); j++) { if 
(lpm->tbl8[j].depth <= depth) - lpm->tbl8[j] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[j], + &new_tbl8_entry, + __ATOMIC_RELAXED); } } } @@ -1490,7 +1525,9 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, RTE_LPM_TBL8_GROUP_NUM_ENTRIES); j++) { if (lpm->tbl8[j].depth <= depth) - lpm->tbl8[j] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[j], + &new_tbl8_entry, + __ATOMIC_RELAXED); } } } @@ -1646,7 +1683,8 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, */ for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { if (lpm->tbl8[i].depth <= depth) - lpm->tbl8[i] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } } @@ -1677,7 +1715,8 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, /* Set tbl24 before freeing tbl8 to avoid race condition. * Prevent the free of the tbl8 group from hoisting. */ - lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELAXED); __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v20(lpm->tbl8, tbl8_group_start); } @@ -1730,7 +1769,8 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, */ for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { if (lpm->tbl8[i].depth <= depth) - lpm->tbl8[i] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } } @@ -1761,7 +1801,8 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, /* Set tbl24 before freeing tbl8 to avoid race condition. * Prevent the free of the tbl8 group from hoisting. */ - lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELAXED); __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v1604(lpm->tbl8, tbl8_group_start); } -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [dpdk-dev] [PATCH v5 5/6] lib/lpm: data update optimization for v1604 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 0/6] LPM4 memory ordering changes Ruifeng Wang ` (3 preceding siblings ...) 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 4/6] lib/lpm: use atomic store to avoid partial update Ruifeng Wang @ 2019-07-12 3:09 ` Ruifeng Wang 2019-07-12 20:08 ` Honnappa Nagarahalli 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 6/6] lib/lpm: data update optimization for v20 Ruifeng Wang 5 siblings, 1 reply; 24+ messages in thread From: Ruifeng Wang @ 2019-07-12 3:09 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang The table entries were updated field by field. There were two issues: 1. bitwise operations are read-modify-write sequences, neither atomic nor efficient. 2. the above non-atomic operations can leave entries out of sync and inconsistent. This patch combines the fields into a one-go 32-bit entry update to avoid inconsistency and, as a bonus, save CPU cycles. Suggested-by: Gavin Hu <gavin.hu@arm.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 34 ++++++++++++++++++++++++---------- 1 file changed, 24 insertions(+), 10 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index d35d64448..d86248713 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -1035,9 +1035,14 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, /* Set tbl8 entry. 
*/ for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { - lpm->tbl8[i].depth = depth; - lpm->tbl8[i].next_hop = next_hop; - lpm->tbl8[i].valid = VALID; + struct rte_lpm_tbl_entry new_tbl8_entry = { + .valid = VALID, + .depth = depth, + .valid_group = lpm->tbl8[i].valid_group, + .next_hop = next_hop, + }; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } /* @@ -1075,19 +1080,28 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, /* Populate new tbl8 with tbl24 value. */ for (i = tbl8_group_start; i < tbl8_group_end; i++) { - lpm->tbl8[i].valid = VALID; - lpm->tbl8[i].depth = lpm->tbl24[tbl24_index].depth; - lpm->tbl8[i].next_hop = - lpm->tbl24[tbl24_index].next_hop; + struct rte_lpm_tbl_entry new_tbl8_entry = { + .valid = VALID, + .depth = lpm->tbl24[tbl24_index].depth, + .valid_group = lpm->tbl8[i].valid_group, + .next_hop = lpm->tbl24[tbl24_index].next_hop, + }; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } tbl8_index = tbl8_group_start + (ip_masked & 0xFF); /* Insert new rule into the tbl8 entry. */ for (i = tbl8_index; i < tbl8_index + tbl8_range; i++) { - lpm->tbl8[i].valid = VALID; - lpm->tbl8[i].depth = depth; - lpm->tbl8[i].next_hop = next_hop; + struct rte_lpm_tbl_entry new_tbl8_entry = { + .valid = VALID, + .depth = depth, + .valid_group = lpm->tbl8[i].valid_group, + .next_hop = next_hop, + }; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } /* -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [dpdk-dev] [PATCH v5 5/6] lib/lpm: data update optimization for v1604 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 5/6] lib/lpm: data update optimization for v1604 Ruifeng Wang @ 2019-07-12 20:08 ` Honnappa Nagarahalli 0 siblings, 0 replies; 24+ messages in thread From: Honnappa Nagarahalli @ 2019-07-12 20:08 UTC (permalink / raw) To: Ruifeng Wang (Arm Technology China), vladimir.medvedkin, bruce.richardson Cc: dev, Gavin Hu (Arm Technology China), nd, Ruifeng Wang (Arm Technology China), Honnappa Nagarahalli, nd IMO, this can be merged into 4/6. > -----Original Message----- > From: Ruifeng Wang <ruifeng.wang@arm.com> > Sent: Thursday, July 11, 2019 10:09 PM > To: vladimir.medvedkin@intel.com; bruce.richardson@intel.com > Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>; > Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com> > Subject: [PATCH v5 5/6] lib/lpm: data update optimization for v1604 > > The table entries were updated field by field. There were two issues: > 1. bitwise operations are read-modify-write sequences and not atomic, > nor efficient. > 2. the above non-atomic operations causes entries out of synchronization > and inconsistency. > This patch combines the fields into a one-go 32bit entry update to avoid > inconsistency and as a bonus save CPU cycles. > > Suggested-by: Gavin Hu <gavin.hu@arm.com> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> > Reviewed-by: Gavin Hu <gavin.hu@arm.com> > --- > lib/librte_lpm/rte_lpm.c | 34 ++++++++++++++++++++++++---------- > 1 file changed, 24 insertions(+), 10 deletions(-) > > diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index > d35d64448..d86248713 100644 > --- a/lib/librte_lpm/rte_lpm.c > +++ b/lib/librte_lpm/rte_lpm.c > @@ -1035,9 +1035,14 @@ add_depth_big_v1604(struct rte_lpm *lpm, > uint32_t ip_masked, uint8_t depth, > > /* Set tbl8 entry. 
*/ > for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { > - lpm->tbl8[i].depth = depth; > - lpm->tbl8[i].next_hop = next_hop; > - lpm->tbl8[i].valid = VALID; > + struct rte_lpm_tbl_entry new_tbl8_entry = { > + .valid = VALID, > + .depth = depth, > + .valid_group = lpm->tbl8[i].valid_group, > + .next_hop = next_hop, > + }; > + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, > + __ATOMIC_RELAXED); > } > > /* > @@ -1075,19 +1080,28 @@ add_depth_big_v1604(struct rte_lpm *lpm, > uint32_t ip_masked, uint8_t depth, > > /* Populate new tbl8 with tbl24 value. */ > for (i = tbl8_group_start; i < tbl8_group_end; i++) { > - lpm->tbl8[i].valid = VALID; > - lpm->tbl8[i].depth = lpm->tbl24[tbl24_index].depth; > - lpm->tbl8[i].next_hop = > - lpm->tbl24[tbl24_index].next_hop; > + struct rte_lpm_tbl_entry new_tbl8_entry = { > + .valid = VALID, > + .depth = lpm->tbl24[tbl24_index].depth, > + .valid_group = lpm->tbl8[i].valid_group, > + .next_hop = lpm- > >tbl24[tbl24_index].next_hop, > + }; > + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, > + __ATOMIC_RELAXED); > } > > tbl8_index = tbl8_group_start + (ip_masked & 0xFF); > > /* Insert new rule into the tbl8 entry. */ > for (i = tbl8_index; i < tbl8_index + tbl8_range; i++) { > - lpm->tbl8[i].valid = VALID; > - lpm->tbl8[i].depth = depth; > - lpm->tbl8[i].next_hop = next_hop; > + struct rte_lpm_tbl_entry new_tbl8_entry = { > + .valid = VALID, > + .depth = depth, > + .valid_group = lpm->tbl8[i].valid_group, > + .next_hop = next_hop, > + }; > + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, > + __ATOMIC_RELAXED); > } > > /* > -- > 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [dpdk-dev] [PATCH v5 6/6] lib/lpm: data update optimization for v20 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 0/6] LPM4 memory ordering changes Ruifeng Wang ` (4 preceding siblings ...) 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 5/6] lib/lpm: data update optimization for v1604 Ruifeng Wang @ 2019-07-12 3:09 ` Ruifeng Wang 2019-07-12 20:09 ` Honnappa Nagarahalli 5 siblings, 1 reply; 24+ messages in thread From: Ruifeng Wang @ 2019-07-12 3:09 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang The table entries were updated field by field. There were two issues: 1. bitwise operations are read-modify-write sequences, neither atomic nor efficient. 2. the above non-atomic operations can leave entries out of sync and inconsistent. This patch combines the fields into a one-go 32-bit entry update to avoid inconsistency and, as a bonus, save CPU cycles. Suggested-by: Gavin Hu <gavin.hu@arm.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 35 +++++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index d86248713..95e2f75aa 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -906,9 +906,14 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, /* Set tbl8 entry. 
*/ for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { - lpm->tbl8[i].depth = depth; - lpm->tbl8[i].next_hop = next_hop; - lpm->tbl8[i].valid = VALID; + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { + .valid = VALID, + .depth = depth, + .valid_group = lpm->tbl8[i].valid_group, + }; + new_tbl8_entry.next_hop = next_hop; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } /* @@ -943,19 +948,29 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, /* Populate new tbl8 with tbl24 value. */ for (i = tbl8_group_start; i < tbl8_group_end; i++) { - lpm->tbl8[i].valid = VALID; - lpm->tbl8[i].depth = lpm->tbl24[tbl24_index].depth; - lpm->tbl8[i].next_hop = - lpm->tbl24[tbl24_index].next_hop; + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { + .valid = VALID, + .depth = lpm->tbl24[tbl24_index].depth, + .valid_group = lpm->tbl8[i].valid_group, + }; + new_tbl8_entry.next_hop = + lpm->tbl24[tbl24_index].next_hop; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } tbl8_index = tbl8_group_start + (ip_masked & 0xFF); /* Insert new rule into the tbl8 entry. */ for (i = tbl8_index; i < tbl8_index + tbl8_range; i++) { - lpm->tbl8[i].valid = VALID; - lpm->tbl8[i].depth = depth; - lpm->tbl8[i].next_hop = next_hop; + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { + .valid = VALID, + .depth = depth, + .valid_group = lpm->tbl8[i].valid_group, + }; + new_tbl8_entry.next_hop = next_hop; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } /* -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [dpdk-dev] [PATCH v5 6/6] lib/lpm: data update optimization for v20 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 6/6] lib/lpm: data update optimization for v20 Ruifeng Wang @ 2019-07-12 20:09 ` Honnappa Nagarahalli 0 siblings, 0 replies; 24+ messages in thread From: Honnappa Nagarahalli @ 2019-07-12 20:09 UTC (permalink / raw) To: Ruifeng Wang (Arm Technology China), vladimir.medvedkin, bruce.richardson Cc: dev, Gavin Hu (Arm Technology China), nd, Ruifeng Wang (Arm Technology China), Honnappa Nagarahalli, nd Similar to 5/6, this patch can also be merged into 4/6. > -----Original Message----- > From: Ruifeng Wang <ruifeng.wang@arm.com> > Sent: Thursday, July 11, 2019 10:09 PM > To: vladimir.medvedkin@intel.com; bruce.richardson@intel.com > Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; > Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>; > Ruifeng Wang (Arm Technology China) <Ruifeng.Wang@arm.com> > Subject: [PATCH v5 6/6] lib/lpm: data update optimization for v20 > > The table entries were updated field by field. There were two issues: > 1. bitwise operations are read-modify-write sequences and not atomic, > nor efficient. > 2. the above non-atomic operations causes entries out of synchronization > and inconsistency. > This patch combines the fields into a one-go 32bit entry update to avoid > inconsistency and as a bonus save CPU cycles. > > Suggested-by: Gavin Hu <gavin.hu@arm.com> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> > Reviewed-by: Gavin Hu <gavin.hu@arm.com> > --- > lib/librte_lpm/rte_lpm.c | 35 +++++++++++++++++++++++++---------- > 1 file changed, 25 insertions(+), 10 deletions(-) > > diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index > d86248713..95e2f75aa 100644 > --- a/lib/librte_lpm/rte_lpm.c > +++ b/lib/librte_lpm/rte_lpm.c > @@ -906,9 +906,14 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, > uint32_t ip_masked, uint8_t depth, > > /* Set tbl8 entry. 
*/ > for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { > - lpm->tbl8[i].depth = depth; > - lpm->tbl8[i].next_hop = next_hop; > - lpm->tbl8[i].valid = VALID; > + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { > + .valid = VALID, > + .depth = depth, > + .valid_group = lpm->tbl8[i].valid_group, > + }; > + new_tbl8_entry.next_hop = next_hop; > + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, > + __ATOMIC_RELAXED); > } > > /* > @@ -943,19 +948,29 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, > uint32_t ip_masked, uint8_t depth, > > /* Populate new tbl8 with tbl24 value. */ > for (i = tbl8_group_start; i < tbl8_group_end; i++) { > - lpm->tbl8[i].valid = VALID; > - lpm->tbl8[i].depth = lpm->tbl24[tbl24_index].depth; > - lpm->tbl8[i].next_hop = > - lpm->tbl24[tbl24_index].next_hop; > + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { > + .valid = VALID, > + .depth = lpm->tbl24[tbl24_index].depth, > + .valid_group = lpm->tbl8[i].valid_group, > + }; > + new_tbl8_entry.next_hop = > + lpm->tbl24[tbl24_index].next_hop; > + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, > + __ATOMIC_RELAXED); > } > > tbl8_index = tbl8_group_start + (ip_masked & 0xFF); > > /* Insert new rule into the tbl8 entry. */ > for (i = tbl8_index; i < tbl8_index + tbl8_range; i++) { > - lpm->tbl8[i].valid = VALID; > - lpm->tbl8[i].depth = depth; > - lpm->tbl8[i].next_hop = next_hop; > + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { > + .valid = VALID, > + .depth = depth, > + .valid_group = lpm->tbl8[i].valid_group, > + }; > + new_tbl8_entry.next_hop = next_hop; > + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, > + __ATOMIC_RELAXED); > } > > /* > -- > 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
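The hazard this patch removes can be sketched in a few lines of C. The struct layout below is a simplified stand-in for `rte_lpm_tbl_entry_v20`, not the exact DPDK definition: field-by-field writes to a bit-field entry compile to separate read-modify-write sequences, so a concurrent reader may observe `valid` set while `next_hop` still holds the old value; building the entry locally and storing it in one go closes that window.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative two-byte entry, modeled loosely on rte_lpm_tbl_entry_v20. */
struct entry_v20 {
	uint8_t next_hop;
	uint8_t valid       :1;
	uint8_t valid_group :1;
	uint8_t depth       :6;
} __attribute__((aligned(sizeof(uint16_t))));

/* Racy: three separate stores; 'valid' may become visible first. */
static void set_entry_field_by_field(struct entry_v20 *e, uint8_t nh,
				     uint8_t depth)
{
	e->depth = depth;	/* read-modify-write #1 */
	e->next_hop = nh;	/* separate byte store */
	e->valid = 1;		/* read-modify-write #2 */
}

/* Safe: one 16-bit atomic store publishes all fields together. */
static void set_entry_one_go(struct entry_v20 *e, uint8_t nh, uint8_t depth)
{
	struct entry_v20 ne = {
		.next_hop = nh,
		.valid = 1,
		.valid_group = e->valid_group,	/* preserve group flag */
		.depth = depth,
	};
	__atomic_store(e, &ne, __ATOMIC_RELAXED);
}
```

Both functions leave the same final value in the entry; the difference is only in what intermediate states another CPU may observe.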
* [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes 2019-06-05 5:54 [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang ` (2 preceding siblings ...) 2019-07-12 3:09 ` [dpdk-dev] [PATCH v5 0/6] LPM4 memory ordering changes Ruifeng Wang @ 2019-07-18 6:22 ` Ruifeng Wang 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 1/4] lib/lpm: not inline unnecessary functions Ruifeng Wang ` (4 more replies) 3 siblings, 5 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-18 6:22 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang LPM4 uses the DIR24-8 method of organizing routing data. Routing rules with prefixes longer than 24 bits are stored in a tbl8 group, referenced by the associated tbl24 entry. When a tbl8 group is being attached to a tbl24 entry, lookup might fail even though the entry is configured in the table. For example, consider an LPM table configured with 10.10.10.1/24. When a new entry 10.10.10.32/28 is being added, a new tbl8 group is allocated and the tbl24 entry is changed to point to the tbl8 group. If the tbl24 entry is written before the tbl8 group entries are updated, a lookup on 10.10.10.9 will return failure. Correct memory orderings are required to ensure that the store to tbl24 does not happen before the stores to the tbl8 group entries complete. When memory orderings are imposed, API performance could drop. However, a patch that de-inlines unnecessary functions is included, which helps keep API performance. On the Arm A72 platform, the whole patch series in general has no notable performance impact. On the x86 E5-2640 platform, the whole patch series showed 2.6% performance improvement on add, and no impact on lookup or delete. v6: Merge multiple relaxed atomic store patches into one. v5: Fix typo in commit message. Remove all other inlines. Change to use atomic_store in tbl8_alloc / tbl8_free. 
Merge multiple single field updates into one atomic store. v4: Change alignment attribute parameter. Use atomic store to avoid partial update. v3: Use __rte_noinline to force no inline. v2: Fix clang building issue by supplying alignment attribute. Ruifeng Wang (4): lib/lpm: not inline unnecessary functions lib/lpm: memory orderings to avoid race conditions for v1604 lib/lpm: memory orderings to avoid race conditions for v20 lib/lpm: use atomic store to avoid partial update lib/librte_lpm/rte_lpm.c | 248 ++++++++++++++++++++++++++++----------- lib/librte_lpm/rte_lpm.h | 8 +- 2 files changed, 183 insertions(+), 73 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
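The writer-side ordering the series enforces can be sketched as follows. The entry encoding here is a simplified 32-bit stand-in (low 24 bits: next hop or tbl8 group index; bit 24: valid; bit 25: extended/valid_group), not the exact DPDK layout: every entry of the new tbl8 group is filled first, and only then is the tbl24 entry published with a release store, so a reader that observes the new tbl24 value is guaranteed to observe the completed group.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_VALID	(1u << 24)	/* illustrative bit assignments */
#define ENTRY_EXT	(1u << 25)
#define TBL8_GROUP_SIZE	256

static void attach_tbl8_group(uint32_t *tbl24_slot, uint32_t *tbl8,
			      uint32_t group_idx, uint32_t next_hop)
{
	uint32_t *group = &tbl8[group_idx * TBL8_GROUP_SIZE];
	uint32_t filled = ENTRY_VALID | (next_hop & 0xFFFFFFu);
	int i;

	/* Fill the group first; relaxed is enough, the release below
	 * orders these stores before the publication. */
	for (i = 0; i < TBL8_GROUP_SIZE; i++)
		__atomic_store_n(&group[i], filled, __ATOMIC_RELAXED);

	/* Publish: the release store must not be reordered before the
	 * loop above, closing the lookup-failure window. */
	__atomic_store_n(tbl24_slot,
			 ENTRY_VALID | ENTRY_EXT | group_idx,
			 __ATOMIC_RELEASE);
}
```

No acquire barrier is needed on the reader side because the tbl8 load depends on the value loaded from tbl24, as the 2/4 patch notes.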
* [dpdk-dev] [PATCH v6 1/4] lib/lpm: not inline unnecessary functions 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Ruifeng Wang @ 2019-07-18 6:22 ` Ruifeng Wang 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 2/4] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang ` (3 subsequent siblings) 4 siblings, 0 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-18 6:22 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang Tests showed that function inlining caused a performance drop on some x86 platforms once the memory ordering patches were applied. By forcing these functions not to be inlined, performance on x86 was better than before, with no impact on arm64 platforms. Besides, the inline qualifiers on other functions are removed to let the compiler decide whether to inline them. Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 46 ++++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index 6b7b28a2e..18efa496c 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -70,7 +70,7 @@ depth_to_mask(uint8_t depth) /* * Converts given depth value to its corresponding range value. */ -static inline uint32_t __attribute__((pure)) +static uint32_t __attribute__((pure)) depth_to_range(uint8_t depth) { VERIFY_DEPTH(depth); @@ -399,7 +399,7 @@ MAP_STATIC_SYMBOL(void rte_lpm_free(struct rte_lpm *lpm), * are stored in the rule table from 0 - 31. * NOTE: Valid range for depth parameter is 1 .. 32 inclusive. 
*/ -static inline int32_t +static int32_t rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, uint8_t next_hop) { @@ -471,7 +471,7 @@ rule_add_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, return rule_index; } -static inline int32_t +static int32_t rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, uint32_t next_hop) { @@ -547,7 +547,7 @@ rule_add_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, * Delete a rule from the rule table. * NOTE: Valid range for depth parameter is 1 .. 32 inclusive. */ -static inline void +static void rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t depth) { int i; @@ -570,7 +570,7 @@ rule_delete_v20(struct rte_lpm_v20 *lpm, int32_t rule_index, uint8_t depth) lpm->rule_info[depth - 1].used_rules--; } -static inline void +static void rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth) { int i; @@ -597,7 +597,7 @@ rule_delete_v1604(struct rte_lpm *lpm, int32_t rule_index, uint8_t depth) * Finds a rule in rule table. * NOTE: Valid range for depth parameter is 1 .. 32 inclusive. */ -static inline int32_t +static int32_t rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth) { uint32_t rule_gindex, last_rule, rule_index; @@ -618,7 +618,7 @@ rule_find_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth) return -EINVAL; } -static inline int32_t +static int32_t rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth) { uint32_t rule_gindex, last_rule, rule_index; @@ -642,7 +642,7 @@ rule_find_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth) /* * Find, clean and allocate a tbl8. */ -static inline int32_t +static int32_t tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8) { uint32_t group_idx; /* tbl8 group index. 
*/ @@ -669,7 +669,7 @@ tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8) return -ENOSPC; } -static inline int32_t +static int32_t tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s) { uint32_t group_idx; /* tbl8 group index. */ @@ -695,21 +695,21 @@ tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s) return -ENOSPC; } -static inline void +static void tbl8_free_v20(struct rte_lpm_tbl_entry_v20 *tbl8, uint32_t tbl8_group_start) { /* Set tbl8 group invalid*/ tbl8[tbl8_group_start].valid_group = INVALID; } -static inline void +static void tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start) { /* Set tbl8 group invalid*/ tbl8[tbl8_group_start].valid_group = INVALID; } -static inline int32_t +static __rte_noinline int32_t add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, uint8_t next_hop) { @@ -777,7 +777,7 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, return 0; } -static inline int32_t +static __rte_noinline int32_t add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, uint32_t next_hop) { @@ -846,7 +846,7 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, return 0; } -static inline int32_t +static __rte_noinline int32_t add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, uint8_t next_hop) { @@ -971,7 +971,7 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, return 0; } -static inline int32_t +static __rte_noinline int32_t add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, uint32_t next_hop) { @@ -1244,7 +1244,7 @@ BIND_DEFAULT_SYMBOL(rte_lpm_is_rule_present, _v1604, 16.04); MAP_STATIC_SYMBOL(int rte_lpm_is_rule_present(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, uint32_t *next_hop), rte_lpm_is_rule_present_v1604); -static inline int32_t +static int32_t find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, 
uint8_t *sub_rule_depth) { @@ -1266,7 +1266,7 @@ find_previous_rule_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, return -1; } -static inline int32_t +static int32_t find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, uint8_t *sub_rule_depth) { @@ -1288,7 +1288,7 @@ find_previous_rule_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, return -1; } -static inline int32_t +static int32_t delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth) { @@ -1381,7 +1381,7 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, return 0; } -static inline int32_t +static int32_t delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth) { @@ -1483,7 +1483,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, * Return of value > -1 means tbl8 is in use but has all the same values and * thus can be recycled */ -static inline int32_t +static int32_t tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8, uint32_t tbl8_group_start) { @@ -1530,7 +1530,7 @@ tbl8_recycle_check_v20(struct rte_lpm_tbl_entry_v20 *tbl8, return -EINVAL; } -static inline int32_t +static int32_t tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start) { @@ -1577,7 +1577,7 @@ tbl8_recycle_check_v1604(struct rte_lpm_tbl_entry *tbl8, return -EINVAL; } -static inline int32_t +static int32_t delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth) { @@ -1655,7 +1655,7 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, return 0; } -static inline int32_t +static int32_t delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, int32_t sub_rule_index, uint8_t sub_rule_depth) { -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [dpdk-dev] [PATCH v6 2/4] lib/lpm: memory orderings to avoid race conditions for v1604 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Ruifeng Wang 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 1/4] lib/lpm: not inline unnecessary functions Ruifeng Wang @ 2019-07-18 6:22 ` Ruifeng Wang 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 3/4] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang ` (2 subsequent siblings) 4 siblings, 0 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-18 6:22 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang When a tbl8 group is being attached to a tbl24 entry, lookup might fail even though the entry is configured in the table. For example, consider an LPM table configured with 10.10.10.1/24. When a new entry 10.10.10.32/28 is being added, a new tbl8 group is allocated and the tbl24 entry is changed to point to the tbl8 group. If the tbl24 entry is written before the tbl8 group entries are updated, a lookup on 10.10.10.9 will return failure. Correct memory orderings are required to ensure that the store to tbl24 does not happen before the stores to the tbl8 group entries complete. In general, the ordering patches have no notable impact in LPM performance tests on either the Arm A72 platform or the x86 E5 platform. 
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++------- lib/librte_lpm/rte_lpm.h | 4 ++++ 2 files changed, 29 insertions(+), 7 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index 18efa496c..396ad94e2 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -806,7 +806,8 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, /* Setting tbl24 entry in one go to avoid race * conditions */ - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); continue; } @@ -1017,7 +1018,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + /* The tbl24 entry must be written only after the + * tbl8 entries are written. + */ + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } /* If valid entry but not extended calculate the index into Table8. */ else if (lpm->tbl24[tbl24_index].valid_group == 0) { @@ -1063,7 +1068,11 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + /* The tbl24 entry must be written only after the + * tbl8 entries are written. + */ + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } else { /* * If it is valid, extended entry calculate the index into tbl8. @@ -1391,6 +1400,7 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, /* Calculate the range and index into Table24. */ tbl24_range = depth_to_range(depth); tbl24_index = (ip_masked >> 8); + struct rte_lpm_tbl_entry zero_tbl24_entry = {0}; /* * Firstly check the sub_rule_index. 
A -1 indicates no replacement rule @@ -1405,7 +1415,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i].valid = INVALID; + __atomic_store(&lpm->tbl24[i], + &zero_tbl24_entry, __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1450,7 +1461,8 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1713,8 +1725,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, tbl8_recycle_index = tbl8_recycle_check_v1604(lpm->tbl8, tbl8_group_start); if (tbl8_recycle_index == -EINVAL) { - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. + */ lpm->tbl24[tbl24_index].valid = 0; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v1604(lpm->tbl8, tbl8_group_start); } else if (tbl8_recycle_index > -1) { /* Update tbl24 entry. */ @@ -1725,8 +1740,11 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, .depth = lpm->tbl8[tbl8_recycle_index].depth, }; - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. 
+ */ lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v1604(lpm->tbl8, tbl8_group_start); } #undef group_idx diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h index b886f54b4..6f5704c5c 100644 --- a/lib/librte_lpm/rte_lpm.h +++ b/lib/librte_lpm/rte_lpm.h @@ -354,6 +354,10 @@ rte_lpm_lookup(struct rte_lpm *lpm, uint32_t ip, uint32_t *next_hop) ptbl = (const uint32_t *)(&lpm->tbl24[tbl24_index]); tbl_entry = *ptbl; + /* Memory ordering is not required in lookup. Because dataflow + * dependency exists, compiler or HW won't be able to re-order + * the operations. + */ /* Copy tbl8 entry (only if needed) */ if (unlikely((tbl_entry & RTE_LPM_VALID_EXT_ENTRY_BITMASK) == RTE_LPM_VALID_EXT_ENTRY_BITMASK)) { -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
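The lookup-side comment added by this patch can be illustrated with a small sketch. Masks and entry encoding below are illustrative stand-ins, not the exact DPDK definitions: the tbl8 index is computed from the value loaded from tbl24, so the tbl8 load carries an address dependency on the tbl24 load, and neither the compiler nor the hardware may reorder the two loads; plain loads therefore suffice on the read side.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_VALID	(1u << 24)	/* illustrative bit assignments */
#define ENTRY_EXT	(1u << 25)
#define VALID_EXT_MASK	(ENTRY_VALID | ENTRY_EXT)
#define TBL8_GROUP_SIZE	256

static uint32_t lookup_sketch(const uint32_t *tbl24, const uint32_t *tbl8,
			      uint32_t ip)
{
	uint32_t entry = tbl24[ip >> 8];

	if ((entry & VALID_EXT_MASK) == VALID_EXT_MASK) {
		/* 'idx' depends on 'entry': dataflow orders the loads. */
		uint32_t idx = (entry & 0xFFFFFFu) * TBL8_GROUP_SIZE +
			       (ip & 0xFFu);
		entry = tbl8[idx];
	}
	return entry & 0xFFFFFFu;	/* next hop */
}
```

This pairs with the writer's release store to tbl24: once the extended entry is visible, the dependent load can only see the already-filled tbl8 group.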
* [dpdk-dev] [PATCH v6 3/4] lib/lpm: memory orderings to avoid race conditions for v20 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Ruifeng Wang 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 1/4] lib/lpm: not inline unnecessary functions Ruifeng Wang 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 2/4] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang @ 2019-07-18 6:22 ` Ruifeng Wang 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 4/4] lib/lpm: use atomic store to avoid partial update Ruifeng Wang 2019-07-18 14:00 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Medvedkin, Vladimir 4 siblings, 0 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-18 6:22 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang When a tbl8 group is being attached to a tbl24 entry, lookup might fail even though the entry is configured in the table. For example, consider an LPM table configured with 10.10.10.1/24. When a new entry 10.10.10.32/28 is being added, a new tbl8 group is allocated and the tbl24 entry is changed to point to the tbl8 group. If the tbl24 entry is written before the tbl8 group entries are updated, a lookup on 10.10.10.9 will return failure. Correct memory orderings are required to ensure that the store to tbl24 does not happen before the stores to the tbl8 group entries complete. Besides, explicit structure alignment is used to address an atomic operation build issue with older versions of clang. 
Suggested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 32 +++++++++++++++++++++++++------- lib/librte_lpm/rte_lpm.h | 4 ++-- 2 files changed, 27 insertions(+), 9 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index 396ad94e2..0a94630db 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -737,7 +737,8 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, /* Setting tbl24 entry in one go to avoid race * conditions */ - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); continue; } @@ -892,7 +893,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } /* If valid entry but not extended calculate the index into Table8. */ else if (lpm->tbl24[tbl24_index].valid_group == 0) { @@ -938,7 +940,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, .depth = 0, }; - lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELEASE); } else { /* * If it is valid, extended entry calculate the index into tbl8. 
@@ -1320,7 +1323,15 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i].valid = INVALID; + struct rte_lpm_tbl_entry_v20 + zero_tbl24_entry = { + .valid = INVALID, + .depth = 0, + .valid_group = 0, + }; + zero_tbl24_entry.next_hop = 0; + __atomic_store(&lpm->tbl24[i], + &zero_tbl24_entry, __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1365,7 +1376,8 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, if (lpm->tbl24[i].valid_group == 0 && lpm->tbl24[i].depth <= depth) { - lpm->tbl24[i] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[i], &new_tbl24_entry, + __ATOMIC_RELEASE); } else if (lpm->tbl24[i].valid_group == 1) { /* * If TBL24 entry is extended, then there has @@ -1647,8 +1659,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, tbl8_recycle_index = tbl8_recycle_check_v20(lpm->tbl8, tbl8_group_start); if (tbl8_recycle_index == -EINVAL) { - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. + */ lpm->tbl24[tbl24_index].valid = 0; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v20(lpm->tbl8, tbl8_group_start); } else if (tbl8_recycle_index > -1) { /* Update tbl24 entry. */ @@ -1659,8 +1674,11 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, .depth = lpm->tbl8[tbl8_recycle_index].depth, }; - /* Set tbl24 before freeing tbl8 to avoid race condition. */ + /* Set tbl24 before freeing tbl8 to avoid race condition. + * Prevent the free of the tbl8 group from hoisting. 
+ */ lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v20(lpm->tbl8, tbl8_group_start); } diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h index 6f5704c5c..906ec4483 100644 --- a/lib/librte_lpm/rte_lpm.h +++ b/lib/librte_lpm/rte_lpm.h @@ -88,7 +88,7 @@ struct rte_lpm_tbl_entry_v20 { */ uint8_t valid_group :1; uint8_t depth :6; /**< Rule depth. */ -}; +} __rte_aligned(sizeof(uint16_t)); __extension__ struct rte_lpm_tbl_entry { @@ -121,7 +121,7 @@ struct rte_lpm_tbl_entry_v20 { uint8_t group_idx; uint8_t next_hop; }; -}; +} __rte_aligned(sizeof(uint16_t)); __extension__ struct rte_lpm_tbl_entry { -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
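The `__rte_aligned(sizeof(uint16_t))` added to the v20 entry types deserves a short illustration. The field layout below is a simplified stand-in for `rte_lpm_tbl_entry_v20`, not the exact DPDK definition: the entry is two bytes of bit-fields, and without an explicit alignment attribute older clang versions reject the generic `__atomic_store` on it, since the object could in principle be unaligned; with 2-byte alignment the store lowers to a single aligned 16-bit store.

```c
#include <assert.h>
#include <stdint.h>

struct entry16 {
	uint8_t next_hop;
	uint8_t valid       :1;
	uint8_t valid_group :1;
	uint8_t depth       :6;
} __attribute__((aligned(sizeof(uint16_t))));

/* Both properties are what make the single-store update possible. */
_Static_assert(sizeof(struct entry16) == sizeof(uint16_t),
	       "entry must fit in one 16-bit store");
_Static_assert(_Alignof(struct entry16) == sizeof(uint16_t),
	       "entry must be 16-bit aligned");

static void store_entry(struct entry16 *dst, struct entry16 src)
{
	/* Lowers to one aligned 16-bit store on common targets. */
	__atomic_store(dst, &src, __ATOMIC_RELEASE);
}
```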
* [dpdk-dev] [PATCH v6 4/4] lib/lpm: use atomic store to avoid partial update 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Ruifeng Wang ` (2 preceding siblings ...) 2019-07-18 6:22 ` [dpdk-dev] [PATCH v6 3/4] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang @ 2019-07-18 6:22 ` Ruifeng Wang 2019-07-18 14:00 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Medvedkin, Vladimir 4 siblings, 0 replies; 24+ messages in thread From: Ruifeng Wang @ 2019-07-18 6:22 UTC (permalink / raw) To: vladimir.medvedkin, bruce.richardson Cc: dev, honnappa.nagarahalli, gavin.hu, nd, Ruifeng Wang The compiler could generate non-atomic stores for whole table entry updates. This may cause an incorrect next hop to be returned if the byte holding the valid flag is updated before the byte holding the next hop. Besides, field-by-field updates of table entries follow read-modify-write sequences; the operations are neither atomic nor efficient, and could leave entries out of synchronization. Changed to use atomic stores to update whole table entries. Suggested-by: Medvedkin Vladimir <vladimir.medvedkin@intel.com> Suggested-by: Gavin Hu <gavin.hu@arm.com> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Gavin Hu <gavin.hu@arm.com> --- lib/librte_lpm/rte_lpm.c | 138 +++++++++++++++++++++++++++++---------- 1 file changed, 104 insertions(+), 34 deletions(-) diff --git a/lib/librte_lpm/rte_lpm.c b/lib/librte_lpm/rte_lpm.c index 0a94630db..95e2f75aa 100644 --- a/lib/librte_lpm/rte_lpm.c +++ b/lib/librte_lpm/rte_lpm.c @@ -654,11 +654,19 @@ tbl8_alloc_v20(struct rte_lpm_tbl_entry_v20 *tbl8) tbl8_entry = &tbl8[group_idx * RTE_LPM_TBL8_GROUP_NUM_ENTRIES]; /* If a free tbl8 group is found clean it and set as VALID. 
*/ if (!tbl8_entry->valid_group) { + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { + .valid = INVALID, + .depth = 0, + .valid_group = VALID, + }; + new_tbl8_entry.next_hop = 0; + memset(&tbl8_entry[0], 0, RTE_LPM_TBL8_GROUP_NUM_ENTRIES * sizeof(tbl8_entry[0])); - tbl8_entry->valid_group = VALID; + __atomic_store(tbl8_entry, &new_tbl8_entry, + __ATOMIC_RELAXED); /* Return group index for allocated tbl8 group. */ return group_idx; @@ -680,11 +688,19 @@ tbl8_alloc_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t number_tbl8s) tbl8_entry = &tbl8[group_idx * RTE_LPM_TBL8_GROUP_NUM_ENTRIES]; /* If a free tbl8 group is found clean it and set as VALID. */ if (!tbl8_entry->valid_group) { + struct rte_lpm_tbl_entry new_tbl8_entry = { + .next_hop = 0, + .valid = INVALID, + .depth = 0, + .valid_group = VALID, + }; + memset(&tbl8_entry[0], 0, RTE_LPM_TBL8_GROUP_NUM_ENTRIES * sizeof(tbl8_entry[0])); - tbl8_entry->valid_group = VALID; + __atomic_store(tbl8_entry, &new_tbl8_entry, + __ATOMIC_RELAXED); /* Return group index for allocated tbl8 group. 
*/ return group_idx; @@ -699,14 +715,25 @@ static void tbl8_free_v20(struct rte_lpm_tbl_entry_v20 *tbl8, uint32_t tbl8_group_start) { /* Set tbl8 group invalid*/ - tbl8[tbl8_group_start].valid_group = INVALID; + struct rte_lpm_tbl_entry_v20 zero_tbl8_entry = { + .valid = INVALID, + .depth = 0, + .valid_group = INVALID, + }; + zero_tbl8_entry.next_hop = 0; + + __atomic_store(&tbl8[tbl8_group_start], &zero_tbl8_entry, + __ATOMIC_RELAXED); } static void tbl8_free_v1604(struct rte_lpm_tbl_entry *tbl8, uint32_t tbl8_group_start) { /* Set tbl8 group invalid*/ - tbl8[tbl8_group_start].valid_group = INVALID; + struct rte_lpm_tbl_entry zero_tbl8_entry = {0}; + + __atomic_store(&tbl8[tbl8_group_start], &zero_tbl8_entry, + __ATOMIC_RELAXED); } static __rte_noinline int32_t @@ -767,7 +794,9 @@ add_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip, uint8_t depth, * Setting tbl8 entry in one go to avoid * race conditions */ - lpm->tbl8[j] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[j], + &new_tbl8_entry, + __ATOMIC_RELAXED); continue; } @@ -837,7 +866,9 @@ add_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip, uint8_t depth, * Setting tbl8 entry in one go to avoid * race conditions */ - lpm->tbl8[j] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[j], + &new_tbl8_entry, + __ATOMIC_RELAXED); continue; } @@ -875,9 +906,14 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, /* Set tbl8 entry. */ for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { - lpm->tbl8[i].depth = depth; - lpm->tbl8[i].next_hop = next_hop; - lpm->tbl8[i].valid = VALID; + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { + .valid = VALID, + .depth = depth, + .valid_group = lpm->tbl8[i].valid_group, + }; + new_tbl8_entry.next_hop = next_hop; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } /* @@ -912,19 +948,29 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, /* Populate new tbl8 with tbl24 value. 
*/ for (i = tbl8_group_start; i < tbl8_group_end; i++) { - lpm->tbl8[i].valid = VALID; - lpm->tbl8[i].depth = lpm->tbl24[tbl24_index].depth; - lpm->tbl8[i].next_hop = - lpm->tbl24[tbl24_index].next_hop; + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { + .valid = VALID, + .depth = lpm->tbl24[tbl24_index].depth, + .valid_group = lpm->tbl8[i].valid_group, + }; + new_tbl8_entry.next_hop = + lpm->tbl24[tbl24_index].next_hop; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } tbl8_index = tbl8_group_start + (ip_masked & 0xFF); /* Insert new rule into the tbl8 entry. */ for (i = tbl8_index; i < tbl8_index + tbl8_range; i++) { - lpm->tbl8[i].valid = VALID; - lpm->tbl8[i].depth = depth; - lpm->tbl8[i].next_hop = next_hop; + struct rte_lpm_tbl_entry_v20 new_tbl8_entry = { + .valid = VALID, + .depth = depth, + .valid_group = lpm->tbl8[i].valid_group, + }; + new_tbl8_entry.next_hop = next_hop; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } /* @@ -965,7 +1011,8 @@ add_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, uint8_t depth, * Setting tbl8 entry in one go to avoid race * condition */ - lpm->tbl8[i] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); continue; } @@ -1003,9 +1050,14 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, /* Set tbl8 entry. */ for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { - lpm->tbl8[i].depth = depth; - lpm->tbl8[i].next_hop = next_hop; - lpm->tbl8[i].valid = VALID; + struct rte_lpm_tbl_entry new_tbl8_entry = { + .valid = VALID, + .depth = depth, + .valid_group = lpm->tbl8[i].valid_group, + .next_hop = next_hop, + }; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } /* @@ -1043,19 +1095,28 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, /* Populate new tbl8 with tbl24 value. 
*/ for (i = tbl8_group_start; i < tbl8_group_end; i++) { - lpm->tbl8[i].valid = VALID; - lpm->tbl8[i].depth = lpm->tbl24[tbl24_index].depth; - lpm->tbl8[i].next_hop = - lpm->tbl24[tbl24_index].next_hop; + struct rte_lpm_tbl_entry new_tbl8_entry = { + .valid = VALID, + .depth = lpm->tbl24[tbl24_index].depth, + .valid_group = lpm->tbl8[i].valid_group, + .next_hop = lpm->tbl24[tbl24_index].next_hop, + }; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } tbl8_index = tbl8_group_start + (ip_masked & 0xFF); /* Insert new rule into the tbl8 entry. */ for (i = tbl8_index; i < tbl8_index + tbl8_range; i++) { - lpm->tbl8[i].valid = VALID; - lpm->tbl8[i].depth = depth; - lpm->tbl8[i].next_hop = next_hop; + struct rte_lpm_tbl_entry new_tbl8_entry = { + .valid = VALID, + .depth = depth, + .valid_group = lpm->tbl8[i].valid_group, + .next_hop = next_hop, + }; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } /* @@ -1100,7 +1161,8 @@ add_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, uint8_t depth, * Setting tbl8 entry in one go to avoid race * condition */ - lpm->tbl8[i] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); continue; } @@ -1393,7 +1455,9 @@ delete_depth_small_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, RTE_LPM_TBL8_GROUP_NUM_ENTRIES); j++) { if (lpm->tbl8[j].depth <= depth) - lpm->tbl8[j] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[j], + &new_tbl8_entry, + __ATOMIC_RELAXED); } } } @@ -1490,7 +1554,9 @@ delete_depth_small_v1604(struct rte_lpm *lpm, uint32_t ip_masked, RTE_LPM_TBL8_GROUP_NUM_ENTRIES); j++) { if (lpm->tbl8[j].depth <= depth) - lpm->tbl8[j] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[j], + &new_tbl8_entry, + __ATOMIC_RELAXED); } } } @@ -1646,7 +1712,8 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, */ for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { if (lpm->tbl8[i].depth <= depth) - lpm->tbl8[i] = new_tbl8_entry; + 
__atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } } @@ -1677,7 +1744,8 @@ delete_depth_big_v20(struct rte_lpm_v20 *lpm, uint32_t ip_masked, /* Set tbl24 before freeing tbl8 to avoid race condition. * Prevent the free of the tbl8 group from hoisting. */ - lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELAXED); __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v20(lpm->tbl8, tbl8_group_start); } @@ -1730,7 +1798,8 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, */ for (i = tbl8_index; i < (tbl8_index + tbl8_range); i++) { if (lpm->tbl8[i].depth <= depth) - lpm->tbl8[i] = new_tbl8_entry; + __atomic_store(&lpm->tbl8[i], &new_tbl8_entry, + __ATOMIC_RELAXED); } } @@ -1761,7 +1830,8 @@ delete_depth_big_v1604(struct rte_lpm *lpm, uint32_t ip_masked, /* Set tbl24 before freeing tbl8 to avoid race condition. * Prevent the free of the tbl8 group from hoisting. */ - lpm->tbl24[tbl24_index] = new_tbl24_entry; + __atomic_store(&lpm->tbl24[tbl24_index], &new_tbl24_entry, + __ATOMIC_RELAXED); __atomic_thread_fence(__ATOMIC_RELEASE); tbl8_free_v1604(lpm->tbl8, tbl8_group_start); } -- 2.17.1 ^ permalink raw reply [flat|nested] 24+ messages in thread
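The delete-path pattern above — store to tbl24, release fence, then `tbl8_free` — can be condensed into a sketch. Names and encodings are illustrative, not the DPDK API: the tbl24 entry must be detached before the tbl8 group is recycled, or a concurrent reader could chase a stale tbl24 entry into a group already marked free; the release fence keeps the freeing store from being hoisted above the detaching store.

```c
#include <assert.h>
#include <stdint.h>

static void detach_and_free_group(uint32_t *tbl24_slot,
				  uint32_t *tbl8_group_head)
{
	/* Detach: readers are no longer routed into the group. */
	__atomic_store_n(tbl24_slot, 0, __ATOMIC_RELAXED);

	/* Order the detach before the free below; without this fence
	 * the compiler or CPU could make the free visible first. */
	__atomic_thread_fence(__ATOMIC_RELEASE);

	/* "Free": clear the group head so the allocator may reuse it
	 * (stands in for tbl8_free_v1604 marking valid_group INVALID). */
	__atomic_store_n(tbl8_group_head, 0, __ATOMIC_RELAXED);
}
```

Note the stores themselves can stay relaxed: the standalone fence provides the release ordering between them.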
* Re: [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes
  2019-07-18  6:22 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Ruifeng Wang
                     ` (3 preceding siblings ...)
  2019-07-18  6:22 ` [dpdk-dev] [PATCH v6 4/4] lib/lpm: use atomic store to avoid partial update Ruifeng Wang
@ 2019-07-18 14:00 ` Medvedkin, Vladimir
  2019-07-19 10:37   ` Thomas Monjalon
  4 siblings, 1 reply; 24+ messages in thread
From: Medvedkin, Vladimir @ 2019-07-18 14:00 UTC (permalink / raw)
To: Ruifeng Wang, bruce.richardson; +Cc: dev, honnappa.nagarahalli, gavin.hu, nd

Hi Wang,

On 18/07/2019 07:22, Ruifeng Wang wrote:
> LPM4 uses the DIR24-8 method of routing info data organization.
> A routing rule with a prefix longer than 24 bits will be stored
> in a tbl24 entry together with an associated tbl8 group.
>
> When a tbl8 group is getting attached to a tbl24 entry, lookup
> might fail even though the entry is configured in the table.
>
> For example, consider an LPM table configured with 10.10.10.1/24.
> When a new entry 10.10.10.32/28 is being added, a new tbl8
> group is allocated and the tbl24 entry is changed to point to
> the tbl8 group. If the tbl24 entry is written before the tbl8
> group entries are updated, a lookup on 10.10.10.9 will return
> failure.
>
> Correct memory orderings are required to ensure that the
> store to tbl24 does not happen before the stores to the tbl8
> group entries complete.
>
> When memory orderings are imposed, API performance could drop.
> However, a patch to de-inline unnecessary functions is added,
> and this helps to keep API performance.
>
> On the Arm A72 platform, the whole patch series has no notable
> performance impact.
>
> On the x86 E5-2640 platform, the whole patch series showed a 2.6%
> performance improvement on add, and no impact on lookup or delete.
>
> v6:
> Merge multiple relaxed atomic store patches into one.
> v5:
> Fix typo in commit message.
> Remove all other inlines.
> Change to use atomic_store in tbl8_alloc / tbl8_free.
> Merge multiple single field updates into one atomic store.
> v4:
> Change alignment attribute parameter.
> Use atomic store to avoid partial update.
> v3:
> Use __rte_noinline to force no inline.
> v2:
> Fix clang building issue by supplying alignment attribute.
>
> Ruifeng Wang (4):
>   lib/lpm: not inline unnecessary functions
>   lib/lpm: memory orderings to avoid race conditions for v1604
>   lib/lpm: memory orderings to avoid race conditions for v20
>   lib/lpm: use atomic store to avoid partial update
>
>  lib/librte_lpm/rte_lpm.c | 248 ++++++++++++++++++++++++++++-----------
>  lib/librte_lpm/rte_lpm.h |   8 +-
>  2 files changed, 183 insertions(+), 73 deletions(-)

Looks ok to me, thanks!

Series-acked-by: Vladimir Medvedkin <vladimir.medvedkin@intel.com>

--
Regards,
Vladimir

^ permalink raw reply	[flat|nested] 24+ messages in thread
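[Editor's note] The DIR24-8 organization described in the cover letter comes down to simple index arithmetic: the top 24 bits of the address index tbl24, and the low 8 bits select an entry inside a 256-entry tbl8 group. A small sketch of that arithmetic (helper names are illustrative, not the DPDK API):

```c
#include <stdint.h>

/* tbl24 is indexed by the top 24 bits of the IPv4 address, so any
 * route of length /24 or shorter resolves with a single lookup. */
uint32_t tbl24_index(uint32_t ip)
{
	return ip >> 8;
}

/* Routes longer than /24 spill into a 256-entry tbl8 group; the low
 * 8 bits of the address select the entry within that group. */
uint32_t tbl8_index(uint32_t tbl8_group_start, uint32_t ip)
{
	return tbl8_group_start + (ip & 0xFF);
}
```

This is why the ordering bug matters: once tbl24_index() points at a tbl8 group, every address sharing that /24 is resolved through the group, so the group must be fully written before the tbl24 entry is published.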
* Re: [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes
  2019-07-18 14:00 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Medvedkin, Vladimir
@ 2019-07-19 10:37   ` Thomas Monjalon
  0 siblings, 0 replies; 24+ messages in thread
From: Thomas Monjalon @ 2019-07-19 10:37 UTC (permalink / raw)
To: Ruifeng Wang
Cc: dev, Medvedkin, Vladimir, bruce.richardson, honnappa.nagarahalli, gavin.hu, nd

18/07/2019 16:00, Medvedkin, Vladimir:
> On 18/07/2019 07:22, Ruifeng Wang wrote:
> > Ruifeng Wang (4):
> >   lib/lpm: not inline unnecessary functions
> >   lib/lpm: memory orderings to avoid race conditions for v1604
> >   lib/lpm: memory orderings to avoid race conditions for v20
> >   lib/lpm: use atomic store to avoid partial update
>
> Looks ok to me, thanks!
>
> Series-acked-by: Vladimir Medvedkin <vladimir.medvedkin@intel.com>

Applied, thanks

^ permalink raw reply	[flat|nested] 24+ messages in thread
end of thread, other threads:[~2019-07-19 10:37 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-05  5:54 [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang
2019-06-05  5:54 ` [dpdk-dev] [PATCH v1 2/2] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang
2019-06-05 10:50 ` [dpdk-dev] [PATCH v1 1/2] lib/lpm: memory orderings to avoid race conditions for v1604 Medvedkin, Vladimir
2019-06-05 14:12   ` Ruifeng Wang (Arm Technology China)
2019-06-05 19:23     ` Honnappa Nagarahalli
2019-06-10 15:22       ` Medvedkin, Vladimir
2019-06-17 15:27         ` Ruifeng Wang (Arm Technology China)
2019-06-17 15:33           ` Medvedkin, Vladimir
2019-07-12  3:09 ` [dpdk-dev] [PATCH v5 0/6] LPM4 memory ordering changes Ruifeng Wang
2019-07-12  3:09   ` [dpdk-dev] [PATCH v5 1/6] lib/lpm: not inline unnecessary functions Ruifeng Wang
2019-07-12  3:09   ` [dpdk-dev] [PATCH v5 2/6] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang
2019-07-12  3:09   ` [dpdk-dev] [PATCH v5 3/6] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang
2019-07-12  3:09   ` [dpdk-dev] [PATCH v5 4/6] lib/lpm: use atomic store to avoid partial update Ruifeng Wang
2019-07-12  3:09   ` [dpdk-dev] [PATCH v5 5/6] lib/lpm: data update optimization for v1604 Ruifeng Wang
2019-07-12 20:08     ` Honnappa Nagarahalli
2019-07-12  3:09   ` [dpdk-dev] [PATCH v5 6/6] lib/lpm: data update optimization for v20 Ruifeng Wang
2019-07-12 20:09     ` Honnappa Nagarahalli
2019-07-18  6:22 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Ruifeng Wang
2019-07-18  6:22   ` [dpdk-dev] [PATCH v6 1/4] lib/lpm: not inline unnecessary functions Ruifeng Wang
2019-07-18  6:22   ` [dpdk-dev] [PATCH v6 2/4] lib/lpm: memory orderings to avoid race conditions for v1604 Ruifeng Wang
2019-07-18  6:22   ` [dpdk-dev] [PATCH v6 3/4] lib/lpm: memory orderings to avoid race conditions for v20 Ruifeng Wang
2019-07-18  6:22   ` [dpdk-dev] [PATCH v6 4/4] lib/lpm: use atomic store to avoid partial update Ruifeng Wang
2019-07-18 14:00 ` [dpdk-dev] [PATCH v6 0/4] LPM4 memory ordering changes Medvedkin, Vladimir
2019-07-19 10:37   ` Thomas Monjalon