* [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx @ 2021-03-18 7:18 Feifei Wang 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 1/4] net/mlx4: fix rebuild bug for Memory Region cache Feifei Wang ` (6 more replies) 0 siblings, 7 replies; 36+ messages in thread From: Feifei Wang @ 2021-03-18 7:18 UTC (permalink / raw) Cc: dev, nd, Feifei Wang For net/mlx4 and net/mlx5, fix cache rebuild bug and replace SMP barriers with atomic fence. Feifei Wang (4): net/mlx4: fix rebuild bug for Memory Region cache net/mlx4: replace SMP barrier with C11 barriers net/mlx5: fix rebuild bug for Memory Region cache net/mlx5: replace SMP barriers with C11 barriers drivers/net/mlx4/mlx4_mr.c | 21 +++++++++---------- drivers/net/mlx5/mlx5_mr.c | 41 ++++++++++++++++++-------------------- 2 files changed, 28 insertions(+), 34 deletions(-) -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v1 1/4] net/mlx4: fix rebuild bug for Memory Region cache 2021-03-18 7:18 [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Feifei Wang @ 2021-03-18 7:18 ` Feifei Wang 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 2/4] net/mlx4: replace SMP barrier with C11 barriers Feifei Wang ` (5 subsequent siblings) 6 siblings, 0 replies; 36+ messages in thread From: Feifei Wang @ 2021-03-18 7:18 UTC (permalink / raw) To: Matan Azrad, Shahaf Shuler, Yongseok Koh Cc: dev, nd, Feifei Wang, stable, Ruifeng Wang 'dev_gen' is a variable that informs other cores to flush their local cache when the global cache is rebuilt. However, if 'dev_gen' is updated after the global cache is rebuilt, other cores may load a wrong memory region lkey value from the old local cache. Timeslot main core worker core 1 rebuild global cache 2 load unchanged dev_gen 3 update dev_gen 4 look up old local cache From the example above, we can see that although the global cache is rebuilt, because dev_gen is not updated, the worker core may look up the old cache table and receive a wrong memory region lkey value. To fix this, the 'dev_gen' update should be moved before the global cache rebuild, to inform worker cores to flush their local cache when the global cache starts rebuilding. A wmb ensures the ordering of this sequence. 
Fixes: 9797bfcce1c9 ("net/mlx4: add new memory region support") Cc: stable@dpdk.org Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> Signed-off-by: Feifei Wang <feifei.wang2@arm.com> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> --- drivers/net/mlx4/mlx4_mr.c | 19 ++++++++----------- 1 file changed, 8 insertions(+), 11 deletions(-) diff --git a/drivers/net/mlx4/mlx4_mr.c b/drivers/net/mlx4/mlx4_mr.c index 6b2f0cf18..cfd7d4a9c 100644 --- a/drivers/net/mlx4/mlx4_mr.c +++ b/drivers/net/mlx4/mlx4_mr.c @@ -946,20 +946,17 @@ mlx4_mr_mem_event_free_cb(struct rte_eth_dev *dev, const void *addr, size_t len) rebuild = 1; } if (rebuild) { - mr_rebuild_dev_cache(dev); - /* - * Flush local caches by propagating invalidation across cores. - * rte_smp_wmb() is enough to synchronize this event. If one of - * freed memsegs is seen by other core, that means the memseg - * has been allocated by allocator, which will come after this - * free call. Therefore, this store instruction (incrementing - * generation below) will be guaranteed to be seen by other core - * before the core sees the newly allocated memory. - */ ++priv->mr.dev_gen; DEBUG("broadcasting local cache flush, gen=%d", - priv->mr.dev_gen); + priv->mr.dev_gen); + + /* Flush local caches by propagating invalidation across cores. + * rte_smp_wmb is to keep the order that dev_gen updated before + * rebuilding global cache. Therefore, other core can flush their + * local cache on time. + */ rte_smp_wmb(); + mr_rebuild_dev_cache(dev); } rte_rwlock_write_unlock(&priv->mr.rwlock); #ifdef RTE_LIBRTE_MLX4_DEBUG -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v1 2/4] net/mlx4: replace SMP barrier with C11 barriers 2021-03-18 7:18 [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Feifei Wang 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 1/4] net/mlx4: fix rebuild bug for Memory Region cache Feifei Wang @ 2021-03-18 7:18 ` Feifei Wang 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache Feifei Wang ` (4 subsequent siblings) 6 siblings, 0 replies; 36+ messages in thread From: Feifei Wang @ 2021-03-18 7:18 UTC (permalink / raw) To: Matan Azrad, Shahaf Shuler; +Cc: dev, nd, Feifei Wang, Ruifeng Wang Replace SMP barrier with atomic thread fence. Signed-off-by: Feifei Wang <feifei.wang2@arm.com> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> --- drivers/net/mlx4/mlx4_mr.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/mlx4/mlx4_mr.c b/drivers/net/mlx4/mlx4_mr.c index cfd7d4a9c..503e8a7bb 100644 --- a/drivers/net/mlx4/mlx4_mr.c +++ b/drivers/net/mlx4/mlx4_mr.c @@ -951,11 +951,11 @@ mlx4_mr_mem_event_free_cb(struct rte_eth_dev *dev, const void *addr, size_t len) priv->mr.dev_gen); /* Flush local caches by propagating invalidation across cores. - * rte_smp_wmb is to keep the order that dev_gen updated before + * release-fence is to keep the order that dev_gen updated before * rebuilding global cache. Therefore, other core can flush their * local cache on time. */ - rte_smp_wmb(); + rte_atomic_thread_fence(__ATOMIC_RELEASE); mr_rebuild_dev_cache(dev); } rte_rwlock_write_unlock(&priv->mr.rwlock); -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-03-18 7:18 [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Feifei Wang 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 1/4] net/mlx4: fix rebuild bug for Memory Region cache Feifei Wang 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 2/4] net/mlx4: replace SMP barrier with C11 barriers Feifei Wang @ 2021-03-18 7:18 ` Feifei Wang 2021-04-12 8:27 ` Slava Ovsiienko 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 4/4] net/mlx5: replace SMP barriers with C11 barriers Feifei Wang ` (3 subsequent siblings) 6 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-03-18 7:18 UTC (permalink / raw) To: Matan Azrad, Shahaf Shuler, Viacheslav Ovsiienko, Yongseok Koh Cc: dev, nd, Feifei Wang, stable, Ruifeng Wang 'dev_gen' is a variable that informs other cores to flush their local cache when the global cache is rebuilt. However, if 'dev_gen' is updated after the global cache is rebuilt, other cores may load a wrong memory region lkey value from the old local cache. Timeslot main core worker core 1 rebuild global cache 2 load unchanged dev_gen 3 update dev_gen 4 look up old local cache From the example above, we can see that although the global cache is rebuilt, because dev_gen is not updated, the worker core may look up the old cache table and receive a wrong memory region lkey value. To fix this, the 'dev_gen' update should be moved before the global cache rebuild, to inform worker cores to flush their local cache when the global cache starts rebuilding. A wmb ensures the ordering of this sequence. 
Fixes: 974f1e7ef146 ("net/mlx5: add new memory region support") Cc: stable@dpdk.org Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> Signed-off-by: Feifei Wang <feifei.wang2@arm.com> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> --- drivers/net/mlx5/mlx5_mr.c | 37 +++++++++++++++++-------------------- 1 file changed, 17 insertions(+), 20 deletions(-) diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c index da4e91fc2..7ce1d3e64 100644 --- a/drivers/net/mlx5/mlx5_mr.c +++ b/drivers/net/mlx5/mlx5_mr.c @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct mlx5_dev_ctx_shared *sh, rebuild = 1; } if (rebuild) { - mlx5_mr_rebuild_cache(&sh->share_cache); + ++sh->share_cache.dev_gen; + DEBUG("broadcasting local cache flush, gen=%d", + sh->share_cache.dev_gen); + /* * Flush local caches by propagating invalidation across cores. - * rte_smp_wmb() is enough to synchronize this event. If one of - * freed memsegs is seen by other core, that means the memseg - * has been allocated by allocator, which will come after this - * free call. Therefore, this store instruction (incrementing - * generation below) will be guaranteed to be seen by other core - * before the core sees the newly allocated memory. + * rte_smp_wmb() is to keep the order that dev_gen updated before + * rebuilding global cache. Therefore, other core can flush their + * local cache on time. 
*/ - ++sh->share_cache.dev_gen; - DEBUG("broadcasting local cache flush, gen=%d", - sh->share_cache.dev_gen); rte_smp_wmb(); + mlx5_mr_rebuild_cache(&sh->share_cache); } rte_rwlock_write_unlock(&sh->share_cache.rwlock); } @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr, mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); DEBUG("port %u remove MR(%p) from list", dev->data->port_id, (void *)mr); - mlx5_mr_rebuild_cache(&sh->share_cache); + + ++sh->share_cache.dev_gen; + DEBUG("broadcasting local cache flush, gen=%d", + sh->share_cache.dev_gen); + /* * Flush local caches by propagating invalidation across cores. - * rte_smp_wmb() is enough to synchronize this event. If one of - * freed memsegs is seen by other core, that means the memseg - * has been allocated by allocator, which will come after this - * free call. Therefore, this store instruction (incrementing - * generation below) will be guaranteed to be seen by other core - * before the core sees the newly allocated memory. + * rte_smp_wmb() is to keep the order that dev_gen updated before + * rebuilding global cache. Therefore, other core can flush their + * local cache on time. */ - ++sh->share_cache.dev_gen; - DEBUG("broadcasting local cache flush, gen=%d", - sh->share_cache.dev_gen); rte_smp_wmb(); + mlx5_mr_rebuild_cache(&sh->share_cache); rte_rwlock_read_unlock(&sh->share_cache.rwlock); return 0; } -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache Feifei Wang @ 2021-04-12 8:27 ` Slava Ovsiienko 2021-04-13 5:20 ` [dpdk-dev] Re: " Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Slava Ovsiienko @ 2021-04-12 8:27 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler, Yongseok Koh Cc: dev, nd, stable, Ruifeng Wang Hi, Feifei Sorry, I do not follow what this patch fixes. Do we have some issue/bug with the MR cache in practice? Each Tx queue has its own dedicated "local" cache for MRs to convert buffer addresses in mbufs being transmitted to LKeys (HW-related entity handles), and there is the "global" cache for all MRs registered on the device. AFAIK, this is how conversion happens in the datapath: - check the local queue cache flush request - lookup in local cache - if not found: - acquire lock for global cache read access - lookup in global cache - release lock for global cache This is how the cache update on memory freeing/unregistering happens: - acquire lock for global cache write access - [a] remove relevant MRs from the global cache - [b] set local caches flush request - free global cache lock If I understand correctly, your patch swaps [a] and [b], and the local cache flush is requested earlier. What problem does it solve? There are not supposed to be any mbufs in the datapath referencing the memory being freed. The application must ensure this and must not allocate new mbufs from the memory regions being freed. Hence, lookups for these MRs in the caches should not occur. On the other hand, the cache flush has a negative effect - the local cache gets emptied and can't provide translation for other valid (not being removed) MRs, so the translation has to look up in the global cache, which is locked now for rebuilding; this causes delays in the datapath on acquiring the global cache lock. So, I see some potential performance impact. 
With best regards, Slava > -----Original Message----- > From: Feifei Wang <feifei.wang2@arm.com> > Sent: Thursday, March 18, 2021 9:19 > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>; > Yongseok Koh <yskoh@mellanox.com> > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>; > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com> > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > 'dev_gen' is a variable to inform other cores to flush their local cache when > global cache is rebuilt. > > However, if 'dev_gen' is updated after global cache is rebuilt, other cores > may load a wrong memory region lkey value from old local cache. > > Timeslot main core worker core > 1 rebuild global cache > 2 load unchanged dev_gen > 3 update dev_gen > 4 look up old local cache > > From the example above, we can see that though global cache is rebuilt, due > to that dev_gen is not updated, the worker core may look up old cache table > and receive a wrong memory region lkey value. > > To fix this, updating 'dev_gen' should be moved before rebuilding global > cache to inform worker cores to flush their local cache when global cache > start rebuilding. And wmb can ensure the sequence of this process. 
> > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region support") > Cc: stable@dpdk.org > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > --- > drivers/net/mlx5/mlx5_mr.c | 37 +++++++++++++++++-------------------- > 1 file changed, 17 insertions(+), 20 deletions(-) > > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c index > da4e91fc2..7ce1d3e64 100644 > --- a/drivers/net/mlx5/mlx5_mr.c > +++ b/drivers/net/mlx5/mlx5_mr.c > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct > mlx5_dev_ctx_shared *sh, > rebuild = 1; > } > if (rebuild) { > - mlx5_mr_rebuild_cache(&sh->share_cache); > + ++sh->share_cache.dev_gen; > + DEBUG("broadcasting local cache flush, gen=%d", > + sh->share_cache.dev_gen); > + > /* > * Flush local caches by propagating invalidation across cores. > - * rte_smp_wmb() is enough to synchronize this event. If > one of > - * freed memsegs is seen by other core, that means the > memseg > - * has been allocated by allocator, which will come after this > - * free call. Therefore, this store instruction (incrementing > - * generation below) will be guaranteed to be seen by other > core > - * before the core sees the newly allocated memory. > + * rte_smp_wmb() is to keep the order that dev_gen > updated before > + * rebuilding global cache. Therefore, other core can flush > their > + * local cache on time. 
> */ > - ++sh->share_cache.dev_gen; > - DEBUG("broadcasting local cache flush, gen=%d", > - sh->share_cache.dev_gen); > rte_smp_wmb(); > + mlx5_mr_rebuild_cache(&sh->share_cache); > } > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > } > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, > void *addr, > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > DEBUG("port %u remove MR(%p) from list", dev->data->port_id, > (void *)mr); > - mlx5_mr_rebuild_cache(&sh->share_cache); > + > + ++sh->share_cache.dev_gen; > + DEBUG("broadcasting local cache flush, gen=%d", > + sh->share_cache.dev_gen); > + > /* > * Flush local caches by propagating invalidation across cores. > - * rte_smp_wmb() is enough to synchronize this event. If one of > - * freed memsegs is seen by other core, that means the memseg > - * has been allocated by allocator, which will come after this > - * free call. Therefore, this store instruction (incrementing > - * generation below) will be guaranteed to be seen by other core > - * before the core sees the newly allocated memory. > + * rte_smp_wmb() is to keep the order that dev_gen updated > before > + * rebuilding global cache. Therefore, other core can flush their > + * local cache on time. > */ > - ++sh->share_cache.dev_gen; > - DEBUG("broadcasting local cache flush, gen=%d", > - sh->share_cache.dev_gen); > rte_smp_wmb(); > + mlx5_mr_rebuild_cache(&sh->share_cache); > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > return 0; > } > -- > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-04-12 8:27 ` Slava Ovsiienko @ 2021-04-13 5:20 ` Feifei Wang 2021-04-19 18:50 ` [dpdk-dev] " Slava Ovsiienko 0 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-04-13 5:20 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler, yskoh Cc: dev, nd, stable, Ruifeng Wang, nd Hi, Slava Thanks very much for your attention. Best Regards Feifei > -----Original Message----- > From: Slava Ovsiienko <viacheslavo@nvidia.com> > Sent: April 12, 2021 16:28 > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; > yskoh@mellanox.com > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com> > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > Hi, Feifei > > Sorry, I do not follow what this patch fixes. Do we have some issue/bug with > MR cache in practice? This patch fixes a bug found by logical deduction; it has not actually been observed in practice. > > Each Tx queue has its own dedicated "local" cache for MRs to convert buffer > address in mbufs being transmitted to LKeys (HW-related entity handle) and > the "global" cache for all MR registered on the device. > > AFAIK, how conversion happens in datapath: > - check the local queue cache flush request > - lookup in local cache > - if not found: > - acquire lock for global cache read access > - lookup in global cache > - release lock for global cache > > How cache update on memory freeing/unregistering happens: > - acquire lock for global cache write access > - [a] remove relevant MRs from the global cache > - [b] set local caches flush request > - free global cache lock > > If I understand correctly, your patch swaps [a] and [b], and local caches flush > is requested earlier. What problem does it solve? > It is not supposed there are in datapath some mbufs referencing to the > memory being freed. 
Application must ensure this and must not allocate new > mbufs from this memory regions being freed. Hence, the lookups for these > MRs in caches should not occur. On your first point: the application can take charge of preventing freed MR memory from being allocated to the data path. Does it mean that if there is an urgent MR free event, such as hotplug, the application must inform the data path in advance, so that this memory will not be allocated, and then the control path will free this memory? If the application can do this, I agree that this bug cannot happen. > For other side, the cache flush has negative effect - the local cache is getting > empty and can't provide translation for other valid (not being removed) MRs, > and the translation has to look up in the global cache, that is locked now for > rebuilding, this causes the delays in datapatch on acquiring global cache lock. > So, I see some potential performance impact. If the above assumption is true, we can go to your second point. I think this is a matter of the tradeoff between cache coherence and performance. I understand your point that though the global cache has been changed, we should keep the valid MRs in the local cache as long as possible to keep lookups fast. Meanwhile, the local cache can be rebuilt later to reduce its waiting time for acquiring the global cache lock. However, this mechanism only keeps performance unchanged for the first few mbufs. During subsequent mbuf lkey searches after 'dev_gen' is updated, it is still necessary to update the local cache, so performance first drops and then recovers. Thus, with or without this patch, performance will jitter for a certain period of time. 
In conclusion, I tend to think the lower layer can do more to ensure correct execution of the program; this may hurt performance for a short time, but in the long run performance will come back. Furthermore, perhaps we should focus on performance in the steady state, and do our best to ensure the correctness of the program in corner cases. Best Regards Feifei > With best regards, > Slava > > > -----Original Message----- > > From: Feifei Wang <feifei.wang2@arm.com> > > Sent: Thursday, March 18, 2021 9:19 > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>; > > Yongseok Koh <yskoh@mellanox.com> > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>; > > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com> > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > cache > > > > 'dev_gen' is a variable to inform other cores to flush their local > > cache when global cache is rebuilt. > > > > However, if 'dev_gen' is updated after global cache is rebuilt, other > > cores may load a wrong memory region lkey value from old local cache. > > > > Timeslot main core worker core > > 1 rebuild global cache > > 2 load unchanged dev_gen > > 3 update dev_gen > > 4 look up old local cache > > > > From the example above, we can see that though global cache is > > rebuilt, due to that dev_gen is not updated, the worker core may look > > up old cache table and receive a wrong memory region lkey value. > > > > To fix this, updating 'dev_gen' should be moved before rebuilding > > global cache to inform worker cores to flush their local cache when > > global cache start rebuilding. And wmb can ensure the sequence of this > process. 
> > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region support") > > Cc: stable@dpdk.org > > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > --- > > drivers/net/mlx5/mlx5_mr.c | 37 +++++++++++++++++-------------------- > > 1 file changed, 17 insertions(+), 20 deletions(-) > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c > > index > > da4e91fc2..7ce1d3e64 100644 > > --- a/drivers/net/mlx5/mlx5_mr.c > > +++ b/drivers/net/mlx5/mlx5_mr.c > > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct > > mlx5_dev_ctx_shared *sh, > > rebuild = 1; > > } > > if (rebuild) { > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > + ++sh->share_cache.dev_gen; > > + DEBUG("broadcasting local cache flush, gen=%d", > > + sh->share_cache.dev_gen); > > + > > /* > > * Flush local caches by propagating invalidation across cores. > > - * rte_smp_wmb() is enough to synchronize this event. If > > one of > > - * freed memsegs is seen by other core, that means the > > memseg > > - * has been allocated by allocator, which will come after this > > - * free call. Therefore, this store instruction (incrementing > > - * generation below) will be guaranteed to be seen by other > > core > > - * before the core sees the newly allocated memory. > > + * rte_smp_wmb() is to keep the order that dev_gen > > updated before > > + * rebuilding global cache. Therefore, other core can flush > > their > > + * local cache on time. 
> > */ > > - ++sh->share_cache.dev_gen; > > - DEBUG("broadcasting local cache flush, gen=%d", > > - sh->share_cache.dev_gen); > > rte_smp_wmb(); > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > } > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > } > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, > void > > *addr, > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > > DEBUG("port %u remove MR(%p) from list", dev->data->port_id, > > (void *)mr); > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > + > > + ++sh->share_cache.dev_gen; > > + DEBUG("broadcasting local cache flush, gen=%d", > > + sh->share_cache.dev_gen); > > + > > /* > > * Flush local caches by propagating invalidation across cores. > > - * rte_smp_wmb() is enough to synchronize this event. If one of > > - * freed memsegs is seen by other core, that means the memseg > > - * has been allocated by allocator, which will come after this > > - * free call. Therefore, this store instruction (incrementing > > - * generation below) will be guaranteed to be seen by other core > > - * before the core sees the newly allocated memory. > > + * rte_smp_wmb() is to keep the order that dev_gen updated > > before > > + * rebuilding global cache. Therefore, other core can flush their > > + * local cache on time. > > */ > > - ++sh->share_cache.dev_gen; > > - DEBUG("broadcasting local cache flush, gen=%d", > > - sh->share_cache.dev_gen); > > rte_smp_wmb(); > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > > return 0; > > } > > -- > > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-04-13 5:20 ` [dpdk-dev] Re: " Feifei Wang @ 2021-04-19 18:50 ` Slava Ovsiienko 2021-04-20 5:53 ` [dpdk-dev] Re: " Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Slava Ovsiienko @ 2021-04-19 18:50 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler; +Cc: dev, nd, stable, Ruifeng Wang, nd Hi, Feifei Please, see below .... > > Hi, Feifei > > > > Sorry, I do not follow what this patch fixes. Do we have some > > issue/bug with MR cache in practice? > > This patch fixes the bug which is based on logical deduction, and it doesn't > actually happen. > > > > > Each Tx queue has its own dedicated "local" cache for MRs to convert > > buffer address in mbufs being transmitted to LKeys (HW-related entity > > handle) and the "global" cache for all MR registered on the device. > > > > AFAIK, how conversion happens in datapath: > > - check the local queue cache flush request > > - lookup in local cache > > - if not found: > > - acquire lock for global cache read access > > - lookup in global cache > > - release lock for global cache > > > > How cache update on memory freeing/unregistering happens: > > - acquire lock for global cache write access > > - [a] remove relevant MRs from the global cache > > - [b] set local caches flush request > > - free global cache lock > > > > If I understand correctly, your patch swaps [a] and [b], and local > > caches flush is requested earlier. What problem does it solve? > > It is not supposed there are in datapath some mbufs referencing to the > > memory being freed. Application must ensure this and must not allocate > > new mbufs from this memory regions being freed. Hence, the lookups for > > these MRs in caches should not occur. > > For your first point that, application can take charge of preventing MR freed > memory being allocated to data path. 
> > Does it means that If there is an emergency of MR fragment, such as hotplug, > the application must inform thedata path in advance, and this memory will > not be allocated, and then the control path will free this memory? If > application can do like this, I agree that this bug cannot happen. Actually, this is the only correct way for an application to operate. Let's suppose we have some memory area that the application wants to free. ALL references to this area must be removed. If we have some mbufs allocated from this area, it means that we have a memory pool created there. What the application should do: - notify all its components/agents that the memory area is going to be freed - all components/agents free the mbufs they might own - the PMD might not support freeing for some mbufs (for example, those being sent and awaiting completion), so the app should just wait - wait till all mbufs are returned to the memory pool (by monitoring available obj == pool size) Otherwise it is dangerous to free the memory. If some mbufs are still allocated, it is unsafe regardless of buf address to MR translation. We just can't free the memory - the mapping will be destroyed and might cause a segmentation fault in SW or some HW issues on DMA access to unmapped memory. It is a very generic safety approach - do not free memory that is still in use. Hence, at the moment of freeing and unregistering the MR, there MUST BE NO mbufs in flight referencing the addresses being freed. No translation of an MR being invalidated can happen. > > > For other side, the cache flush has negative effect - the local cache > > is getting empty and can't provide translation for other valid (not > > being removed) MRs, and the translation has to look up in the global > > cache, that is locked now for rebuilding, this causes the delays in datapatch > on acquiring global cache lock. > > So, I see some potential performance impact. > > If above assumption is true, we can go to your second point. 
I think this is a > problem of the tradeoff between cache coherence and performance. > > I can understand your meaning that though global cache has been changed, > we should keep the valid MR in local cache as long as possible to ensure the > fast searching speed. > In the meanwhile, the local cache can be rebuilt later to reduce its waiting > time for acquiring the global cache lock. > > However, this mechanism just ensures the performance unchanged for the > first few mbufs. > During the next mbufs lkey searching after 'dev_gen' updated, it is still > necessary to update the local cache. And the performance can firstly reduce > and then returns. Thus, no matter whether there is this patch or not, the > performance will jitter in a certain period of time. The local cache should be updated to remove MRs that are no longer valid. But we just flush the entire cache. Let's suppose we have valid MR0, MR1, and invalid MRX in the local cache, and there is traffic in the datapath for MR0 and MR1, but no traffic for MRX anymore. 1) If we do as you propose: a) take a lock b) request local cache flush first - all MR0, MR1, MRX will be removed on translation in the datapath c) update the global cache d) free the lock All the traffic for valid MR0, MR1 will ALWAYS be blocked on the lock taken for cache update, from point b) till point d). 2) If we do as it is implemented now: a) take a lock b) update the global cache c) request local cache flush d) free the lock The traffic MIGHT be locked ONLY for MRs not present in the local cache (this does not happen for MR0 and MR1, and must not happen for MRX), and the probability should be low. And the lock can only happen from c) till d) - quite a short period of time. In summary, the difference between 1) and 2): Lock probability: - 1) lock ALWAYS happens for ANY MR translation after b), 2) lock MIGHT happen, for cache misses ONLY, after c) Lock duration: - 1) lock from b) till d), 2) lock from c) till d), which seems to be much shorter. 
> > Finally, in conclusion, I tend to think that the bottom layer can do more things > to ensure the correct execution of the program, which may have a negative > impact on the performance in a short time, but in the long run, the > performance will eventually come back. Furthermore, maybe we should pay > attention to the performance in the stable period, and try our best to ensure > the correctness of the program in case of emergencies. If we have some mbufs still allocated in memory being freed - there is nothing to say about correctness, it is totally incorrect. In my opinion, we should not think how to mitigate this incorrect behavior, we should not encourage application developers to follow the wrong approaches. With best regards, Slava > > Best Regards > Feifei > > > With best regards, > > Slava > > > > > -----Original Message----- > > > From: Feifei Wang <feifei.wang2@arm.com> > > > Sent: Thursday, March 18, 2021 9:19 > > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > > <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>; > > > Yongseok Koh <yskoh@mellanox.com> > > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>; > > > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com> > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > cache > > > > > > 'dev_gen' is a variable to inform other cores to flush their local > > > cache when global cache is rebuilt. > > > > > > However, if 'dev_gen' is updated after global cache is rebuilt, > > > other cores may load a wrong memory region lkey value from old local > cache. > > > > > > Timeslot main core worker core > > > 1 rebuild global cache > > > 2 load unchanged dev_gen > > > 3 update dev_gen > > > 4 look up old local cache > > > > > > From the example above, we can see that though global cache is > > > rebuilt, due to that dev_gen is not updated, the worker core may > > > look up old cache table and receive a wrong memory region lkey value. 
> > > > > > To fix this, updating 'dev_gen' should be moved before rebuilding > > > global cache to inform worker cores to flush their local cache when > > > global cache start rebuilding. And wmb can ensure the sequence of > > > this > > process. > > > > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region support") > > > Cc: stable@dpdk.org > > > > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > --- > > > drivers/net/mlx5/mlx5_mr.c | 37 > > > +++++++++++++++++-------------------- > > > 1 file changed, 17 insertions(+), 20 deletions(-) > > > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c > > > index > > > da4e91fc2..7ce1d3e64 100644 > > > --- a/drivers/net/mlx5/mlx5_mr.c > > > +++ b/drivers/net/mlx5/mlx5_mr.c > > > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct > > > mlx5_dev_ctx_shared *sh, > > > rebuild = 1; > > > } > > > if (rebuild) { > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > + ++sh->share_cache.dev_gen; > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > + sh->share_cache.dev_gen); > > > + > > > /* > > > * Flush local caches by propagating invalidation across cores. > > > - * rte_smp_wmb() is enough to synchronize this event. If > > > one of > > > - * freed memsegs is seen by other core, that means the > > > memseg > > > - * has been allocated by allocator, which will come after this > > > - * free call. Therefore, this store instruction (incrementing > > > - * generation below) will be guaranteed to be seen by other > > > core > > > - * before the core sees the newly allocated memory. > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > updated before > > > + * rebuilding global cache. Therefore, other core can flush > > > their > > > + * local cache on time. 
> > > */ > > > - ++sh->share_cache.dev_gen; > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > - sh->share_cache.dev_gen); > > > rte_smp_wmb(); > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > } > > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > > } > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device > *pdev, > > void > > > *addr, > > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > > > DEBUG("port %u remove MR(%p) from list", dev->data->port_id, > > > (void *)mr); > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > + > > > + ++sh->share_cache.dev_gen; > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > + sh->share_cache.dev_gen); > > > + > > > /* > > > * Flush local caches by propagating invalidation across cores. > > > - * rte_smp_wmb() is enough to synchronize this event. If one of > > > - * freed memsegs is seen by other core, that means the memseg > > > - * has been allocated by allocator, which will come after this > > > - * free call. Therefore, this store instruction (incrementing > > > - * generation below) will be guaranteed to be seen by other core > > > - * before the core sees the newly allocated memory. > > > + * rte_smp_wmb() is to keep the order that dev_gen updated > > > before > > > + * rebuilding global cache. Therefore, other core can flush their > > > + * local cache on time. > > > */ > > > - ++sh->share_cache.dev_gen; > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > - sh->share_cache.dev_gen); > > > rte_smp_wmb(); > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > > > return 0; > > > } > > > -- > > > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-04-19 18:50 ` [dpdk-dev] " Slava Ovsiienko @ 2021-04-20 5:53 ` Feifei Wang 2021-04-20 7:29 ` Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-04-20 5:53 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd, nd Hi, Slava Thanks very much for your explanation. I can understand that the app can wait until all mbufs are returned to the memory pool, and then it can free this memory, I agree with this. As a result, I will remove the bug fix patch from this series and just replace the smp barrier with a C11 thread fence. Thanks very much for your patient explanation again. Best Regards Feifei > -----Original Message----- > From: Slava Ovsiienko <viacheslavo@nvidia.com> > Sent: April 20, 2021 2:51 > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > Hi, Feifei > > Please, see below > > .... > > > > Hi, Feifei > > > > > > Sorry, I do not follow what this patch fixes. Do we have some > > > issue/bug with MR cache in practice? > > > > This patch fixes the bug which is based on logical deduction, and it > > doesn't actually happen. > > > > > > > > Each Tx queue has its own dedicated "local" cache for MRs to convert > > > buffer address in mbufs being transmitted to LKeys (HW-related > > > entity > > > handle) and the "global" cache for all MR registered on the device. 
> > > > > > AFAIK, how conversion happens in datapath: > > > - check the local queue cache flush request > > > - lookup in local cache > > > - if not found: > > > - acquire lock for global cache read access > > > - lookup in global cache > > > - release lock for global cache > > > > > > How cache update on memory freeing/unregistering happens: > > > - acquire lock for global cache write access > > > - [a] remove relevant MRs from the global cache > > > - [b] set local caches flush request > > > - free global cache lock > > > > > > If I understand correctly, your patch swaps [a] and [b], and local > > > caches flush is requested earlier. What problem does it solve? > > > It is not supposed there are in datapath some mbufs referencing to > > > the memory being freed. Application must ensure this and must not > > > allocate new mbufs from this memory regions being freed. Hence, the > > > lookups for these MRs in caches should not occur. > > > > For your first point that, application can take charge of preventing > > MR freed memory being allocated to data path. > > > > Does it means that If there is an emergency of MR fragment, such as > > hotplug, the application must inform thedata path in advance, and this > > memory will not be allocated, and then the control path will free this > > memory? If application can do like this, I agree that this bug cannot happen. > > Actually, this is the only correct way for application to operate. > Let's suppose we have some memory area that application wants to free. ALL > references to this area must be removed. If we have some mbufs allocated > from this area, it means that we have memory pool created there. 
> > What application should do: > - notify all its components/agents the memory area is going to be freed > - all components/agents free the mbufs they might own > - PMD might not support freeing for some mbufs (for example being sent > and awaiting for completion), so app should just wait > - wait till all mbufs are returned to the memory pool (by monitoring available > obj == pool size) > > Otherwise - it is dangerous to free the memory. There are just some mbufs > still allocated, it is regardless to buf address to MR translation. We just can't > free the memory - the mapping will be destroyed and might cause the > segmentation fault by SW or some HW issues on DMA access to unmapped > memory. It is very generic safety approach - do not free the memory that is > still in use. Hence, at the moment of freeing and unregistering the MR, there > MUST BE NO any mbufs in flight referencing to the addresses being freed. > No translation to MR being invalidated can happen. > > > > > > For other side, the cache flush has negative effect - the local > > > cache is getting empty and can't provide translation for other valid > > > (not being removed) MRs, and the translation has to look up in the > > > global cache, that is locked now for rebuilding, this causes the > > > delays in datapatch > > on acquiring global cache lock. > > > So, I see some potential performance impact. > > > > If above assumption is true, we can go to your second point. I think > > this is a problem of the tradeoff between cache coherence and > performance. > > > > I can understand your meaning that though global cache has been > > changed, we should keep the valid MR in local cache as long as > > possible to ensure the fast searching speed. > > In the meanwhile, the local cache can be rebuilt later to reduce its > > waiting time for acquiring the global cache lock. > > > > However, this mechanism just ensures the performance unchanged for > > the first few mbufs. 
> > During the next mbufs lkey searching after 'dev_gen' updated, it is > > still necessary to update the local cache. And the performance can > > firstly reduce and then returns. Thus, no matter whether there is this > > patch or not, the performance will jitter in a certain period of time. > > Local cache should be updated to remove MRs no longer valid. But we just > flush the entire cache. > Let's suppose we have valid MR0, MR1, and not valid MRX in local cache. > And there are traffic in the datapath for MR0 and MR1, and no traffic for MRX > anymore. > > 1) If we do as you propose: > a) take a lock > b) request flush local cache first - all MR0, MR1, MRX will be removed on > translation in datapath > c) update global cache, > d) free lock > All the traffic for valid MR0, MR1 ALWAYS will be blocked on lock taken for > cache update since point b) till point d). > > 2) If we do as it is implemented now: > a) take a lock > b) update global cache > c) request flush local cache > d) free lock > The traffic MIGHT be locked ONLY for MRs non-existing in local cache (not > happens for MR0 and MR1, must not happen for MRX), and probability > should be minor. And lock might happen since c) till d) - quite short period of > time > > Summary, the difference between 1) and 2) > > Lock probability: > - 1) lock ALWAYS happen for ANY MR translation after b), > 2) lock MIGHT happen, for cache miss ONLY, after c) > > Lock duration: > - 1) lock since b) till d), > 2) lock since c) till d), that seems to be much shorter. > > > > > Finally, in conclusion, I tend to think that the bottom layer can do > > more things to ensure the correct execution of the program, which may > > have a negative impact on the performance in a short time, but in the > > long run, the performance will eventually come back. Furthermore, > > maybe we should pay attention to the performance in the stable period, > > and try our best to ensure the correctness of the program in case of > emergencies. 
> > If we have some mbufs still allocated in memory being freed - there is > nothing to say about correctness, it is totally incorrect. In my opinion, we > should not think how to mitigate this incorrect behavior, we should not > encourage application developers to follow the wrong approaches. > > With best regards, > Slava > > > > > Best Regards > > Feifei > > > > > With best regards, > > > Slava > > > > > > > -----Original Message----- > > > > From: Feifei Wang <feifei.wang2@arm.com> > > > > Sent: Thursday, March 18, 2021 9:19 > > > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > > > <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>; > > > > Yongseok Koh <yskoh@mellanox.com> > > > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang > <feifei.wang2@arm.com>; > > > > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com> > > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > Region cache > > > > > > > > 'dev_gen' is a variable to inform other cores to flush their local > > > > cache when global cache is rebuilt. > > > > > > > > However, if 'dev_gen' is updated after global cache is rebuilt, > > > > other cores may load a wrong memory region lkey value from old > > > > local > > cache. > > > > > > > > Timeslot main core worker core > > > > 1 rebuild global cache > > > > 2 load unchanged dev_gen > > > > 3 update dev_gen > > > > 4 look up old local cache > > > > > > > > From the example above, we can see that though global cache is > > > > rebuilt, due to that dev_gen is not updated, the worker core may > > > > look up old cache table and receive a wrong memory region lkey value. > > > > > > > > To fix this, updating 'dev_gen' should be moved before rebuilding > > > > global cache to inform worker cores to flush their local cache > > > > when global cache start rebuilding. And wmb can ensure the > > > > sequence of this > > > process. 
> > > > > > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region support") > > > > Cc: stable@dpdk.org > > > > > > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > --- > > > > drivers/net/mlx5/mlx5_mr.c | 37 > > > > +++++++++++++++++-------------------- > > > > 1 file changed, 17 insertions(+), 20 deletions(-) > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c > > > > b/drivers/net/mlx5/mlx5_mr.c index > > > > da4e91fc2..7ce1d3e64 100644 > > > > --- a/drivers/net/mlx5/mlx5_mr.c > > > > +++ b/drivers/net/mlx5/mlx5_mr.c > > > > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct > > > > mlx5_dev_ctx_shared *sh, > > > > rebuild = 1; > > > > } > > > > if (rebuild) { > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > + ++sh->share_cache.dev_gen; > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > + sh->share_cache.dev_gen); > > > > + > > > > /* > > > > * Flush local caches by propagating invalidation across cores. > > > > - * rte_smp_wmb() is enough to synchronize this event. If > > > > one of > > > > - * freed memsegs is seen by other core, that means the > > > > memseg > > > > - * has been allocated by allocator, which will come after this > > > > - * free call. Therefore, this store instruction (incrementing > > > > - * generation below) will be guaranteed to be seen by other > > > > core > > > > - * before the core sees the newly allocated memory. > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > > updated before > > > > + * rebuilding global cache. Therefore, other core can flush > > > > their > > > > + * local cache on time. 
> > > > */ > > > > - ++sh->share_cache.dev_gen; > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > - sh->share_cache.dev_gen); > > > > rte_smp_wmb(); > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > } > > > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > > > } > > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device > > *pdev, > > > void > > > > *addr, > > > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > > > > DEBUG("port %u remove MR(%p) from list", dev->data->port_id, > > > > (void *)mr); > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > + > > > > + ++sh->share_cache.dev_gen; > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > + sh->share_cache.dev_gen); > > > > + > > > > /* > > > > * Flush local caches by propagating invalidation across cores. > > > > - * rte_smp_wmb() is enough to synchronize this event. If one of > > > > - * freed memsegs is seen by other core, that means the memseg > > > > - * has been allocated by allocator, which will come after this > > > > - * free call. Therefore, this store instruction (incrementing > > > > - * generation below) will be guaranteed to be seen by other core > > > > - * before the core sees the newly allocated memory. > > > > + * rte_smp_wmb() is to keep the order that dev_gen updated > > > > before > > > > + * rebuilding global cache. Therefore, other core can flush their > > > > + * local cache on time. > > > > */ > > > > - ++sh->share_cache.dev_gen; > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > - sh->share_cache.dev_gen); > > > > rte_smp_wmb(); > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > > > > return 0; > > > > } > > > > -- > > > > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-04-20 5:53 ` [dpdk-dev] Re: " Feifei Wang @ 2021-04-20 7:29 ` Feifei Wang 2021-04-20 7:53 ` [dpdk-dev] " Slava Ovsiienko 0 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-04-20 7:29 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd, nd, nd Hi, Slava Another question suddenly occurred to me: to keep the order that the global cache is rebuilt before "dev_gen" is updated, the wmb should be placed before updating "dev_gen" rather than after it. Otherwise, on out-of-order platforms, the current order cannot be kept. Thus, we should change the code as: a) rebuild global cache; b) rte_smp_wmb(); c) update dev_gen Best Regards Feifei > -----Original Message----- > From: Feifei Wang > Sent: April 20, 2021 13:54 > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com> > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > cache > > Hi, Slava > > Thanks very much for your explanation. > > I can understand that the app can wait until all mbufs are returned to the memory > pool, and then it can free this memory, I agree with this. > > As a result, I will remove the bug fix patch from this series and just replace > the smp barrier with a C11 thread fence. Thanks very much for your patient > explanation again. 
> > Best Regards > Feifei > > > -----邮件原件----- > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > 发送时间: 2021年4月20日 2:51 > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > cache > > > > Hi, Feifei > > > > Please, see below > > > > .... > > > > > > Hi, Feifei > > > > > > > > Sorry, I do not follow what this patch fixes. Do we have some > > > > issue/bug with MR cache in practice? > > > > > > This patch fixes the bug which is based on logical deduction, and it > > > doesn't actually happen. > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache for MRs to > > > > convert buffer address in mbufs being transmitted to LKeys > > > > (HW-related entity > > > > handle) and the "global" cache for all MR registered on the device. > > > > > > > > AFAIK, how conversion happens in datapath: > > > > - check the local queue cache flush request > > > > - lookup in local cache > > > > - if not found: > > > > - acquire lock for global cache read access > > > > - lookup in global cache > > > > - release lock for global cache > > > > > > > > How cache update on memory freeing/unregistering happens: > > > > - acquire lock for global cache write access > > > > - [a] remove relevant MRs from the global cache > > > > - [b] set local caches flush request > > > > - free global cache lock > > > > > > > > If I understand correctly, your patch swaps [a] and [b], and local > > > > caches flush is requested earlier. What problem does it solve? > > > > It is not supposed there are in datapath some mbufs referencing to > > > > the memory being freed. Application must ensure this and must not > > > > allocate new mbufs from this memory regions being freed. Hence, > > > > the lookups for these MRs in caches should not occur. 
> > > > > > For your first point that, application can take charge of preventing > > > MR freed memory being allocated to data path. > > > > > > Does it means that If there is an emergency of MR fragment, such as > > > hotplug, the application must inform thedata path in advance, and > > > this memory will not be allocated, and then the control path will > > > free this memory? If application can do like this, I agree that this bug > cannot happen. > > > > Actually, this is the only correct way for application to operate. > > Let's suppose we have some memory area that application wants to free. > > ALL references to this area must be removed. If we have some mbufs > > allocated from this area, it means that we have memory pool created there. > > > > What application should do: > > - notify all its components/agents the memory area is going to be > > freed > > - all components/agents free the mbufs they might own > > - PMD might not support freeing for some mbufs (for example being sent > > and awaiting for completion), so app should just wait > > - wait till all mbufs are returned to the memory pool (by monitoring > > available obj == pool size) > > > > Otherwise - it is dangerous to free the memory. There are just some > > mbufs still allocated, it is regardless to buf address to MR > > translation. We just can't free the memory - the mapping will be > > destroyed and might cause the segmentation fault by SW or some HW > > issues on DMA access to unmapped memory. It is very generic safety > > approach - do not free the memory that is still in use. Hence, at the > > moment of freeing and unregistering the MR, there MUST BE NO any > mbufs in flight referencing to the addresses being freed. > > No translation to MR being invalidated can happen. 
> > > > > > > > > For other side, the cache flush has negative effect - the local > > > > cache is getting empty and can't provide translation for other > > > > valid (not being removed) MRs, and the translation has to look up > > > > in the global cache, that is locked now for rebuilding, this > > > > causes the delays in datapatch > > > on acquiring global cache lock. > > > > So, I see some potential performance impact. > > > > > > If above assumption is true, we can go to your second point. I think > > > this is a problem of the tradeoff between cache coherence and > > performance. > > > > > > I can understand your meaning that though global cache has been > > > changed, we should keep the valid MR in local cache as long as > > > possible to ensure the fast searching speed. > > > In the meanwhile, the local cache can be rebuilt later to reduce its > > > waiting time for acquiring the global cache lock. > > > > > > However, this mechanism just ensures the performance unchanged for > > > the first few mbufs. > > > During the next mbufs lkey searching after 'dev_gen' updated, it is > > > still necessary to update the local cache. And the performance can > > > firstly reduce and then returns. Thus, no matter whether there is > > > this patch or not, the performance will jitter in a certain period of time. > > > > Local cache should be updated to remove MRs no longer valid. But we > > just flush the entire cache. > > Let's suppose we have valid MR0, MR1, and not valid MRX in local cache. > > And there are traffic in the datapath for MR0 and MR1, and no traffic > > for MRX anymore. > > > > 1) If we do as you propose: > > a) take a lock > > b) request flush local cache first - all MR0, MR1, MRX will be removed > > on translation in datapath > > c) update global cache, > > d) free lock > > All the traffic for valid MR0, MR1 ALWAYS will be blocked on lock > > taken for cache update since point b) till point d). 
> > > > 2) If we do as it is implemented now: > > a) take a lock > > b) update global cache > > c) request flush local cache > > d) free lock > > The traffic MIGHT be locked ONLY for MRs non-existing in local cache > > (not happens for MR0 and MR1, must not happen for MRX), and > > probability should be minor. And lock might happen since c) till d) - > > quite short period of time > > > > Summary, the difference between 1) and 2) > > > > Lock probability: > > - 1) lock ALWAYS happen for ANY MR translation after b), > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > Lock duration: > > - 1) lock since b) till d), > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > Finally, in conclusion, I tend to think that the bottom layer can do > > > more things to ensure the correct execution of the program, which > > > may have a negative impact on the performance in a short time, but > > > in the long run, the performance will eventually come back. > > > Furthermore, maybe we should pay attention to the performance in the > > > stable period, and try our best to ensure the correctness of the > > > program in case of > > emergencies. > > > > If we have some mbufs still allocated in memory being freed - there is > > nothing to say about correctness, it is totally incorrect. In my > > opinion, we should not think how to mitigate this incorrect behavior, > > we should not encourage application developers to follow the wrong > approaches. 
> > > > With best regards, > > Slava > > > > > > > > Best Regards > > > Feifei > > > > > > > With best regards, > > > > Slava > > > > > > > > > -----Original Message----- > > > > > From: Feifei Wang <feifei.wang2@arm.com> > > > > > Sent: Thursday, March 18, 2021 9:19 > > > > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>; > > > > > Yongseok Koh <yskoh@mellanox.com> > > > > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang > > <feifei.wang2@arm.com>; > > > > > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com> > > > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > Region cache > > > > > > > > > > 'dev_gen' is a variable to inform other cores to flush their > > > > > local cache when global cache is rebuilt. > > > > > > > > > > However, if 'dev_gen' is updated after global cache is rebuilt, > > > > > other cores may load a wrong memory region lkey value from old > > > > > local > > > cache. > > > > > > > > > > Timeslot main core worker core > > > > > 1 rebuild global cache > > > > > 2 load unchanged dev_gen > > > > > 3 update dev_gen > > > > > 4 look up old local cache > > > > > > > > > > From the example above, we can see that though global cache is > > > > > rebuilt, due to that dev_gen is not updated, the worker core may > > > > > look up old cache table and receive a wrong memory region lkey value. > > > > > > > > > > To fix this, updating 'dev_gen' should be moved before > > > > > rebuilding global cache to inform worker cores to flush their > > > > > local cache when global cache start rebuilding. And wmb can > > > > > ensure the sequence of this > > > > process. 
> > > > > > > > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region support") > > > > > Cc: stable@dpdk.org > > > > > > > > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > --- > > > > > drivers/net/mlx5/mlx5_mr.c | 37 > > > > > +++++++++++++++++-------------------- > > > > > 1 file changed, 17 insertions(+), 20 deletions(-) > > > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c > > > > > b/drivers/net/mlx5/mlx5_mr.c index > > > > > da4e91fc2..7ce1d3e64 100644 > > > > > --- a/drivers/net/mlx5/mlx5_mr.c > > > > > +++ b/drivers/net/mlx5/mlx5_mr.c > > > > > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct > > > > > mlx5_dev_ctx_shared *sh, > > > > > rebuild = 1; > > > > > } > > > > > if (rebuild) { > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > + ++sh->share_cache.dev_gen; > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > + sh->share_cache.dev_gen); > > > > > + > > > > > /* > > > > > * Flush local caches by propagating invalidation > across cores. > > > > > - * rte_smp_wmb() is enough to synchronize this > event. If > > > > > one of > > > > > - * freed memsegs is seen by other core, that means > the > > > > > memseg > > > > > - * has been allocated by allocator, which will come > after this > > > > > - * free call. Therefore, this store instruction > (incrementing > > > > > - * generation below) will be guaranteed to be seen > by other > > > > > core > > > > > - * before the core sees the newly allocated memory. > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > > > updated before > > > > > + * rebuilding global cache. Therefore, other core can > flush > > > > > their > > > > > + * local cache on time. 
> > > > > */ > > > > > - ++sh->share_cache.dev_gen; > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > - sh->share_cache.dev_gen); > > > > > rte_smp_wmb(); > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > } > > > > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > > > > } > > > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device > > > *pdev, > > > > void > > > > > *addr, > > > > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > > > > > DEBUG("port %u remove MR(%p) from list", dev->data- > >port_id, > > > > > (void *)mr); > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > + > > > > > + ++sh->share_cache.dev_gen; > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > + sh->share_cache.dev_gen); > > > > > + > > > > > /* > > > > > * Flush local caches by propagating invalidation across cores. > > > > > - * rte_smp_wmb() is enough to synchronize this event. If > one of > > > > > - * freed memsegs is seen by other core, that means the > memseg > > > > > - * has been allocated by allocator, which will come after this > > > > > - * free call. Therefore, this store instruction (incrementing > > > > > - * generation below) will be guaranteed to be seen by other > core > > > > > - * before the core sees the newly allocated memory. > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > updated > > > > > before > > > > > + * rebuilding global cache. Therefore, other core can flush > their > > > > > + * local cache on time. > > > > > */ > > > > > - ++sh->share_cache.dev_gen; > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > - sh->share_cache.dev_gen); > > > > > rte_smp_wmb(); > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > > > > > return 0; > > > > > } > > > > > -- > > > > > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-04-20 7:29 ` Feifei Wang @ 2021-04-20 7:53 ` Slava Ovsiienko 2021-04-20 8:42 ` [dpdk-dev] 回复: " Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Slava Ovsiienko @ 2021-04-20 7:53 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd, nd, nd Hi, Feifei In my opinion, there should be 2 barriers: - after global cache update/before altering dev_gen, to ensure the correct order - after altering dev_gen to make this change visible for other agents and to trigger local cache update With best regards, Slava > -----Original Message----- > From: Feifei Wang <Feifei.Wang2@arm.com> > Sent: Tuesday, April 20, 2021 10:30 > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd > <nd@arm.com> > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > cache > > Hi, Slava > > Another question suddenly occurred to me, in order to keep the order that > rebuilding global cache before updating ”dev_gen“, the wmb should be > before updating "dev_gen" rather than after it. > Otherwise, in the out-of-order platforms, current order cannot be kept. > > Thus, we should change the code as: > a) rebuild global cache; > b) rte_smp_wmb(); > c) updating dev_gen > > Best Regards > Feifei > > -----邮件原件----- > > 发件人: Feifei Wang > > 发送时间: 2021年4月20日 13:54 > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com> > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > cache > > > > Hi, Slava > > > > Thanks very much for your explanation. 
> > > > I can understand the app can wait all mbufs are returned to the memory > > pool, and then it can free this mbufs, I agree with this. > > > > As a result, I will remove the bug fix patch from this series and just > > replace the smp barrier with C11 thread fence. Thanks very much for > > your patient explanation again. > > > > Best Regards > > Feifei > > > > > -----邮件原件----- > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > 发送时间: 2021年4月20日 2:51 > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > Wang > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > cache > > > > > > Hi, Feifei > > > > > > Please, see below > > > > > > .... > > > > > > > > Hi, Feifei > > > > > > > > > > Sorry, I do not follow what this patch fixes. Do we have some > > > > > issue/bug with MR cache in practice? > > > > > > > > This patch fixes the bug which is based on logical deduction, and > > > > it doesn't actually happen. > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache for MRs to > > > > > convert buffer address in mbufs being transmitted to LKeys > > > > > (HW-related entity > > > > > handle) and the "global" cache for all MR registered on the device. 
> > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > - check the local queue cache flush request > > > > > - lookup in local cache > > > > > - if not found: > > > > > - acquire lock for global cache read access > > > > > - lookup in global cache > > > > > - release lock for global cache > > > > > > > > > > How cache update on memory freeing/unregistering happens: > > > > > - acquire lock for global cache write access > > > > > - [a] remove relevant MRs from the global cache > > > > > - [b] set local caches flush request > > > > > - free global cache lock > > > > > > > > > > If I understand correctly, your patch swaps [a] and [b], and > > > > > local caches flush is requested earlier. What problem does it solve? > > > > > It is not supposed there are in datapath some mbufs referencing > > > > > to the memory being freed. Application must ensure this and must > > > > > not allocate new mbufs from this memory regions being freed. > > > > > Hence, the lookups for these MRs in caches should not occur. > > > > > > > > For your first point that, application can take charge of > > > > preventing MR freed memory being allocated to data path. > > > > > > > > Does it means that If there is an emergency of MR fragment, such > > > > as hotplug, the application must inform thedata path in advance, > > > > and this memory will not be allocated, and then the control path > > > > will free this memory? If application can do like this, I agree > > > > that this bug > > cannot happen. > > > > > > Actually, this is the only correct way for application to operate. > > > Let's suppose we have some memory area that application wants to free. > > > ALL references to this area must be removed. If we have some mbufs > > > allocated from this area, it means that we have memory pool created > there. 
> > > > > > What application should do: > > > - notify all its components/agents the memory area is going to be > > > freed > > > - all components/agents free the mbufs they might own > > > - PMD might not support freeing for some mbufs (for example being > > > sent and awaiting for completion), so app should just wait > > > - wait till all mbufs are returned to the memory pool (by monitoring > > > available obj == pool size) > > > > > > Otherwise - it is dangerous to free the memory. There are just some > > > mbufs still allocated, it is regardless to buf address to MR > > > translation. We just can't free the memory - the mapping will be > > > destroyed and might cause the segmentation fault by SW or some HW > > > issues on DMA access to unmapped memory. It is very generic safety > > > approach - do not free the memory that is still in use. Hence, at > > > the moment of freeing and unregistering the MR, there MUST BE NO any > > mbufs in flight referencing to the addresses being freed. > > > No translation to MR being invalidated can happen. > > > > > > > > > > > > For other side, the cache flush has negative effect - the local > > > > > cache is getting empty and can't provide translation for other > > > > > valid (not being removed) MRs, and the translation has to look > > > > > up in the global cache, that is locked now for rebuilding, this > > > > > causes the delays in datapatch > > > > on acquiring global cache lock. > > > > > So, I see some potential performance impact. > > > > > > > > If above assumption is true, we can go to your second point. I > > > > think this is a problem of the tradeoff between cache coherence > > > > and > > > performance. > > > > > > > > I can understand your meaning that though global cache has been > > > > changed, we should keep the valid MR in local cache as long as > > > > possible to ensure the fast searching speed. 
> > > > In the meanwhile, the local cache can be rebuilt later to reduce > > > > its waiting time for acquiring the global cache lock. > > > > > > > > However, this mechanism just ensures the performance unchanged > > > > for the first few mbufs. > > > > During the next mbufs lkey searching after 'dev_gen' updated, it > > > > is still necessary to update the local cache. And the performance > > > > can firstly reduce and then returns. Thus, no matter whether there > > > > is this patch or not, the performance will jitter in a certain period of > time. > > > > > > Local cache should be updated to remove MRs no longer valid. But we > > > just flush the entire cache. > > > Let's suppose we have valid MR0, MR1, and not valid MRX in local cache. > > > And there are traffic in the datapath for MR0 and MR1, and no > > > traffic for MRX anymore. > > > > > > 1) If we do as you propose: > > > a) take a lock > > > b) request flush local cache first - all MR0, MR1, MRX will be > > > removed on translation in datapath > > > c) update global cache, > > > d) free lock > > > All the traffic for valid MR0, MR1 ALWAYS will be blocked on lock > > > taken for cache update since point b) till point d). > > > > > > 2) If we do as it is implemented now: > > > a) take a lock > > > b) update global cache > > > c) request flush local cache > > > d) free lock > > > The traffic MIGHT be locked ONLY for MRs non-existing in local cache > > > (not happens for MR0 and MR1, must not happen for MRX), and > > > probability should be minor. And lock might happen since c) till d) > > > - quite short period of time > > > > > > Summary, the difference between 1) and 2) > > > > > > Lock probability: > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > > > Lock duration: > > > - 1) lock since b) till d), > > > 2) lock since c) till d), that seems to be much shorter. 
> > > > > > > > > > > Finally, in conclusion, I tend to think that the bottom layer can > > > > do more things to ensure the correct execution of the program, > > > > which may have a negative impact on the performance in a short > > > > time, but in the long run, the performance will eventually come back. > > > > Furthermore, maybe we should pay attention to the performance in > > > > the stable period, and try our best to ensure the correctness of > > > > the program in case of > > > emergencies. > > > > > > If we have some mbufs still allocated in memory being freed - there > > > is nothing to say about correctness, it is totally incorrect. In my > > > opinion, we should not think how to mitigate this incorrect > > > behavior, we should not encourage application developers to follow > > > the wrong > > approaches. > > > > > > With best regards, > > > Slava > > > > > > > > > > > Best Regards > > > > Feifei > > > > > > > > > With best regards, > > > > > Slava > > > > > > > > > > > -----Original Message----- > > > > > > From: Feifei Wang <feifei.wang2@arm.com> > > > > > > Sent: Thursday, March 18, 2021 9:19 > > > > > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > > <shahafs@nvidia.com>; Slava Ovsiienko > > > > > > <viacheslavo@nvidia.com>; Yongseok Koh <yskoh@mellanox.com> > > > > > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang > > > <feifei.wang2@arm.com>; > > > > > > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > > Region cache > > > > > > > > > > > > 'dev_gen' is a variable to inform other cores to flush their > > > > > > local cache when global cache is rebuilt. > > > > > > > > > > > > However, if 'dev_gen' is updated after global cache is > > > > > > rebuilt, other cores may load a wrong memory region lkey value > > > > > > from old local > > > > cache. 
> > > > > > > > > > > > Timeslot main core worker core > > > > > > 1 rebuild global cache > > > > > > 2 load unchanged dev_gen > > > > > > 3 update dev_gen > > > > > > 4 look up old local cache > > > > > > > > > > > > From the example above, we can see that though global cache is > > > > > > rebuilt, due to that dev_gen is not updated, the worker core > > > > > > may look up old cache table and receive a wrong memory region > lkey value. > > > > > > > > > > > > To fix this, updating 'dev_gen' should be moved before > > > > > > rebuilding global cache to inform worker cores to flush their > > > > > > local cache when global cache start rebuilding. And wmb can > > > > > > ensure the sequence of this > > > > > process. > > > > > > > > > > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region > > > > > > support") > > > > > > Cc: stable@dpdk.org > > > > > > > > > > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > --- > > > > > > drivers/net/mlx5/mlx5_mr.c | 37 > > > > > > +++++++++++++++++-------------------- > > > > > > 1 file changed, 17 insertions(+), 20 deletions(-) > > > > > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c > > > > > > b/drivers/net/mlx5/mlx5_mr.c index > > > > > > da4e91fc2..7ce1d3e64 100644 > > > > > > --- a/drivers/net/mlx5/mlx5_mr.c > > > > > > +++ b/drivers/net/mlx5/mlx5_mr.c > > > > > > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct > > > > > > mlx5_dev_ctx_shared *sh, > > > > > > rebuild = 1; > > > > > > } > > > > > > if (rebuild) { > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > + ++sh->share_cache.dev_gen; > > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > > + sh->share_cache.dev_gen); > > > > > > + > > > > > > /* > > > > > > * Flush local caches by propagating invalidation > > across cores. 
> > > > > > - * rte_smp_wmb() is enough to synchronize this > > event. If > > > > > > one of > > > > > > - * freed memsegs is seen by other core, that means > > the > > > > > > memseg > > > > > > - * has been allocated by allocator, which will come > > after this > > > > > > - * free call. Therefore, this store instruction > > (incrementing > > > > > > - * generation below) will be guaranteed to be seen > > by other > > > > > > core > > > > > > - * before the core sees the newly allocated memory. > > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > > > > updated before > > > > > > + * rebuilding global cache. Therefore, other core can > > flush > > > > > > their > > > > > > + * local cache on time. > > > > > > */ > > > > > > - ++sh->share_cache.dev_gen; > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > - sh->share_cache.dev_gen); > > > > > > rte_smp_wmb(); > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > } > > > > > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > > > > > } > > > > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device > > > > *pdev, > > > > > void > > > > > > *addr, > > > > > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > > > > > > DEBUG("port %u remove MR(%p) from list", dev->data- > > >port_id, > > > > > > (void *)mr); > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > + > > > > > > + ++sh->share_cache.dev_gen; > > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > > + sh->share_cache.dev_gen); > > > > > > + > > > > > > /* > > > > > > * Flush local caches by propagating invalidation across cores. > > > > > > - * rte_smp_wmb() is enough to synchronize this event. If > > one of > > > > > > - * freed memsegs is seen by other core, that means the > > memseg > > > > > > - * has been allocated by allocator, which will come after this > > > > > > - * free call. 
Therefore, this store instruction (incrementing > > > > > > - * generation below) will be guaranteed to be seen by other > > core > > > > > > - * before the core sees the newly allocated memory. > > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > updated > > > > > > before > > > > > > + * rebuilding global cache. Therefore, other core can flush > > their > > > > > > + * local cache on time. > > > > > > */ > > > > > > - ++sh->share_cache.dev_gen; > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > - sh->share_cache.dev_gen); > > > > > > rte_smp_wmb(); > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > > > > > > return 0; > > > > > > } > > > > > > -- > > > > > > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-04-20 7:53 ` [dpdk-dev] " Slava Ovsiienko @ 2021-04-20 8:42 ` Feifei Wang 2021-05-06 2:52 ` Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-04-20 8:42 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd Hi, Slava I think the second wmb can be removed. As far as I know, wmb is just a barrier that keeps the order between writes; it cannot tell the CPU when it should commit the changes. It is usually used before a guard variable, to keep the order that the guard variable is updated only after the changes you want to release have been done. For example, the wmb after the global cache update/before altering dev_gen can ensure the order that the global cache is updated before dev_gen is altered: 1) If another agent loads the changed "dev_gen", it can know the global cache has been updated. 2) If another agent loads the unchanged "dev_gen", it means the global cache has not been updated, and the local cache will not be flushed. As a result, we use the wmb and the guard variable "dev_gen" to ensure the global cache update is "visible". Here "visible" means that once the update of the guard variable "dev_gen" is known by other agents, they can also be sure the global cache has been updated in the meanwhile. 
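The wmb-plus-guard-variable pattern described in this message maps directly onto a C11 release store paired with an acquire load. A minimal sketch, with illustrative variable names and rte_smp_wmb() modeled as a release operation:

```c
#include <assert.h>
#include <stdatomic.h>

static unsigned int global_cache;   /* stands in for the rebuilt MR cache */
static atomic_uint dev_gen;         /* the guard variable */

/* Control path: finish the cache update first, then release-store the
 * guard, so no cache write can be reordered after the dev_gen bump. */
static void publish_rebuild(unsigned int new_cache)
{
    global_cache = new_cache;                        /* rebuild */
    atomic_fetch_add_explicit(&dev_gen, 1,
                              memory_order_release); /* wmb + store */
}

/* Worker core: acquire-load the guard. Only if it changed is the new
 * cache guaranteed visible; otherwise keep using the local cache. */
static int check_flush(unsigned int last_gen, unsigned int *cache_out)
{
    unsigned int gen = atomic_load_explicit(&dev_gen,
                                            memory_order_acquire);
    if (gen == last_gen)
        return 0;                   /* guard unchanged: no flush needed */
    *cache_out = global_cache;      /* ordered after the acquire load */
    return 1;
}
```

With this pairing, a single barrier on the writer side (before the guard store) is what makes the update visible; a second barrier after the store would not add ordering the reader can observe, which is the point being argued here.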
Best Regards Feifei > -----邮件原件----- > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > 发送时间: 2021年4月20日 15:54 > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd > <nd@arm.com> > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > Hi, Feifei > > In my opinion, there should be 2 barriers: > - after global cache update/before altering dev_gen, to ensure the correct > order > - after altering dev_gen to make this change visible for other agents and to > trigger local cache update > > With best regards, > Slava > > > -----Original Message----- > > From: Feifei Wang <Feifei.Wang2@arm.com> > > Sent: Tuesday, April 20, 2021 10:30 > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd > > <nd@arm.com> > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > Region cache > > > > Hi, Slava > > > > Another question suddenly occurred to me, in order to keep the order > > that rebuilding global cache before updating ”dev_gen“, the wmb should > > be before updating "dev_gen" rather than after it. > > Otherwise, in the out-of-order platforms, current order cannot be kept. 
> > > > Thus, we should change the code as: > > a) rebuild global cache; > > b) rte_smp_wmb(); > > c) updating dev_gen > > > > Best Regards > > Feifei > > > -----邮件原件----- > > > 发件人: Feifei Wang > > > 发送时间: 2021年4月20日 13:54 > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > Wang > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com> > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > cache > > > > > > Hi, Slava > > > > > > Thanks very much for your explanation. > > > > > > I can understand the app can wait all mbufs are returned to the > > > memory pool, and then it can free this mbufs, I agree with this. > > > > > > As a result, I will remove the bug fix patch from this series and > > > just replace the smp barrier with C11 thread fence. Thanks very much > > > for your patient explanation again. > > > > > > Best Regards > > > Feifei > > > > > > > -----邮件原件----- > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > 发送时间: 2021年4月20日 2:51 > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > Wang > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > > cache > > > > > > > > Hi, Feifei > > > > > > > > Please, see below > > > > > > > > .... > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. Do we have some > > > > > > issue/bug with MR cache in practice? > > > > > > > > > > This patch fixes the bug which is based on logical deduction, > > > > > and it doesn't actually happen. 
> > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache for MRs to > > > > > > convert buffer address in mbufs being transmitted to LKeys > > > > > > (HW-related entity > > > > > > handle) and the "global" cache for all MR registered on the device. > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > - check the local queue cache flush request > > > > > > - lookup in local cache > > > > > > - if not found: > > > > > > - acquire lock for global cache read access > > > > > > - lookup in global cache > > > > > > - release lock for global cache > > > > > > > > > > > > How cache update on memory freeing/unregistering happens: > > > > > > - acquire lock for global cache write access > > > > > > - [a] remove relevant MRs from the global cache > > > > > > - [b] set local caches flush request > > > > > > - free global cache lock > > > > > > > > > > > > If I understand correctly, your patch swaps [a] and [b], and > > > > > > local caches flush is requested earlier. What problem does it solve? > > > > > > It is not supposed there are in datapath some mbufs > > > > > > referencing to the memory being freed. Application must ensure > > > > > > this and must not allocate new mbufs from this memory regions > being freed. > > > > > > Hence, the lookups for these MRs in caches should not occur. > > > > > > > > > > For your first point that, application can take charge of > > > > > preventing MR freed memory being allocated to data path. > > > > > > > > > > Does it means that If there is an emergency of MR fragment, such > > > > > as hotplug, the application must inform thedata path in advance, > > > > > and this memory will not be allocated, and then the control path > > > > > will free this memory? If application can do like this, I agree > > > > > that this bug > > > cannot happen. > > > > > > > > Actually, this is the only correct way for application to operate. 
> > > > Let's suppose we have some memory area that application wants to > free. > > > > ALL references to this area must be removed. If we have some mbufs > > > > allocated from this area, it means that we have memory pool > > > > created > > there. > > > > > > > > What application should do: > > > > - notify all its components/agents the memory area is going to be > > > > freed > > > > - all components/agents free the mbufs they might own > > > > - PMD might not support freeing for some mbufs (for example being > > > > sent and awaiting for completion), so app should just wait > > > > - wait till all mbufs are returned to the memory pool (by > > > > monitoring available obj == pool size) > > > > > > > > Otherwise - it is dangerous to free the memory. There are just > > > > some mbufs still allocated, it is regardless to buf address to MR > > > > translation. We just can't free the memory - the mapping will be > > > > destroyed and might cause the segmentation fault by SW or some HW > > > > issues on DMA access to unmapped memory. It is very generic > > > > safety approach - do not free the memory that is still in use. > > > > Hence, at the moment of freeing and unregistering the MR, there > > > > MUST BE NO any > > > mbufs in flight referencing to the addresses being freed. > > > > No translation to MR being invalidated can happen. > > > > > > > > > > > > > > > For other side, the cache flush has negative effect - the > > > > > > local cache is getting empty and can't provide translation for > > > > > > other valid (not being removed) MRs, and the translation has > > > > > > to look up in the global cache, that is locked now for > > > > > > rebuilding, this causes the delays in datapatch > > > > > on acquiring global cache lock. > > > > > > So, I see some potential performance impact. > > > > > > > > > > If above assumption is true, we can go to your second point. 
I > > > > > think this is a problem of the tradeoff between cache coherence > > > > > and > > > > performance. > > > > > > > > > > I can understand your meaning that though global cache has been > > > > > changed, we should keep the valid MR in local cache as long as > > > > > possible to ensure the fast searching speed. > > > > > In the meanwhile, the local cache can be rebuilt later to reduce > > > > > its waiting time for acquiring the global cache lock. > > > > > > > > > > However, this mechanism just ensures the performance unchanged > > > > > for the first few mbufs. > > > > > During the next mbufs lkey searching after 'dev_gen' updated, it > > > > > is still necessary to update the local cache. And the > > > > > performance can firstly reduce and then returns. Thus, no matter > > > > > whether there is this patch or not, the performance will jitter > > > > > in a certain period of > > time. > > > > > > > > Local cache should be updated to remove MRs no longer valid. But > > > > we just flush the entire cache. > > > > Let's suppose we have valid MR0, MR1, and not valid MRX in local cache. > > > > And there are traffic in the datapath for MR0 and MR1, and no > > > > traffic for MRX anymore. > > > > > > > > 1) If we do as you propose: > > > > a) take a lock > > > > b) request flush local cache first - all MR0, MR1, MRX will be > > > > removed on translation in datapath > > > > c) update global cache, > > > > d) free lock > > > > All the traffic for valid MR0, MR1 ALWAYS will be blocked on lock > > > > taken for cache update since point b) till point d). > > > > > > > > 2) If we do as it is implemented now: > > > > a) take a lock > > > > b) update global cache > > > > c) request flush local cache > > > > d) free lock > > > > The traffic MIGHT be locked ONLY for MRs non-existing in local > > > > cache (not happens for MR0 and MR1, must not happen for MRX), and > > > > probability should be minor. 
And lock might happen since c) till > > > > d) > > > > - quite short period of time > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > Lock probability: > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > > > > > Lock duration: > > > > - 1) lock since b) till d), > > > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the bottom layer > > > > > can do more things to ensure the correct execution of the > > > > > program, which may have a negative impact on the performance in > > > > > a short time, but in the long run, the performance will eventually > come back. > > > > > Furthermore, maybe we should pay attention to the performance in > > > > > the stable period, and try our best to ensure the correctness of > > > > > the program in case of > > > > emergencies. > > > > > > > > If we have some mbufs still allocated in memory being freed - > > > > there is nothing to say about correctness, it is totally > > > > incorrect. In my opinion, we should not think how to mitigate this > > > > incorrect behavior, we should not encourage application developers > > > > to follow the wrong > > > approaches. 
> > > > > > > > With best regards, > > > > Slava > > > > > > > > > > > > > > Best Regards > > > > > Feifei > > > > > > > > > > > With best regards, > > > > > > Slava > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Feifei Wang <feifei.wang2@arm.com> > > > > > > > Sent: Thursday, March 18, 2021 9:19 > > > > > > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > > > <shahafs@nvidia.com>; Slava Ovsiienko > > > > > > > <viacheslavo@nvidia.com>; Yongseok Koh <yskoh@mellanox.com> > > > > > > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang > > > > <feifei.wang2@arm.com>; > > > > > > > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > > > Region cache > > > > > > > > > > > > > > 'dev_gen' is a variable to inform other cores to flush their > > > > > > > local cache when global cache is rebuilt. > > > > > > > > > > > > > > However, if 'dev_gen' is updated after global cache is > > > > > > > rebuilt, other cores may load a wrong memory region lkey > > > > > > > value from old local > > > > > cache. > > > > > > > > > > > > > > Timeslot main core worker core > > > > > > > 1 rebuild global cache > > > > > > > 2 load unchanged dev_gen > > > > > > > 3 update dev_gen > > > > > > > 4 look up old local cache > > > > > > > > > > > > > > From the example above, we can see that though global cache > > > > > > > is rebuilt, due to that dev_gen is not updated, the worker > > > > > > > core may look up old cache table and receive a wrong memory > > > > > > > region > > lkey value. > > > > > > > > > > > > > > To fix this, updating 'dev_gen' should be moved before > > > > > > > rebuilding global cache to inform worker cores to flush > > > > > > > their local cache when global cache start rebuilding. And > > > > > > > wmb can ensure the sequence of this > > > > > > process. 
> > > > > > > > > > > > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region > > > > > > > support") > > > > > > > Cc: stable@dpdk.org > > > > > > > > > > > > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > > > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > > --- > > > > > > > drivers/net/mlx5/mlx5_mr.c | 37 > > > > > > > +++++++++++++++++-------------------- > > > > > > > 1 file changed, 17 insertions(+), 20 deletions(-) > > > > > > > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c > > > > > > > b/drivers/net/mlx5/mlx5_mr.c index > > > > > > > da4e91fc2..7ce1d3e64 100644 > > > > > > > --- a/drivers/net/mlx5/mlx5_mr.c > > > > > > > +++ b/drivers/net/mlx5/mlx5_mr.c > > > > > > > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct > > > > > > > mlx5_dev_ctx_shared *sh, > > > > > > > rebuild = 1; > > > > > > > } > > > > > > > if (rebuild) { > > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > + ++sh->share_cache.dev_gen; > > > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > + sh->share_cache.dev_gen); > > > > > > > + > > > > > > > /* > > > > > > > * Flush local caches by propagating invalidation > > > across cores. > > > > > > > - * rte_smp_wmb() is enough to synchronize this > > > event. If > > > > > > > one of > > > > > > > - * freed memsegs is seen by other core, that means > > > the > > > > > > > memseg > > > > > > > - * has been allocated by allocator, which will come > > > after this > > > > > > > - * free call. Therefore, this store instruction > > > (incrementing > > > > > > > - * generation below) will be guaranteed to be seen > > > by other > > > > > > > core > > > > > > > - * before the core sees the newly allocated memory. > > > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > > > > > updated before > > > > > > > + * rebuilding global cache. 
Therefore, other core can > > > flush > > > > > > > their > > > > > > > + * local cache on time. > > > > > > > */ > > > > > > > - ++sh->share_cache.dev_gen; > > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > - sh->share_cache.dev_gen); > > > > > > > rte_smp_wmb(); > > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > } > > > > > > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > > > > > > } > > > > > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device > > > > > *pdev, > > > > > > void > > > > > > > *addr, > > > > > > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > > > > > > > DEBUG("port %u remove MR(%p) from list", dev->data- > > > >port_id, > > > > > > > (void *)mr); > > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > + > > > > > > > + ++sh->share_cache.dev_gen; > > > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > + sh->share_cache.dev_gen); > > > > > > > + > > > > > > > /* > > > > > > > * Flush local caches by propagating invalidation across cores. > > > > > > > - * rte_smp_wmb() is enough to synchronize this event. If > > > one of > > > > > > > - * freed memsegs is seen by other core, that means the > > > memseg > > > > > > > - * has been allocated by allocator, which will come after this > > > > > > > - * free call. Therefore, this store instruction (incrementing > > > > > > > - * generation below) will be guaranteed to be seen by other > > > core > > > > > > > - * before the core sees the newly allocated memory. > > > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > updated > > > > > > > before > > > > > > > + * rebuilding global cache. Therefore, other core can > > > > > > > +flush > > > their > > > > > > > + * local cache on time. 
> > > > > > > */ > > > > > > > - ++sh->share_cache.dev_gen; > > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > - sh->share_cache.dev_gen); > > > > > > > rte_smp_wmb(); > > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > > > > > > > return 0; > > > > > > > } > > > > > > > -- > > > > > > > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-04-20 8:42 ` [dpdk-dev] 回复: " Feifei Wang @ 2021-05-06 2:52 ` Feifei Wang 2021-05-06 11:21 ` [dpdk-dev] " Slava Ovsiienko 0 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-05-06 2:52 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd Hi, Slava Would you have more comments about this patch? From my point of view, only one wmb before updating "dev_gen" is enough for synchronization. Thanks very much for your attention. Best Regards Feifei > -----邮件原件----- > 发件人: Feifei Wang > 发送时间: 2021年4月20日 16:42 > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > cache > > Hi, Slava > > I think the second wmb can be removed. > As I know, wmb is just a barrier to keep the order between write and write. > and it cannot tell the CPU when it should commit the changes. > > It is usually used before guard variable to keep the order that updating guard > variable after some changes, which you want to release, have been done. > > For example, for the wmb after global cache update/before altering > dev_gen, it can ensure the order that updating global cache before altering > dev_gen: > 1)If other agent load the changed "dev_gen", it can know the global cache > has been updated. > 2)If other agents load the unchanged, "dev_gen", it means the global cache > has not been updated, and the local cache will not be flushed. > > As a result, we use wmb and guard variable "dev_gen" to ensure the global > cache updating is "visible". > The "visible" means when updating guard variable "dev_gen" is known by > other agents, they also can confirm global cache has been updated in the > meanwhile. 
Thus, just one wmb before altering dev_gen can ensure this. > > Best Regards > Feifei > > > -----Original Message----- > > From: Slava Ovsiienko <viacheslavo@nvidia.com> > > Sent: April 20, 2021 15:54 > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd > > <nd@arm.com> > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > cache > > > > Hi, Feifei > > > > In my opinion, there should be 2 barriers: > > - after global cache update/before altering dev_gen, to ensure the > > correct order > > - after altering dev_gen to make this change visible for other agents > > and to trigger local cache update > > > > With best regards, > > Slava > > > > > -----Original Message----- > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > Sent: Tuesday, April 20, 2021 10:30 > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd > > > <nd@arm.com> > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > Region cache > > > > > > Hi, Slava > > > > > > Another question suddenly occurred to me: in order to ensure that the > > > global cache is rebuilt before "dev_gen" is updated, the wmb > > > should be before updating "dev_gen" rather than after it. > > > Otherwise, on out-of-order platforms, the current order cannot be kept.
> > > > > > Thus, we should change the code as: > > > a) rebuild global cache; > > > b) rte_smp_wmb(); > > > c) update dev_gen > > > > > > Best Regards > > > Feifei > > > > -----Original Message----- > > > > From: Feifei Wang > > > > Sent: April 20, 2021 13:54 > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > Wang > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com> > > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > Region > > > > cache > > > > > > > > Hi, Slava > > > > > > > > Thanks very much for your explanation. > > > > > > > > I understand that the app can wait until all mbufs are returned to the > > > > memory pool, and only then free them; I agree with this. > > > > > > > > As a result, I will remove the bug fix patch from this series and > > > > just replace the smp barrier with a C11 thread fence. Thanks very > > > > much for your patient explanation again. > > > > > > > > Best Regards > > > > Feifei > > > > > > > > > -----Original Message----- > > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > Sent: April 20, 2021 2:51 > > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > Wang > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > Region cache > > > > > > > > > > Hi, Feifei > > > > > > > > > > Please, see below > > > > > > > > > > .... > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. Do we have > > > > > > > some issue/bug with MR cache in practice? > > > > > > > > > > > > This patch fixes a bug found by logical deduction; > > > > > > it has not actually been observed in practice.
> > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache for MRs to > > > > > > > convert buffer address in mbufs being transmitted to LKeys > > > > > > > (HW-related entity > > > > > > > handle) and the "global" cache for all MR registered on the device. > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > - check the local queue cache flush request > > > > > > > - lookup in local cache > > > > > > > - if not found: > > > > > > > - acquire lock for global cache read access > > > > > > > - lookup in global cache > > > > > > > - release lock for global cache > > > > > > > > > > > > > > How cache update on memory freeing/unregistering happens: > > > > > > > - acquire lock for global cache write access > > > > > > > - [a] remove relevant MRs from the global cache > > > > > > > - [b] set local caches flush request > > > > > > > - free global cache lock > > > > > > > > > > > > > > If I understand correctly, your patch swaps [a] and [b], and > > > > > > > local caches flush is requested earlier. What problem does it solve? > > > > > > > It is not supposed that there are mbufs in the datapath > > > > > > > referencing the memory being freed. Application must > > > > > > > ensure this and must not allocate new mbufs from these memory > > > > > > > regions > > being freed. > > > > > > > Hence, the lookups for these MRs in caches should not occur. > > > > > > > > > > > > Regarding your first point: the application can take charge of > > > > > > preventing freed MR memory from being allocated to the data path. > > > > > > > > > > > > Does it mean that if there is an urgent MR event, > > > > > > such as hotplug, the application must inform the data path in > > > > > > advance, and this memory will not be allocated, and then the > > > > > > control path will free this memory? If the application can do > > > > > > this, I agree that this bug > > > > cannot happen.
> > > > > > > > > > > Actually, this is the only correct way for application to operate. > > > > > Let's suppose we have some memory area that application wants to > > free. > > > > > ALL references to this area must be removed. If we have some > > > > > mbufs allocated from this area, it means that we have memory > > > > > pool created > > > there. > > > > > > > > > > What application should do: > > > > > - notify all its components/agents the memory area is going to > > > > > be freed > > > > > - all components/agents free the mbufs they might own > > > > > - PMD might not support freeing for some mbufs (for example > > > > > being sent and awaiting completion), so app should just wait > > > > > - wait till all mbufs are returned to the memory pool (by > > > > > monitoring available obj == pool size) > > > > > > > > > > Otherwise - it is dangerous to free the memory. If some > > > > > mbufs are still allocated, it is dangerous regardless of > > > > > buf-address-to-MR translation. We just can't free the memory - the mapping will > > > > > be destroyed and might cause a segmentation fault by SW or > > > > > some HW issues on DMA access to unmapped memory. It is a very > > > > > generic safety approach - do not free memory that is still in use. > > > > > Hence, at the moment of freeing and unregistering the MR, there > > > > > MUST BE NO > > > > mbufs in flight referencing the addresses being freed. > > > > > No translation to an MR being invalidated can happen. > > > > > > > > > > > > > > > > > On the other side, the cache flush has a negative effect - the > > > > > > > local cache is getting empty and can't provide translation > > > > > > > for other valid (not being removed) MRs, and the translation > > > > > > > has to look up in the global cache, that is locked now for > > > > > > > rebuilding; this causes delays in the datapath > > > > > > on acquiring the global cache lock. > > > > > > > So, I see some potential performance impact.
> > > > > > > > > > > > If the above assumption is true, we can go to your second point. I > > > > > > think this is a problem of the tradeoff between cache > > > > > > coherence and > > > > > performance. > > > > > > > > > > > > I can understand your meaning that though the global cache has > > > > > > been changed, we should keep the valid MRs in the local cache as > > > > > > long as possible to ensure fast searching speed. > > > > > > In the meanwhile, the local cache can be rebuilt later to > > > > > > reduce its waiting time for acquiring the global cache lock. > > > > > > > > > > > > However, this mechanism just keeps the performance > > > > > > unchanged for the first few mbufs. > > > > > > During the next mbufs' lkey searching after 'dev_gen' is updated, > > > > > > it is still necessary to update the local cache. And the > > > > > > performance will first drop and then recover. Thus, no > > > > > > matter whether there is this patch or not, the performance > > > > > > will jitter in a certain period of > > > time. > > > > > > > > > > Local cache should be updated to remove MRs no longer valid. But > > > > > we just flush the entire cache. > > > > > Let's suppose we have valid MR0, MR1, and not valid MRX in local cache. > > > > > And there is traffic in the datapath for MR0 and MR1, and no > > > > > traffic for MRX anymore. > > > > > > > > > > 1) If we do as you propose: > > > > > a) take a lock > > > > > b) request flush local cache first - all MR0, MR1, MRX will be > > > > > removed on translation in datapath > > > > > c) update global cache, > > > > > d) free lock > > > > > All the traffic for valid MR0, MR1 ALWAYS will be blocked on > > > > > lock taken for cache update since point b) till point d).
> > > > > > > > > > 2) If we do as it is implemented now: > > > > > a) take a lock > > > > > b) update global cache > > > > > c) request flush local cache > > > > > d) free lock > > > > > The traffic MIGHT be locked ONLY for MRs non-existing in local > > > > > cache (not happens for MR0 and MR1, must not happen for MRX), > > > > > and probability should be minor. And lock might happen since c) > > > > > till > > > > > d) > > > > > - quite short period of time > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > Lock probability: > > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > > > > > > > Lock duration: > > > > > - 1) lock since b) till d), > > > > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the bottom layer > > > > > > can do more things to ensure the correct execution of the > > > > > > program, which may have a negative impact on the performance > > > > > > in a short time, but in the long run, the performance will > > > > > > eventually > > come back. > > > > > > Furthermore, maybe we should pay attention to the performance > > > > > > in the stable period, and try our best to ensure the > > > > > > correctness of the program in case of > > > > > emergencies. > > > > > > > > > > If we have some mbufs still allocated in memory being freed - > > > > > there is nothing to say about correctness, it is totally > > > > > incorrect. In my opinion, we should not think how to mitigate > > > > > this incorrect behavior, we should not encourage application > > > > > developers to follow the wrong > > > > approaches. 
> > > > > > > > > > With best regards, > > > > > Slava > > > > > > > > > > > > > > > > > Best Regards > > > > > > Feifei > > > > > > > > > > > > > With best regards, > > > > > > > Slava > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Feifei Wang <feifei.wang2@arm.com> > > > > > > > > Sent: Thursday, March 18, 2021 9:19 > > > > > > > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > > > > <shahafs@nvidia.com>; Slava Ovsiienko > > > > > > > > <viacheslavo@nvidia.com>; Yongseok Koh > > > > > > > > <yskoh@mellanox.com> > > > > > > > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang > > > > > <feifei.wang2@arm.com>; > > > > > > > > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > 'dev_gen' is a variable to inform other cores to flush > > > > > > > > their local cache when global cache is rebuilt. > > > > > > > > > > > > > > > > However, if 'dev_gen' is updated after global cache is > > > > > > > > rebuilt, other cores may load a wrong memory region lkey > > > > > > > > value from old local > > > > > > cache. > > > > > > > > > > > > > > > > Timeslot main core worker core > > > > > > > > 1 rebuild global cache > > > > > > > > 2 load unchanged dev_gen > > > > > > > > 3 update dev_gen > > > > > > > > 4 look up old local cache > > > > > > > > > > > > > > > > From the example above, we can see that though global > > > > > > > > cache is rebuilt, due to that dev_gen is not updated, the > > > > > > > > worker core may look up old cache table and receive a > > > > > > > > wrong memory region > > > lkey value. > > > > > > > > > > > > > > > > To fix this, updating 'dev_gen' should be moved before > > > > > > > > rebuilding global cache to inform worker cores to flush > > > > > > > > their local cache when global cache start rebuilding. 
And > > > > > > > > wmb can ensure the sequence of this > > > > > > > process. > > > > > > > > > > > > > > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region > > > > > > > > support") > > > > > > > > Cc: stable@dpdk.org > > > > > > > > > > > > > > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > > > > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > > > --- > > > > > > > > drivers/net/mlx5/mlx5_mr.c | 37 > > > > > > > > +++++++++++++++++-------------------- > > > > > > > > 1 file changed, 17 insertions(+), 20 deletions(-) > > > > > > > > > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c > > > > > > > > b/drivers/net/mlx5/mlx5_mr.c index > > > > > > > > da4e91fc2..7ce1d3e64 100644 > > > > > > > > --- a/drivers/net/mlx5/mlx5_mr.c > > > > > > > > +++ b/drivers/net/mlx5/mlx5_mr.c > > > > > > > > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct > > > > > > > > mlx5_dev_ctx_shared *sh, > > > > > > > > rebuild = 1; > > > > > > > > } > > > > > > > > if (rebuild) { > > > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > + ++sh->share_cache.dev_gen; > > > > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > + sh->share_cache.dev_gen); > > > > > > > > + > > > > > > > > /* > > > > > > > > * Flush local caches by propagating invalidation > > > > across cores. > > > > > > > > - * rte_smp_wmb() is enough to synchronize this > > > > event. If > > > > > > > > one of > > > > > > > > - * freed memsegs is seen by other core, that means > > > > the > > > > > > > > memseg > > > > > > > > - * has been allocated by allocator, which will come > > > > after this > > > > > > > > - * free call. Therefore, this store instruction > > > > (incrementing > > > > > > > > - * generation below) will be guaranteed to be seen > > > > by other > > > > > > > > core > > > > > > > > - * before the core sees the newly allocated memory. 
> > > > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > > > > > > updated before > > > > > > > > + * rebuilding global cache. Therefore, other core can > > > > flush > > > > > > > > their > > > > > > > > + * local cache on time. > > > > > > > > */ > > > > > > > > - ++sh->share_cache.dev_gen; > > > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > - sh->share_cache.dev_gen); > > > > > > > > rte_smp_wmb(); > > > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > } > > > > > > > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > > > > > > > } > > > > > > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct > rte_pci_device > > > > > > *pdev, > > > > > > > void > > > > > > > > *addr, > > > > > > > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > > > > > > > > DEBUG("port %u remove MR(%p) from list", dev->data- > > > > >port_id, > > > > > > > > (void *)mr); > > > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > + > > > > > > > > + ++sh->share_cache.dev_gen; > > > > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > + sh->share_cache.dev_gen); > > > > > > > > + > > > > > > > > /* > > > > > > > > * Flush local caches by propagating invalidation across cores. > > > > > > > > - * rte_smp_wmb() is enough to synchronize this event. If > > > > one of > > > > > > > > - * freed memsegs is seen by other core, that means the > > > > memseg > > > > > > > > - * has been allocated by allocator, which will come after this > > > > > > > > - * free call. Therefore, this store instruction (incrementing > > > > > > > > - * generation below) will be guaranteed to be seen by other > > > > core > > > > > > > > - * before the core sees the newly allocated memory. > > > > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > > updated > > > > > > > > before > > > > > > > > + * rebuilding global cache. 
Therefore, other core can > > > > > > > > +flush > > > > their > > > > > > > > + * local cache on time. > > > > > > > > */ > > > > > > > > - ++sh->share_cache.dev_gen; > > > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > - sh->share_cache.dev_gen); > > > > > > > > rte_smp_wmb(); > > > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > > > > > > > > return 0; > > > > > > > > } > > > > > > > > -- > > > > > > > > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-05-06 2:52 ` Feifei Wang @ 2021-05-06 11:21 ` Slava Ovsiienko 2021-05-07 6:36 ` [dpdk-dev] 回复: " Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Slava Ovsiienko @ 2021-05-06 11:21 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler; +Cc: dev, nd, stable, Ruifeng Wang, nd Hi, Feifei Sorry, I do not follow why we should get rid of the last (after dev_gen update) wmb. We've rebuilt the global cache, we should notify other agents it's happened and they should flush local caches. So, dev_gen change should be made visible to other agents to trigger this activity and the second wmb is here to ensure this. One more point, due to registering new/destroying existing MR involves FW (via kernel) calls, it takes so many CPU cycles that we could neglect wmb overhead at all. Also, regarding this: > > Another question suddenly occurred to me, in order to keep the > > order that rebuilding global cache before updating ”dev_gen“, the > > wmb should be before updating "dev_gen" rather than after it. > > Otherwise, in the out-of-order platforms, current order cannot be kept. it is not clear why ordering is important - global cache update and dev_gen change happen under spinlock protection, so only the last wmb is meaningful. To summarize, in my opinion: - if you see some issue with ordering of global cache update/dev_gen signalling, could you, please, elaborate? 
I'm not sure we should maintain an order (due to spinlock protection) - the last rte_smp_wmb() after dev_gen incrementing should be kept intact With best regards, Slava > -----Original Message----- > From: Feifei Wang <Feifei.Wang2@arm.com> > Sent: Thursday, May 6, 2021 5:52 > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > cache > > Hi, Slava > > Would you have more comments about this patch? > From my point of view, only one wmb before updating "dev_gen" is enough to > synchronize. > > Thanks very much for your attention. > > > Best Regards > Feifei > > > -----Original Message----- > > From: Feifei Wang > > Sent: April 20, 2021 16:42 > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > cache > > > > Hi, Slava > > > > I think the second wmb can be removed. > > As far as I know, wmb is just a barrier that keeps the order between writes, > > and it cannot tell the CPU when it should commit the changes. > > > > It is usually used before a guard variable, to ensure that the changes > > you want to release are done before the guard variable is updated. > > > > For example, the wmb after the global cache update/before altering > > dev_gen ensures that the global cache is updated before > > dev_gen is altered: > > 1) If other agents load the changed "dev_gen", they know the global > > cache has been updated. > > 2) If other agents load the unchanged "dev_gen", it means the global > > cache has not been updated, and the local cache will not be flushed.
> > > > As a result, we use wmb and the guard variable "dev_gen" to make the > > global cache update "visible". > > "Visible" means that once the update of the guard variable "dev_gen" is seen > > by other agents, they can also be sure the global cache has been updated in > > the meanwhile. Thus, just one wmb before altering dev_gen can ensure > this. > > > > Best Regards > > Feifei > > > > > -----Original Message----- > > > From: Slava Ovsiienko <viacheslavo@nvidia.com> > > > Sent: April 20, 2021 15:54 > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > Wang > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd > > > <nd@arm.com> > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > cache > > > > > > Hi, Feifei > > > > > > In my opinion, there should be 2 barriers: > > > - after global cache update/before altering dev_gen, to ensure the > > > correct order > > > - after altering dev_gen to make this change visible for other > > > agents and to trigger local cache update > > > > > > With best regards, > > > Slava > > > > > > > -----Original Message----- > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > Sent: Tuesday, April 20, 2021 10:30 > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > Wang > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; > nd > > > > <nd@arm.com> > > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > Region cache > > > > > > > > Hi, Slava > > > > > > > > Another question suddenly occurred to me: in order to ensure that the > > > > global cache is rebuilt before "dev_gen" is updated, the > > > > wmb should be before updating "dev_gen" rather than after it.
> > > > Otherwise, on out-of-order platforms, the current order cannot be kept. > > > > > > > > Thus, we should change the code as: > > > > a) rebuild global cache; > > > > b) rte_smp_wmb(); > > > > c) update dev_gen > > > > > > > > Best Regards > > > > Feifei > > > > > -----Original Message----- > > > > > From: Feifei Wang > > > > > Sent: April 20, 2021 13:54 > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > Wang > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com> > > > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > Region > > > > > cache > > > > > > > > > > Hi, Slava > > > > > > > > > > Thanks very much for your explanation. > > > > > > > > > > I understand that the app can wait until all mbufs are returned to the > > > > > memory pool, and only then free them; I agree with this. > > > > > > > > > > As a result, I will remove the bug fix patch from this series > > > > > and just replace the smp barrier with a C11 thread fence. Thanks > > > > > very much for your patient explanation again. > > > > > > > > > > Best Regards > > > > > Feifei > > > > > > > > > > > -----Original Message----- > > > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > Sent: April 20, 2021 2:51 > > > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > > Wang > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > > Region cache > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > Please, see below > > > > > > > > > > > > .... > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes.
Do we have > > > > > > > > some issue/bug with MR cache in practice? > > > > > > > > > > > > > > This patch fixes the bug which is based on logical > > > > > > > deduction, and it doesn't actually happen. > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache for MRs > > > > > > > > to convert buffer address in mbufs being transmitted to > > > > > > > > LKeys (HW-related entity > > > > > > > > handle) and the "global" cache for all MR registered on the > device. > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > > - check the local queue cache flush request > > > > > > > > - lookup in local cache > > > > > > > > - if not found: > > > > > > > > - acquire lock for global cache read access > > > > > > > > - lookup in global cache > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > How cache update on memory freeing/unregistering happens: > > > > > > > > - acquire lock for global cache write access > > > > > > > > - [a] remove relevant MRs from the global cache > > > > > > > > - [b] set local caches flush request > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps [a] and [b], > > > > > > > > and local caches flush is requested earlier. What problem does it > solve? > > > > > > > > It is not supposed there are in datapath some mbufs > > > > > > > > referencing to the memory being freed. Application must > > > > > > > > ensure this and must not allocate new mbufs from this > > > > > > > > memory regions > > > being freed. > > > > > > > > Hence, the lookups for these MRs in caches should not occur. > > > > > > > > > > > > > > For your first point that, application can take charge of > > > > > > > preventing MR freed memory being allocated to data path. 
> > > > > > > > > > > > > > Does it mean that if there is an urgent MR event, > > > > > > > such as hotplug, the application must inform the data path in > > > > > > > advance, and this memory will not be allocated, and then the > > > > > > > control path will free this memory? If the application can do > > > > > > > this, I agree that this bug > > > > > cannot happen. > > > > > > > > > > > > Actually, this is the only correct way for application to operate. > > > > > > Let's suppose we have some memory area that application wants > > > > > > to > > > free. > > > > > > ALL references to this area must be removed. If we have some > > > > > > mbufs allocated from this area, it means that we have memory > > > > > > pool created > > > > there. > > > > > > > > > > > > What application should do: > > > > > > - notify all its components/agents the memory area is going to > > > > > > be freed > > > > > > - all components/agents free the mbufs they might own > > > > > > - PMD might not support freeing for some mbufs (for example > > > > > > being sent and awaiting completion), so app should just > > > > > > wait > > > > > > - wait till all mbufs are returned to the memory pool (by > > > > > > monitoring available obj == pool size) > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. If some > > > > > > mbufs are still allocated, it is dangerous regardless of > > > > > > buf-address-to-MR translation. We just can't free the memory - the mapping > > > > > > will be destroyed and might cause a segmentation fault by SW > > > > > > or some HW issues on DMA access to unmapped memory. It is a > > > > > > very generic safety approach - do not free memory that is still in > use. > > > > > > Hence, at the moment of freeing and unregistering the MR, > > > > > > there MUST BE NO > > > > > mbufs in flight referencing the addresses being freed. > > > > > > No translation to an MR being invalidated can happen.
> > > > > > > > > > > > > > > > > > > > On the other side, the cache flush has a negative effect - the > > > > > > > > local cache is getting empty and can't provide translation > > > > > > > > for other valid (not being removed) MRs, and the > > > > > > > > translation has to look up in the global cache, that is > > > > > > > > locked now for rebuilding; this causes delays in the > > > > > > > > datapath > > > > > > > on acquiring the global cache lock. > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > If the above assumption is true, we can go to your second point. > > > > > > > I think this is a problem of the tradeoff between cache > > > > > > > coherence and > > > > > > performance. > > > > > > > > > > > > > > I can understand your meaning that though the global cache has > > > > > > > been changed, we should keep the valid MRs in the local cache as > > > > > > > long as possible to ensure fast searching speed. > > > > > > > In the meanwhile, the local cache can be rebuilt later to > > > > > > > reduce its waiting time for acquiring the global cache lock. > > > > > > > > > > > > > > However, this mechanism just keeps the performance > > > > > > > unchanged for the first few mbufs. > > > > > > > During the next mbufs' lkey searching after 'dev_gen' is > > > > > > > updated, it is still necessary to update the local cache. > > > > > > > And the performance will first drop and then recover. > > > > > > > Thus, no matter whether there is this patch or not, the > > > > > > > performance will jitter in a certain period of > > > > time. > > > > > > > > > > > > Local cache should be updated to remove MRs no longer valid. > > > > > > But we just flush the entire cache. > > > > > > Let's suppose we have valid MR0, MR1, and not valid MRX in > > > > > > local > > cache. > > > > > > And there is traffic in the datapath for MR0 and MR1, and no > > > > > > traffic for MRX anymore.
> > > > > > > > > > > > 1) If we do as you propose: > > > > > > a) take a lock > > > > > > b) request flush local cache first - all MR0, MR1, MRX will be > > > > > > removed on translation in datapath > > > > > > c) update global cache, > > > > > > d) free lock > > > > > > All the traffic for valid MR0, MR1 ALWAYS will be blocked on > > > > > > lock taken for cache update since point b) till point d). > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > a) take a lock > > > > > > b) update global cache > > > > > > c) request flush local cache > > > > > > d) free lock > > > > > > The traffic MIGHT be locked ONLY for MRs non-existing in local > > > > > > cache (not happens for MR0 and MR1, must not happen for MRX), > > > > > > and probability should be minor. And lock might happen since > > > > > > c) till > > > > > > d) > > > > > > - quite short period of time > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > Lock probability: > > > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > > > > > > > > > Lock duration: > > > > > > - 1) lock since b) till d), > > > > > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the bottom > > > > > > > layer can do more things to ensure the correct execution of > > > > > > > the program, which may have a negative impact on the > > > > > > > performance in a short time, but in the long run, the > > > > > > > performance will eventually > > > come back. > > > > > > > Furthermore, maybe we should pay attention to the > > > > > > > performance in the stable period, and try our best to ensure > > > > > > > the correctness of the program in case of > > > > > > emergencies. 
> > > > > > > > > > > > If we have some mbufs still allocated in memory being freed - > > > > > > there is nothing to say about correctness, it is totally > > > > > > incorrect. In my opinion, we should not think how to mitigate > > > > > > this incorrect behavior, we should not encourage application > > > > > > developers to follow the wrong > > > > > approaches. > > > > > > > > > > > > With best regards, > > > > > > Slava > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > Feifei > > > > > > > > > > > > > > > With best regards, > > > > > > > > Slava > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: Feifei Wang <feifei.wang2@arm.com> > > > > > > > > > Sent: Thursday, March 18, 2021 9:19 > > > > > > > > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > <shahafs@nvidia.com>; Slava Ovsiienko > > > > > > > > > <viacheslavo@nvidia.com>; Yongseok Koh > > > > > > > > > <yskoh@mellanox.com> > > > > > > > > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang > > > > > > <feifei.wang2@arm.com>; > > > > > > > > > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > > > 'dev_gen' is a variable to inform other cores to flush > > > > > > > > > their local cache when global cache is rebuilt. > > > > > > > > > > > > > > > > > > However, if 'dev_gen' is updated after global cache is > > > > > > > > > rebuilt, other cores may load a wrong memory region lkey > > > > > > > > > value from old local > > > > > > > cache. 
> > > > > > > > > > > > > > > > > > Timeslot main core worker core > > > > > > > > > 1 rebuild global cache > > > > > > > > > 2 load unchanged dev_gen > > > > > > > > > 3 update dev_gen > > > > > > > > > 4 look up old local cache > > > > > > > > > > > > > > > > > > From the example above, we can see that though global > > > > > > > > > cache is rebuilt, due to that dev_gen is not updated, > > > > > > > > > the worker core may look up old cache table and receive > > > > > > > > > a wrong memory region > > > > lkey value. > > > > > > > > > > > > > > > > > > To fix this, updating 'dev_gen' should be moved before > > > > > > > > > rebuilding global cache to inform worker cores to flush > > > > > > > > > their local cache when global cache start rebuilding. > > > > > > > > > And wmb can ensure the sequence of this > > > > > > > > process. > > > > > > > > > > > > > > > > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region > > > > > > > > > support") > > > > > > > > > Cc: stable@dpdk.org > > > > > > > > > > > > > > > > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > > > > > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > > > > > > > > --- > > > > > > > > > drivers/net/mlx5/mlx5_mr.c | 37 > > > > > > > > > +++++++++++++++++-------------------- > > > > > > > > > 1 file changed, 17 insertions(+), 20 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c > > > > > > > > > b/drivers/net/mlx5/mlx5_mr.c index > > > > > > > > > da4e91fc2..7ce1d3e64 100644 > > > > > > > > > --- a/drivers/net/mlx5/mlx5_mr.c > > > > > > > > > +++ b/drivers/net/mlx5/mlx5_mr.c > > > > > > > > > @@ -103,20 +103,18 @@ > mlx5_mr_mem_event_free_cb(struct > > > > > > > > > mlx5_dev_ctx_shared *sh, > > > > > > > > > rebuild = 1; > > > > > > > > > } > > > > > > > > > if (rebuild) { > > > > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > > + 
++sh->share_cache.dev_gen; > > > > > > > > > + DEBUG("broadcasting local cache flush, > gen=%d", > > > > > > > > > + sh->share_cache.dev_gen); > > > > > > > > > + > > > > > > > > > /* > > > > > > > > > * Flush local caches by propagating invalidation > > > > > across cores. > > > > > > > > > - * rte_smp_wmb() is enough to synchronize this > > > > > event. If > > > > > > > > > one of > > > > > > > > > - * freed memsegs is seen by other core, that means > > > > > the > > > > > > > > > memseg > > > > > > > > > - * has been allocated by allocator, which will come > > > > > after this > > > > > > > > > - * free call. Therefore, this store instruction > > > > > (incrementing > > > > > > > > > - * generation below) will be guaranteed to be seen > > > > > by other > > > > > > > > > core > > > > > > > > > - * before the core sees the newly allocated memory. > > > > > > > > > + * rte_smp_wmb() is to keep the order that > dev_gen > > > > > > > > > updated before > > > > > > > > > + * rebuilding global cache. Therefore, other > core can > > > > > flush > > > > > > > > > their > > > > > > > > > + * local cache on time. 
> > > > > > > > > */ > > > > > > > > > - ++sh->share_cache.dev_gen; > > > > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > > - sh->share_cache.dev_gen); > > > > > > > > > rte_smp_wmb(); > > > > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > > } > > > > > > > > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > > > > > > > > } > > > > > > > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct > > rte_pci_device > > > > > > > *pdev, > > > > > > > > void > > > > > > > > > *addr, > > > > > > > > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > > > > > > > > > DEBUG("port %u remove MR(%p) from list", dev->data- > > > > > >port_id, > > > > > > > > > (void *)mr); > > > > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > > + > > > > > > > > > + ++sh->share_cache.dev_gen; > > > > > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > > + sh->share_cache.dev_gen); > > > > > > > > > + > > > > > > > > > /* > > > > > > > > > * Flush local caches by propagating invalidation across cores. > > > > > > > > > - * rte_smp_wmb() is enough to synchronize this event. If > > > > > one of > > > > > > > > > - * freed memsegs is seen by other core, that means the > > > > > memseg > > > > > > > > > - * has been allocated by allocator, which will come after this > > > > > > > > > - * free call. Therefore, this store instruction (incrementing > > > > > > > > > - * generation below) will be guaranteed to be seen by other > > > > > core > > > > > > > > > - * before the core sees the newly allocated memory. > > > > > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > > > updated > > > > > > > > > before > > > > > > > > > + * rebuilding global cache. Therefore, other core can > > > > > > > > > +flush > > > > > their > > > > > > > > > + * local cache on time. 
> > > > > > > > > */ > > > > > > > > > - ++sh->share_cache.dev_gen; > > > > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > > - sh->share_cache.dev_gen); > > > > > > > > > rte_smp_wmb(); > > > > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > > > > > > > > > return 0; > > > > > > > > > } > > > > > > > > > -- > > > > > > > > > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-05-06 11:21 ` [dpdk-dev] " Slava Ovsiienko @ 2021-05-07 6:36 ` Feifei Wang 2021-05-07 10:14 ` [dpdk-dev] " Slava Ovsiienko 0 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-05-07 6:36 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd Hi, Slava Thanks very much for your reply. > -----Original Message----- > From: Slava Ovsiienko <viacheslavo@nvidia.com> > Sent: May 6, 2021 19:22 > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > Hi, Feifei > > Sorry, I do not follow why we should get rid of the last (after dev_gen update) > wmb. > We've rebuilt the global cache, we should notify other agents it's happened > and they should flush local caches. So, dev_gen change should be made > visible to other agents to trigger this activity and the second wmb is here to > ensure this. 1. For the first question, why we should get rid of the last wmb and move it before the dev_gen update: I think the key point is how wmb implements the synchronization between multiple agents. Fig1 ----------------------------------------------------------------------------------------------------- Timeslot agent_1 agent_2 1 rebuild global cache 2 wmb 3 update dev_gen ----------------------- load changed dev_gen 4 rebuild local cache ----------------------------------------------------------------------------------------------------- First, wmb only constrains the local thread, keeping the order between its writes: based on the picture above, for agent_1, wmb keeps the order that rebuilding the global cache always happens before updating dev_gen. 
Second, agent_1 communicates with agent_2 by the global variable "dev_gen" : If agent_1 updates dev_gen, agent_2 will load it and then it knows it should rebuild local cache Finally, agent_2 rebuilds local cache according to whether agent_1 has rebuilt global cache, and agent_2 knows this information by the variable "dev_gen". Fig2 ----------------------------------------------------------------------------------------------------- Timeslot agent_1 agent_2 1 update dev_gen 2 load changed dev_gen 3 rebuild local cache 4 rebuild global cache 5 wmb ----------------------------------------------------------------------------------------------------- However, in arm platform, if wmb is after dev_gen updated, "dev_gen" may be updated before agent_1 rebuilding global cache, then agent_2 maybe receive error message and rebuild its local cache in advance. To summarize, it is not important which time other agents can see the changed global variable "dev_gen". (Actually, wmb after "dev_gen" cannot ensure changed "dev_gen" is committed to the global). It is more important that if other agents see the changed "dev_gen", they also can know global cache has been updated. > One more point, due to registering new/destroying existing MR involves FW > (via kernel) calls, it takes so many CPU cycles that we could neglect wmb > overhead at all. We just move the last wmb into the right place, and not delete it for performance. > > Also, regarding this: > > > > Another question suddenly occurred to me, in order to keep the > > > order that rebuilding global cache before updating ”dev_gen“, the > > wmb > should be before updating "dev_gen" rather than after it. > > > Otherwise, in the out-of-order platforms, current order cannot be kept. > > it is not clear why ordering is important - global cache update and dev_gen > change happen under spinlock protection, so only the last wmb is > meaningful. > 2. 
The second function of wmb before "dev_gen" updated is for performance according to our previous discussion. According to Fig2, if there is no wmb between "global cache updated" and "dev_gen updated", "dev_gen" may update before global cache updated. Then agent_2 may see the changed "dev_gen" and flush entire local cache in advance. This entire flush can degrade the performance: "the local cache is getting empty and can't provide translation for other valid (not being removed) MRs, and the translation has to look up in the global cache, that is locked now for rebuilding, this causes the delays in data path on acquiring global cache lock." Furthermore, spinlock is just for global cache, not for dev_gen and local cache. > To summarize, in my opinion: > - if you see some issue with ordering of global cache update/dev_gen > signalling, > could you, please, elaborate? I'm not sure we should maintain an order (due > to spinlock protection) > - the last rte_smp_wmb() after dev_gen incrementing should be kept intact > At last, for my view, there are two functions that moving wmb before "dev_gen" for the write-write order: -------------------------------- a) rebuild global cache; b) rte_smp_wmb(); c) updating dev_gen -------------------------------- 1. Achieve synchronization between multiple threads in the right way 2. Prevent other agents from flushing local cache early to ensure performance Best Regards Feifei > With best regards, > Slava > > > -----Original Message----- > > From: Feifei Wang <Feifei.Wang2@arm.com> > > Sent: Thursday, May 6, 2021 5:52 > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > Region cache > > > > Hi, Slava > > > > Would you have more comments about this patch? 
> > For my sight, only one wmb before "dev_gen" updating is enough to > > synchronize. > > > > Thanks very much for your attention. > > > > > > Best Regards > > Feifei > > > > > -----邮件原件----- > > > 发件人: Feifei Wang > > > 发送时间: 2021年4月20日 16:42 > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > Wang > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > cache > > > > > > Hi, Slava > > > > > > I think the second wmb can be removed. > > > As I know, wmb is just a barrier to keep the order between write and > write. > > > and it cannot tell the CPU when it should commit the changes. > > > > > > It is usually used before guard variable to keep the order that > > > updating guard variable after some changes, which you want to > > > release, > > have been done. > > > > > > For example, for the wmb after global cache update/before altering > > > dev_gen, it can ensure the order that updating global cache before > > > altering > > > dev_gen: > > > 1)If other agent load the changed "dev_gen", it can know the global > > > cache has been updated. > > > 2)If other agents load the unchanged, "dev_gen", it means the global > > > cache has not been updated, and the local cache will not be flushed. > > > > > > As a result, we use wmb and guard variable "dev_gen" to ensure the > > > global cache updating is "visible". > > > The "visible" means when updating guard variable "dev_gen" is known > > > by other agents, they also can confirm global cache has been updated > > > in the meanwhile. Thus, just one wmb before altering dev_gen can > > > ensure > > this. 
> > > > > > Best Regards > > > Feifei > > > > > > > -----邮件原件----- > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > 发送时间: 2021年4月20日 15:54 > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > Wang > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; > nd > > > > <nd@arm.com> > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > > cache > > > > > > > > Hi, Feifei > > > > > > > > In my opinion, there should be 2 barriers: > > > > - after global cache update/before altering dev_gen, to ensure > > > > the correct order > > > > - after altering dev_gen to make this change visible for other > > > > agents and to trigger local cache update > > > > > > > > With best regards, > > > > Slava > > > > > > > > > -----Original Message----- > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > Sent: Tuesday, April 20, 2021 10:30 > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > Wang > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; > > nd > > > > > <nd@arm.com> > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > Region cache > > > > > > > > > > Hi, Slava > > > > > > > > > > Another question suddenly occurred to me, in order to keep the > > > > > order that rebuilding global cache before updating ”dev_gen“, > > > > > the wmb should be before updating "dev_gen" rather than after it. > > > > > Otherwise, in the out-of-order platforms, current order cannot be > kept. 
> > > > > > > > > > Thus, we should change the code as: > > > > > a) rebuild global cache; > > > > > b) rte_smp_wmb(); > > > > > c) updating dev_gen > > > > > > > > > > Best Regards > > > > > Feifei > > > > > > -----邮件原件----- > > > > > > 发件人: Feifei Wang > > > > > > 发送时间: 2021年4月20日 13:54 > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > Ruifeng > > > > Wang > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com> > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > Region > > > > > > cache > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > Thanks very much for your explanation. > > > > > > > > > > > > I can understand the app can wait all mbufs are returned to > > > > > > the memory pool, and then it can free this mbufs, I agree with this. > > > > > > > > > > > > As a result, I will remove the bug fix patch from this series > > > > > > and just replace the smp barrier with C11 thread fence. Thanks > > > > > > very much for your patient explanation again. > > > > > > > > > > > > Best Regards > > > > > > Feifei > > > > > > > > > > > > > -----邮件原件----- > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > 发送时间: 2021年4月20日 2:51 > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > Ruifeng > > > > > Wang > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > > > Region cache > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > Please, see below > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. 
Do we have > > > > > > > > > some issue/bug with MR cache in practice? > > > > > > > > > > > > > > > > This patch fixes the bug which is based on logical > > > > > > > > deduction, and it doesn't actually happen. > > > > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache for > > > > > > > > > MRs to convert buffer address in mbufs being transmitted > > > > > > > > > to LKeys (HW-related entity > > > > > > > > > handle) and the "global" cache for all MR registered on > > > > > > > > > the > > device. > > > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > > > - check the local queue cache flush request > > > > > > > > > - lookup in local cache > > > > > > > > > - if not found: > > > > > > > > > - acquire lock for global cache read access > > > > > > > > > - lookup in global cache > > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > > > How cache update on memory freeing/unregistering happens: > > > > > > > > > - acquire lock for global cache write access > > > > > > > > > - [a] remove relevant MRs from the global cache > > > > > > > > > - [b] set local caches flush request > > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps [a] and [b], > > > > > > > > > and local caches flush is requested earlier. What > > > > > > > > > problem does it > > solve? > > > > > > > > > It is not supposed there are in datapath some mbufs > > > > > > > > > referencing to the memory being freed. Application must > > > > > > > > > ensure this and must not allocate new mbufs from this > > > > > > > > > memory regions > > > > being freed. > > > > > > > > > Hence, the lookups for these MRs in caches should not occur. > > > > > > > > > > > > > > > > For your first point that, application can take charge of > > > > > > > > preventing MR freed memory being allocated to data path. 
> > > > > > > > > > > > > > > > > Does it mean that if there is an emergency of MR > > > > > > > > fragment, such as hotplug, the application must inform > > > > > > > > the data path in advance, and this memory will not be > > > > > > > > allocated, and then the control path will free this > > > > > > > > memory? If the application can do this, I agree that > > > > > > > > this bug > > > > > > cannot happen. > > > > > > > > > > > > > > Actually, this is the only correct way for the application to operate. > > > > > > > Let's suppose we have some memory area that the application > > > > > > > wants to > > > > free. > > > > > > > ALL references to this area must be removed. If we have some > > > > > > > mbufs allocated from this area, it means that we have a memory > > > > > > > pool created > > > > > there. > > > > > > > > > > > > > > What the application should do: > > > > > > > - notify all its components/agents the memory area is going > > > > > > > to be freed > > > > > > > - all components/agents free the mbufs they might own > > > > > > > - PMD might not support freeing for some mbufs (for example > > > > > > > being sent and awaiting for completion), so app should just > > > > > > > wait > > > > > > > - wait till all mbufs are returned to the memory pool (by > > > > > > > monitoring available obj == pool size) > > > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. There are > > > > > > > just some mbufs still allocated, regardless of the buf > > > > > > > address to MR translation. We just can't free the memory - > > > > > > > the mapping will be destroyed and might cause a > > > > > > > segmentation fault by SW or some HW issues on DMA access to > > > > > > > unmapped memory. It is a very generic safety approach - do > > > > > > > not free the memory that is still in > > use. > > > > > > > Hence, at the moment of freeing and unregistering the MR, > > > > > > > there MUST BE NO > > > > > > mbufs in flight referencing the addresses being freed. 
> > > > > > > No translation to MR being invalidated can happen. > > > > > > > > > > > > > > > > > > > > > > > > For other side, the cache flush has negative effect - > > > > > > > > > the local cache is getting empty and can't provide > > > > > > > > > translation for other valid (not being removed) MRs, and > > > > > > > > > the translation has to look up in the global cache, that > > > > > > > > > is locked now for rebuilding, this causes the delays in > > > > > > > > > datapatch > > > > > > > > on acquiring global cache lock. > > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > > > If above assumption is true, we can go to your second point. > > > > > > > > I think this is a problem of the tradeoff between cache > > > > > > > > coherence and > > > > > > > performance. > > > > > > > > > > > > > > > > I can understand your meaning that though global cache has > > > > > > > > been changed, we should keep the valid MR in local cache > > > > > > > > as long as possible to ensure the fast searching speed. > > > > > > > > In the meanwhile, the local cache can be rebuilt later to > > > > > > > > reduce its waiting time for acquiring the global cache lock. > > > > > > > > > > > > > > > > However, this mechanism just ensures the performance > > > > > > > > unchanged for the first few mbufs. > > > > > > > > During the next mbufs lkey searching after 'dev_gen' > > > > > > > > updated, it is still necessary to update the local cache. > > > > > > > > And the performance can firstly reduce and then returns. > > > > > > > > Thus, no matter whether there is this patch or not, the > > > > > > > > performance will jitter in a certain period of > > > > > time. > > > > > > > > > > > > > > Local cache should be updated to remove MRs no longer valid. > > > > > > > But we just flush the entire cache. > > > > > > > Let's suppose we have valid MR0, MR1, and not valid MRX in > > > > > > > local > > > cache. 
> > > > > > > And there are traffic in the datapath for MR0 and MR1, and > > > > > > > no traffic for MRX anymore. > > > > > > > > > > > > > > 1) If we do as you propose: > > > > > > > a) take a lock > > > > > > > b) request flush local cache first - all MR0, MR1, MRX will > > > > > > > be removed on translation in datapath > > > > > > > c) update global cache, > > > > > > > d) free lock > > > > > > > All the traffic for valid MR0, MR1 ALWAYS will be blocked on > > > > > > > lock taken for cache update since point b) till point d). > > > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > > a) take a lock > > > > > > > b) update global cache > > > > > > > c) request flush local cache > > > > > > > d) free lock > > > > > > > The traffic MIGHT be locked ONLY for MRs non-existing in > > > > > > > local cache (not happens for MR0 and MR1, must not happen > > > > > > > for MRX), and probability should be minor. And lock might > > > > > > > happen since > > > > > > > c) till > > > > > > > d) > > > > > > > - quite short period of time > > > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > > > Lock probability: > > > > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > > > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > > > > > > > > > > > Lock duration: > > > > > > > - 1) lock since b) till d), > > > > > > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the bottom > > > > > > > > layer can do more things to ensure the correct execution > > > > > > > > of the program, which may have a negative impact on the > > > > > > > > performance in a short time, but in the long run, the > > > > > > > > performance will eventually > > > > come back. 
> > > > > > > > Furthermore, maybe we should pay attention to the > > > > > > > > performance in the stable period, and try our best to > > > > > > > > ensure the correctness of the program in case of > > > > > > > emergencies. > > > > > > > > > > > > > > If we have some mbufs still allocated in memory being freed > > > > > > > - there is nothing to say about correctness, it is totally > > > > > > > incorrect. In my opinion, we should not think how to > > > > > > > mitigate this incorrect behavior, we should not encourage > > > > > > > application developers to follow the wrong > > > > > > approaches. > > > > > > > > > > > > > > With best regards, > > > > > > > Slava > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > Feifei > > > > > > > > > > > > > > > > > With best regards, > > > > > > > > > Slava ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-05-07 6:36 ` [dpdk-dev] Re: " Feifei Wang @ 2021-05-07 10:14 ` Slava Ovsiienko 2021-05-08 3:13 ` [dpdk-dev] Re: " Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Slava Ovsiienko @ 2021-05-07 10:14 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler; +Cc: dev, nd, stable, Ruifeng Wang, nd Hi, Feifei We should consider the locks in your scenario - it is crucial for the complete model description: How agent_1 (in your terms) rebuilds global cache: 1a) lock() 1b) rebuild(global cache) 1c) update(dev_gen) 1d) wmb() 1e) unlock() How agent_2 checks: 2a) check(dev_gen) (assume positive - changed) 2b) clear(local_cache) 2c) miss(on empty local_cache) - > eventually it goes to mr_lookup_caches() 2d) lock() 2e) get(new MR) 2f) unlock() 2g) update(local cache with obtained new MR) Hence, even if 1c) becomes visible in 2a) before 1b) is committed (say, due to an out-of-order Arch) - agent_2 would be blocked on 2d) and the scenario depicted in your Fig2 would not happen (agent_2 will wait before step 3 till agent_1 unlocks after its step 5). With best regards, Slava > -----Original Message----- > From: Feifei Wang <Feifei.Wang2@arm.com> > Sent: Friday, May 7, 2021 9:36 > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > cache > > Hi, Slava > > Thanks very much for your reply. 
> > > -----邮件原件----- > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > 发送时间: 2021年5月6日 19:22 > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > cache > > > > Hi, Feifei > > > > Sorry, I do not follow why we should get rid of the last (after > > dev_gen update) wmb. > > We've rebuilt the global cache, we should notify other agents it's > > happened and they should flush local caches. So, dev_gen change should > > be made visible to other agents to trigger this activity and the > > second wmb is here to ensure this. > > 1. For the first problem why we should get rid of the last wmb and move it > before dev_gen updated, I think our attention is how the wmb implements > the synchronization between multiple agents. > Fig1 > ---------------------------------------------------------------------------------------------- > ------- > Timeslot agent_1 agent_2 > 1 rebuild global cache > 2 wmb > 3 update dev_gen ----------------------- load changed > dev_gen > 4 rebuild local cache > ---------------------------------------------------------------------------------------------- > ------- > > First, wmb is only for local thread to keep the order between local write- > write : > Based on the picture above, for agent_1, wmb keeps the order that > rebuilding global cache is always before updating dev_gen. > > Second, agent_1 communicates with agent_2 by the global variable > "dev_gen" : > If agent_1 updates dev_gen, agent_2 will load it and then it knows it should > rebuild local cache > > Finally, agent_2 rebuilds local cache according to whether agent_1 has rebuilt > global cache, and agent_2 knows this information by the variable "dev_gen". 
> Fig2 > ---------------------------------------------------------------------------------------------- > ------- > Timeslot agent_1 agent_2 > 1 update dev_gen > 2 load changed dev_gen > 3 rebuild local cache > 4 rebuild global cache > 5 wmb > ---------------------------------------------------------------------------------------------- > ------- > > However, in arm platform, if wmb is after dev_gen updated, "dev_gen" may > be updated before agent_1 rebuilding global cache, then agent_2 maybe > receive error message and rebuild its local cache in advance. > > To summarize, it is not important which time other agents can see the > changed global variable "dev_gen". > (Actually, wmb after "dev_gen" cannot ensure changed "dev_gen" is > committed to the global). > It is more important that if other agents see the changed "dev_gen", they > also can know global cache has been updated. > > > One more point, due to registering new/destroying existing MR involves > > FW (via kernel) calls, it takes so many CPU cycles that we could > > neglect wmb overhead at all. > > We just move the last wmb into the right place, and not delete it for > performance. > > > > > Also, regarding this: > > > > > > Another question suddenly occurred to me, in order to keep the > > > > order that rebuilding global cache before updating ”dev_gen“, the > > > > wmb should be before updating "dev_gen" rather than after it. > > > > Otherwise, in the out-of-order platforms, current order cannot be > kept. > > > > it is not clear why ordering is important - global cache update and > > dev_gen change happen under spinlock protection, so only the last wmb > > is meaningful. > > > > 2. The second function of wmb before "dev_gen" updated is for > performance according to our previous discussion. > According to Fig2, if there is no wmb between "global cache updated" and > "dev_gen updated", "dev_gen" may update before global cache updated. 
> Then agent_2 may see the changed "dev_gen" and flush its entire local cache
> in advance.
>
> This entire flush can degrade the performance:
> "the local cache is getting empty and can't provide translation for other valid
> (not being removed) MRs, and the translation has to look up in the global
> cache, that is locked now for rebuilding, this causes the delays in data path on
> acquiring global cache lock."
>
> Furthermore, the spinlock only protects the global cache, not dev_gen and
> the local cache.
>
> > To summarize, in my opinion:
> > - if you see some issue with ordering of global cache update/dev_gen
> >   signalling, could you, please, elaborate? I'm not sure we should
> >   maintain an order (due to spinlock protection)
> > - the last rte_smp_wmb() after dev_gen incrementing should be kept
> >   intact
>
> Finally, in my view, moving the wmb before the "dev_gen" update, for the
> write-write order, serves two purposes:
> --------------------------------
> a) rebuild global cache;
> b) rte_smp_wmb();
> c) update dev_gen
> --------------------------------
> 1. It achieves synchronization between multiple threads in the right way.
> 2. It prevents other agents from flushing their local cache early, which
>    preserves performance.
>
> Best Regards
> Feifei
>
> > With best regards,
> > Slava
> >
> > > -----Original Message-----
> > > From: Feifei Wang <Feifei.Wang2@arm.com>
> > > Sent: Thursday, May 6, 2021 5:52
> > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad
> > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > > Region cache
> > >
> > > Hi, Slava
> > >
> > > Would you have more comments about this patch?
> > > In my view, only one wmb before the "dev_gen" update is enough to
> > > synchronize.
> > >
> > > Thanks very much for your attention.
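[Editor's note] The a)/b)/c) ordering argued for in this thread, versus the original trailing-barrier placement, can be contrasted in a short sketch. Hedged assumptions: `rte_smp_wmb()` is modeled here with a C11 release fence (the replacement this very patch series proposes), and the variables are illustrative, not the driver's:

```c
#include <stdatomic.h>

static int cache;        /* models the global MR cache */
static atomic_uint gen;  /* models dev_gen             */

/* Placement argued for in this thread: the barrier sits BETWEEN the data
 * write and the guard write, so the guard can never become visible to
 * another core before the data it guards. */
void rebuild_barrier_before_guard(int v)
{
    cache = v;                                                 /* a) rebuild       */
    atomic_thread_fence(memory_order_release);                 /* b) wmb           */
    atomic_fetch_add_explicit(&gen, 1, memory_order_relaxed);  /* c) update guard  */
}

/* Original placement: nothing orders a) against c), so on a weakly
 * ordered CPU (e.g. Arm) the guard update c) may become visible first -
 * the Fig2 case. The trailing fence only orders these writes against
 * LATER writes by the same thread. */
void rebuild_barrier_after_guard(int v)
{
    cache = v;                                                 /* a) */
    atomic_fetch_add_explicit(&gen, 1, memory_order_relaxed);  /* c) */
    atomic_thread_fence(memory_order_release);                 /* trailing wmb */
}
```

Both functions produce the same single-threaded result; the difference is only in what a concurrent observer on a weakly ordered machine is permitted to see.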
> > >
> > > Best Regards
> > > Feifei
> > >
> > > > -----Original Message-----
> > > > From: Feifei Wang
> > > > Sent: April 20, 2021 16:42
> > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad
> > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > > > Region cache
> > > >
> > > > Hi, Slava
> > > >
> > > > I think the second wmb can be removed.
> > > > As far as I know, a wmb is just a barrier to keep the order between
> > > > writes, and it cannot tell the CPU when it should commit the changes.
> > > >
> > > > It is usually used before a guard variable, to keep the order that the
> > > > guard variable is updated after the changes you want to release have
> > > > been done.
> > > >
> > > > For example, for the wmb after the global cache update/before altering
> > > > dev_gen, it can ensure the order that the global cache is updated
> > > > before dev_gen is altered:
> > > > 1) If another agent loads the changed "dev_gen", it can know the
> > > > global cache has been updated.
> > > > 2) If another agent loads the unchanged "dev_gen", it means the global
> > > > cache has not been updated, and the local cache will not be flushed.
> > > >
> > > > As a result, we use the wmb and the guard variable "dev_gen" to make
> > > > the global cache update "visible".
> > > > "Visible" means that when the update of the guard variable "dev_gen"
> > > > is seen by other agents, they can also confirm that the global cache
> > > > has been updated in the meanwhile. Thus, just one wmb before altering
> > > > dev_gen can ensure this.
> > > >
> > > > Best Regards
> > > > Feifei
> > > >
> > > > > -----Original Message-----
> > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > > > Sent: April 20, 2021 15:54
> > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad
> > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd
> > > > > <nd@arm.com>
> > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > > > > Region cache
> > > > >
> > > > > Hi, Feifei
> > > > >
> > > > > In my opinion, there should be 2 barriers:
> > > > > - after global cache update/before altering dev_gen, to ensure
> > > > >   the correct order
> > > > > - after altering dev_gen to make this change visible for other
> > > > >   agents and to trigger local cache update
> > > > >
> > > > > With best regards,
> > > > > Slava
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Feifei Wang <Feifei.Wang2@arm.com>
> > > > > > Sent: Tuesday, April 20, 2021 10:30
> > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad
> > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd
> > > > > > <nd@arm.com>
> > > > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for
> > > > > > Memory Region cache
> > > > > >
> > > > > > Hi, Slava
> > > > > >
> > > > > > Another question suddenly occurred to me: in order to keep the
> > > > > > order that the global cache is rebuilt before "dev_gen" is
> > > > > > updated, the wmb should be before updating "dev_gen" rather
> > > > > > than after it. Otherwise, on out-of-order platforms, the
> > > > > > current order cannot be kept.
> > > > > >
> > > > > > Thus, we should change the code to:
> > > > > > a) rebuild global cache;
> > > > > > b) rte_smp_wmb();
> > > > > > c) update dev_gen
> > > > > >
> > > > > > Best Regards
> > > > > > Feifei
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Feifei Wang
> > > > > > > Sent: April 20, 2021 13:54
> > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad
> > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> > > > > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > > > > > > Region cache
> > > > > > >
> > > > > > > Hi, Slava
> > > > > > >
> > > > > > > Thanks very much for your explanation.
> > > > > > >
> > > > > > > I understand that the app can wait until all mbufs are returned
> > > > > > > to the memory pool, and only then free them; I agree with this.
> > > > > > >
> > > > > > > As a result, I will remove the bug fix patch from this series
> > > > > > > and just replace the SMP barrier with a C11 thread fence. Thanks
> > > > > > > very much for your patient explanation again.
> > > > > > >
> > > > > > > Best Regards
> > > > > > > Feifei
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > > > > > > Sent: April 20, 2021 2:51
> > > > > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad
> > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for
> > > > > > > > Memory Region cache
> > > > > > > >
> > > > > > > > Hi, Feifei
> > > > > > > >
> > > > > > > > Please, see below
> > > > > > > >
> > > > > > > > ....
> > > > > > > > > > Hi, Feifei
> > > > > > > > > >
> > > > > > > > > > Sorry, I do not follow what this patch fixes. Do we have some issue/bug with MR cache in practice?
> > > > > > > > >
> > > > > > > > > This patch fixes a bug found by logical deduction; it does not actually happen in practice.
> > > > > > > > >
> > > > > > > > > > Each Tx queue has its own dedicated "local" cache for MRs to convert buffer addresses in mbufs being transmitted to LKeys (HW-related entity handles), and the "global" cache for all MRs registered on the device.
> > > > > > > > > >
> > > > > > > > > > AFAIK, how conversion happens in datapath:
> > > > > > > > > > - check the local queue cache flush request
> > > > > > > > > > - lookup in local cache
> > > > > > > > > > - if not found:
> > > > > > > > > >   - acquire lock for global cache read access
> > > > > > > > > >   - lookup in global cache
> > > > > > > > > >   - release lock for global cache
> > > > > > > > > >
> > > > > > > > > > How cache update on memory freeing/unregistering happens:
> > > > > > > > > > - acquire lock for global cache write access
> > > > > > > > > > - [a] remove relevant MRs from the global cache
> > > > > > > > > > - [b] set local caches flush request
> > > > > > > > > > - free global cache lock
> > > > > > > > > >
> > > > > > > > > > If I understand correctly, your patch swaps [a] and [b], and the local cache flush is requested earlier. What problem does it solve?
> > > > > > > > > > It is not supposed that there are mbufs in the datapath referencing the memory being freed. The application must ensure this and must not allocate new mbufs from the memory regions being freed.
> > > > > > > > > > Hence, the lookups for these MRs in caches should not occur.
> > > > > > > > >
> > > > > > > > > On your first point, that the application can take charge of preventing MR-freed memory from being allocated to the data path:
> > > > > > > > >
> > > > > > > > > Does it mean that if there is an emergency of MR fragmentation, such as hotplug, the application must inform the data path in advance, this memory will not be allocated, and then the control path will free this memory? If the application can do this, I agree that this bug cannot happen.
> > > > > > > >
> > > > > > > > Actually, this is the only correct way for the application to operate.
> > > > > > > > Let's suppose we have some memory area that the application wants to free. ALL references to this area must be removed. If we have some mbufs allocated from this area, it means that we have a memory pool created there.
> > > > > > > >
> > > > > > > > What the application should do:
> > > > > > > > - notify all its components/agents that the memory area is going to be freed
> > > > > > > > - all components/agents free the mbufs they might own
> > > > > > > > - PMD might not support freeing for some mbufs (for example, being sent and awaiting completion), so the app should just wait
> > > > > > > > - wait till all mbufs are returned to the memory pool (by monitoring available obj == pool size)
> > > > > > > >
> > > > > > > > Otherwise, it is dangerous to free the memory. If some mbufs are still allocated, this holds regardless of buf-address-to-MR translation: we just can't free the memory - the mapping will be destroyed and might cause a segmentation fault in SW or HW issues on DMA access to unmapped memory. It is a very generic safety approach - do not free memory that is still in use. Hence, at the moment of freeing and unregistering the MR, there MUST BE NO mbufs in flight referencing the addresses being freed.
> > > > > > > > No translation to an MR being invalidated can happen.
> > > > > > > >
> > > > > > > > > > On the other side, the cache flush has a negative effect - the local cache is getting empty and can't provide translation for other valid (not being removed) MRs, and the translation has to look up in the global cache, which is locked now for rebuilding; this causes delays in the data path on acquiring the global cache lock.
> > > > > > > > > > So, I see some potential performance impact.
> > > > > > > > >
> > > > > > > > > If the above assumption is true, we can go to your second point. I think this is a problem of the tradeoff between cache coherence and performance.
> > > > > > > > >
> > > > > > > > > I can understand your meaning that though the global cache has been changed, we should keep the valid MRs in the local cache as long as possible to keep lookups fast. In the meanwhile, the local cache can be rebuilt later to reduce its waiting time for acquiring the global cache lock.
> > > > > > > > >
> > > > > > > > > However, this mechanism only keeps the performance unchanged for the first few mbufs. During the lkey searches for the next mbufs after 'dev_gen' is updated, it is still necessary to update the local cache, so the performance first drops and then recovers. Thus, no matter whether this patch is applied or not, the performance will jitter for a certain period of time.
> > > > > > > >
> > > > > > > > The local cache should be updated to remove MRs that are no longer valid. But we just flush the entire cache.
> > > > > > > > Let's suppose we have valid MR0, MR1, and invalid MRX in the local cache, and there is traffic in the datapath for MR0 and MR1, but no traffic for MRX anymore.
> > > > > > > >
> > > > > > > > 1) If we do as you propose:
> > > > > > > >   a) take a lock
> > > > > > > >   b) request flush of the local cache first - all of MR0, MR1, MRX will be removed on translation in the datapath
> > > > > > > >   c) update the global cache
> > > > > > > >   d) free the lock
> > > > > > > > All the traffic for the valid MR0 and MR1 will ALWAYS be blocked on the lock taken for the cache update, from point b) till point d).
> > > > > > > >
> > > > > > > > 2) If we do as it is implemented now:
> > > > > > > >   a) take a lock
> > > > > > > >   b) update the global cache
> > > > > > > >   c) request flush of the local cache
> > > > > > > >   d) free the lock
> > > > > > > > The traffic MIGHT be blocked ONLY for MRs not existing in the local cache (this does not happen for MR0 and MR1, and must not happen for MRX), so the probability should be minor. And the blocking can only happen from c) till d) - quite a short period of time.
> > > > > > > >
> > > > > > > > Summary, the difference between 1) and 2):
> > > > > > > >
> > > > > > > > Lock probability:
> > > > > > > > - 1) lock ALWAYS happens for ANY MR translation after b);
> > > > > > > >   2) lock MIGHT happen, for cache misses ONLY, after c)
> > > > > > > >
> > > > > > > > Lock duration:
> > > > > > > > - 1) lock lasts from b) till d);
> > > > > > > >   2) lock lasts from c) till d), which seems to be much shorter.
> > > > > > > >
> > > > > > > > > Finally, in conclusion, I tend to think that the bottom layer can do more to ensure the correct execution of the program, which may have a negative impact on performance in the short term, but in the long run the performance will eventually come back. Furthermore, maybe we should pay attention to the performance in the stable period, and try our best to ensure the correctness of the program in case of emergencies.
> > > > > > > >
> > > > > > > > If we have some mbufs still allocated in memory being freed, there is nothing to say about correctness - it is totally incorrect. In my opinion, we should not think about how to mitigate this incorrect behavior; we should not encourage application developers to follow wrong approaches.
> > > > > > > >
> > > > > > > > With best regards,
> > > > > > > > Slava
> > > > > > > >
> > > > > > > > > Best Regards
> > > > > > > > > Feifei
> > > > > > > > >
> > > > > > > > > > With best regards,
> > > > > > > > > > Slava

^ permalink raw reply	[flat|nested] 36+ messages in thread
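[Editor's note] Both sides of the exchange above lean on the fact that cache update and dev_gen change happen under lock protection, and that unlocking publishes the writes made inside the critical section. A minimal sketch of that claim, using a C11 spinlock whose unlock carries release semantics - a hypothetical model with illustrative names, not the mlx5 implementation:

```c
#include <stdatomic.h>

static atomic_flag mr_lock = ATOMIC_FLAG_INIT;
static int mr_cache;    /* models the global MR cache            */
static unsigned mr_gen; /* models dev_gen; lock-protected here   */

/* A minimal spinlock: acquire ordering on lock, release on unlock.
 * The release in unlock plays the same ordering role as a trailing
 * wmb - every write made while holding the lock is visible to the
 * next thread that takes the lock. */
static void mr_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&mr_lock, memory_order_acquire))
        ; /* spin */
}

static void mr_lock_release(void)
{
    atomic_flag_clear_explicit(&mr_lock, memory_order_release);
}

void mr_writer_rebuild(int new_cache)
{
    mr_lock_acquire();
    mr_cache = new_cache;  /* rebuild global cache */
    mr_gen++;              /* update dev_gen       */
    mr_lock_release();     /* release: publishes both writes */
}

int mr_reader_lookup(unsigned *gen_out)
{
    int v;

    mr_lock_acquire();     /* blocks until the writer unlocks */
    v = mr_cache;
    *gen_out = mr_gen;
    mr_lock_release();
    return v;
}
```

In this model, a reader that enters the critical section after the writer necessarily sees both the rebuilt cache and the new generation counter; the ordering debate in this thread matters only for readers that check dev_gen outside the lock.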
* [dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
  2021-05-07 10:14 ` [dpdk-dev] " Slava Ovsiienko
@ 2021-05-08  3:13 ` Feifei Wang
  2021-05-11  8:18 ` [dpdk-dev] " Slava Ovsiienko
  0 siblings, 1 reply; 36+ messages in thread
From: Feifei Wang @ 2021-05-08  3:13 UTC (permalink / raw)
To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler
Cc: dev, nd, stable, Ruifeng Wang, nd, nd

Hi, Slava

Thanks for your explanation. So we can ignore the order between the global
cache update and the dev_gen update, because of the R/W lock. Furthermore,
it is unnecessary to keep the wmb: I think the last wmb (1d) can be removed,
for two reasons:

1. A wmb has only one function: it keeps the write-write order for the local
thread. It cannot ensure that the writes above it are seen by other threads.

2. rwunlock (1e) contains an atomic release operation; I think this release
operation plays the same ordering role as the last wmb.

Best Regards
Feifei

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: May 7, 2021 18:15
> To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
>
> Hi, Feifei
>
> We should consider the locks in your scenario - it is crucial for the complete
> model description:
>
> How agent_1 (in your terms) rebuilds the global cache:
>
> 1a) lock()
> 1b) rebuild(global cache)
> 1c) update(dev_gen)
> 1d) wmb()
> 1e) unlock()
>
> How agent_2 checks:
>
> 2a) check(dev_gen) (assume positive - changed)
> 2b) clear(local_cache)
> 2c) miss(on empty local_cache) - eventually it goes to mr_lookup_caches()
> 2d) lock()
> 2e) get(new MR)
> 2f) unlock()
> 2g) update(local cache with obtained new MR)
>
> Hence, even if 1c) becomes visible in 2a) before 1b) is committed (say, due to
> an out-of-order Arch) - the agent 2 would be
blocked on 2d) and scenario > depicted on your Fig2 would not happen (agent_2 will wait before step 3 till > agent 1 unlocks after its step 5). > > With best regards, > Slava > > > -----Original Message----- > > From: Feifei Wang <Feifei.Wang2@arm.com> > > Sent: Friday, May 7, 2021 9:36> To: Slava Ovsiienko > > <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf > > Shuler <shahafs@nvidia.com> > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > Region cache > > > > Hi, Slava > > > > Thanks very much for your reply. > > > > > -----邮件原件----- > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > 发送时间: 2021年5月6日 19:22 > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > Wang > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > cache > > > > > > Hi, Feifei > > > > > > Sorry, I do not follow why we should get rid of the last (after > > > dev_gen update) wmb. > > > We've rebuilt the global cache, we should notify other agents it's > > > happened and they should flush local caches. So, dev_gen change > > > should be made visible to other agents to trigger this activity and > > > the second wmb is here to ensure this. > > > > 1. For the first problem why we should get rid of the last wmb and > > move it before dev_gen updated, I think our attention is how the wmb > > implements the synchronization between multiple agents. 
> > Fig1 > > ---------------------------------------------------------------------- > > ------------------------ > > ------- > > Timeslot agent_1 agent_2 > > 1 rebuild global cache > > 2 wmb > > 3 update dev_gen ----------------------- load changed > > dev_gen > > 4 rebuild local cache > > ---------------------------------------------------------------------- > > ------------------------ > > ------- > > > > First, wmb is only for local thread to keep the order between local > > write- write : > > Based on the picture above, for agent_1, wmb keeps the order that > > rebuilding global cache is always before updating dev_gen. > > > > Second, agent_1 communicates with agent_2 by the global variable > > "dev_gen" : > > If agent_1 updates dev_gen, agent_2 will load it and then it knows it > > should rebuild local cache > > > > Finally, agent_2 rebuilds local cache according to whether agent_1 has > > rebuilt global cache, and agent_2 knows this information by the variable > "dev_gen". > > Fig2 > > ---------------------------------------------------------------------- > > ------------------------ > > ------- > > Timeslot agent_1 agent_2 > > 1 update dev_gen > > 2 load changed dev_gen > > 3 rebuild local cache > > 4 rebuild global cache > > 5 wmb > > ---------------------------------------------------------------------- > > ------------------------ > > ------- > > > > However, in arm platform, if wmb is after dev_gen updated, "dev_gen" > > may be updated before agent_1 rebuilding global cache, then agent_2 > > maybe receive error message and rebuild its local cache in advance. > > > > To summarize, it is not important which time other agents can see the > > changed global variable "dev_gen". > > (Actually, wmb after "dev_gen" cannot ensure changed "dev_gen" is > > committed to the global). > > It is more important that if other agents see the changed "dev_gen", > > they also can know global cache has been updated. 
> > > > > One more point, due to registering new/destroying existing MR > > > involves FW (via kernel) calls, it takes so many CPU cycles that we > > > could neglect wmb overhead at all. > > > > We just move the last wmb into the right place, and not delete it for > > performance. > > > > > > > > Also, regarding this: > > > > > > > > Another question suddenly occurred to me, in order to keep the > > > > > > > > order that rebuilding global cache before updating ”dev_gen“, the > > > > > wmb should be before updating "dev_gen" rather than after it. > > > > > Otherwise, in the out-of-order platforms, current order cannot > > > be > > kept. > > > > > > it is not clear why ordering is important - global cache update and > > > dev_gen change happen under spinlock protection, so only the last > > > wmb is meaningful. > > > > > > > 2. The second function of wmb before "dev_gen" updated is for > > performance according to our previous discussion. > > According to Fig2, if there is no wmb between "global cache updated" > > and "dev_gen updated", "dev_gen" may update before global cache > updated. > > > > Then agent_2 may see the changed "dev_gen" and flush entire local > > cache in advance. > > > > This entire flush can degrade the performance: > > "the local cache is getting empty and can't provide translation for > > other valid (not being removed) MRs, and the translation has to look > > up in the global cache, that is locked now for rebuilding, this causes > > the delays in data path on acquiring global cache lock." > > > > Furthermore, spinlock is just for global cache, not for dev_gen and > > local cache. > > > > > To summarize, in my opinion: > > > - if you see some issue with ordering of global cache update/dev_gen > > > signalling, > > > could you, please, elaborate? 
I'm not sure we should maintain an > > > order (due to spinlock protection) > > > - the last rte_smp_wmb() after dev_gen incrementing should be kept > > > intact > > > > > > > At last, for my view, there are two functions that moving wmb before > > "dev_gen" > > for the write-write order: > > -------------------------------- > > a) rebuild global cache; > > b) rte_smp_wmb(); > > c) updating dev_gen > > -------------------------------- > > 1. Achieve synchronization between multiple threads in the right way 2. > > Prevent other agents from flushing local cache early to ensure > > performance > > > > Best Regards > > Feifei > > > > > With best regards, > > > Slava > > > > > > > -----Original Message----- > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > Sent: Thursday, May 6, 2021 5:52 > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > Wang > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > Region cache > > > > > > > > Hi, Slava > > > > > > > > Would you have more comments about this patch? > > > > For my sight, only one wmb before "dev_gen" updating is enough to > > > > synchronize. > > > > > > > > Thanks very much for your attention. > > > > > > > > > > > > Best Regards > > > > Feifei > > > > > > > > > -----邮件原件----- > > > > > 发件人: Feifei Wang > > > > > 发送时间: 2021年4月20日 16:42 > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > Wang > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > Region > > > > > cache > > > > > > > > > > Hi, Slava > > > > > > > > > > I think the second wmb can be removed. 
> > > > > As I know, wmb is just a barrier to keep the order between write > > > > > and > > > write. > > > > > and it cannot tell the CPU when it should commit the changes. > > > > > > > > > > It is usually used before guard variable to keep the order that > > > > > updating guard variable after some changes, which you want to > > > > > release, > > > > have been done. > > > > > > > > > > For example, for the wmb after global cache update/before > > > > > altering dev_gen, it can ensure the order that updating global > > > > > cache before altering > > > > > dev_gen: > > > > > 1)If other agent load the changed "dev_gen", it can know the > > > > > global cache has been updated. > > > > > 2)If other agents load the unchanged, "dev_gen", it means the > > > > > global cache has not been updated, and the local cache will not > > > > > be > > flushed. > > > > > > > > > > As a result, we use wmb and guard variable "dev_gen" to ensure > > > > > the global cache updating is "visible". > > > > > The "visible" means when updating guard variable "dev_gen" is > > > > > known by other agents, they also can confirm global cache has > > > > > been updated in the meanwhile. Thus, just one wmb before > > > > > altering dev_gen can ensure > > > > this. 
> > > > > > > > > > Best Regards > > > > > Feifei > > > > > > > > > > > -----邮件原件----- > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > 发送时间: 2021年4月20日 15:54 > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > Ruifeng > > > > Wang > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > <nd@arm.com>; > > > nd > > > > > > <nd@arm.com> > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > > Region cache > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > In my opinion, there should be 2 barriers: > > > > > > - after global cache update/before altering dev_gen, to > > > > > > ensure the correct order > > > > > > - after altering dev_gen to make this change visible for > > > > > > other agents and to trigger local cache update > > > > > > > > > > > > With best regards, > > > > > > Slava > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > Sent: Tuesday, April 20, 2021 10:30 > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > > Wang > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > <nd@arm.com>; > > > > nd > > > > > > > <nd@arm.com> > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > Memory Region cache > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > Another question suddenly occurred to me, in order to keep > > > > > > > the order that rebuilding global cache before updating > > > > > > > ”dev_gen“, the wmb should be before updating "dev_gen" rather > than after it. > > > > > > > Otherwise, in the out-of-order platforms, current order > > > > > > > cannot be > > > kept. 
> > > > > > > > > > > > > > Thus, we should change the code as: > > > > > > > a) rebuild global cache; > > > > > > > b) rte_smp_wmb(); > > > > > > > c) updating dev_gen > > > > > > > > > > > > > > Best Regards > > > > > > > Feifei > > > > > > > > -----邮件原件----- > > > > > > > > 发件人: Feifei Wang > > > > > > > > 发送时间: 2021年4月20日 13:54 > > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > Azrad > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > Ruifeng > > > > > > Wang > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > <nd@arm.com> > > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > Memory > > > > > Region > > > > > > > > cache > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > Thanks very much for your explanation. > > > > > > > > > > > > > > > > I can understand the app can wait all mbufs are returned > > > > > > > > to the memory pool, and then it can free this mbufs, I > > > > > > > > agree with > > this. > > > > > > > > > > > > > > > > As a result, I will remove the bug fix patch from this > > > > > > > > series and just replace the smp barrier with C11 thread > > > > > > > > fence. Thanks very much for your patient explanation again. 
> > > > > > > > > > > > > > > > Best Regards > > > > > > > > Feifei > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > > 发送时间: 2021年4月20日 2:51 > > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > Ruifeng > > > > > > > Wang > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > Please, see below > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. Do we > > > > > > > > > > > have some issue/bug with MR cache in practice? > > > > > > > > > > > > > > > > > > > > This patch fixes the bug which is based on logical > > > > > > > > > > deduction, and it doesn't actually happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache > > > > > > > > > > > for MRs to convert buffer address in mbufs being > > > > > > > > > > > transmitted to LKeys (HW-related entity > > > > > > > > > > > handle) and the "global" cache for all MR registered > > > > > > > > > > > on the > > > > device. 
> > > > > > > > > > > AFAIK, how the conversion happens in the datapath: > > > > > > > > > > > - check the local queue cache flush request > > > > > > > > > > > - lookup in local cache > > > > > > > > > > > - if not found: > > > > > > > > > > > - acquire lock for global cache read access > > > > > > > > > > > - lookup in global cache > > > > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > > > > > > > How the cache update on memory freeing/unregistering happens: > > > > > > > > > > > - acquire lock for global cache write access > > > > > > > > > > > - [a] remove relevant MRs from the global cache > > > > > > > > > > > - [b] set local caches flush request > > > > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps [a] and [b], and the local caches flush is requested earlier. What problem does it solve? It is not supposed that there are some mbufs in the datapath referencing the memory being freed. The application must ensure this and must not allocate new mbufs from these memory regions being freed. Hence, the lookups for these MRs in the caches should not occur. > > > > > > > > > > > > > > > > > > > > For your first point, that the application can take charge of preventing MR-freed memory from being allocated to the data path: > > > > > > > > > > > > > > > > > > > > Does it mean that if there is an emergency of MR fragment, such as hotplug, the application must inform the data path in advance, and this memory will not be allocated, and then the control path will free this memory? If the application can do like this, I agree that this bug > > > > > > > > cannot happen. 
> > > > > > > > > > > > > > > > > > Actually, this is the only correct way for application to operate. > > > > > > > > > Let's suppose we have some memory area that application > > > > > > > > > wants to > > > > > > free. > > > > > > > > > ALL references to this area must be removed. If we have > > > > > > > > > some mbufs allocated from this area, it means that we > > > > > > > > > have memory pool created > > > > > > > there. > > > > > > > > > > > > > > > > > > What application should do: > > > > > > > > > - notify all its components/agents the memory area is > > > > > > > > > going to be freed > > > > > > > > > - all components/agents free the mbufs they might own > > > > > > > > > - PMD might not support freeing for some mbufs (for > > > > > > > > > example being sent and awaiting for completion), so app > > > > > > > > > should just wait > > > > > > > > > - wait till all mbufs are returned to the memory pool > > > > > > > > > (by monitoring available obj == pool size) > > > > > > > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. There > > > > > > > > > are just some mbufs still allocated, it is regardless to > > > > > > > > > buf address to MR translation. We just can't free the > > > > > > > > > memory - the mapping will be destroyed and might cause > > > > > > > > > the segmentation fault by SW or some HW issues on DMA > > > > > > > > > access to unmapped memory. It is very generic safety > > > > > > > > > approach - do not free the memory that is still in > > > > use. > > > > > > > > > Hence, at the moment of freeing and unregistering the > > > > > > > > > MR, there MUST BE NO any > > > > > > > > mbufs in flight referencing to the addresses being freed. > > > > > > > > > No translation to MR being invalidated can happen. 
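The "wait till all mbufs are returned to the memory pool (by monitoring available obj == pool size)" step above can be modeled with a trivial counter. In real DPDK this is what comparing `rte_mempool_avail_count(mp)` against the pool size amounts to; the struct below is a stand-in for illustration, not the `rte_mempool` API.

```c
#include <assert.h>

/* Minimal stand-in for a mempool: 'avail' counts objects currently
 * sitting in the pool, 'size' is the total number of objects. */
struct pool {
	unsigned int size;
	unsigned int avail;
};

/* Allocate one object (e.g. an mbuf) from the pool. */
static int pool_get(struct pool *p)
{
	if (p->avail == 0)
		return -1;
	p->avail--;
	return 0;
}

/* Return one object to the pool. */
static void pool_put(struct pool *p)
{
	p->avail++;
}

/* It is safe to free/unregister the backing memory only when every
 * object is back in the pool - i.e. no mbuf is in flight. */
static int pool_drained(const struct pool *p)
{
	return p->avail == p->size;
}
```

An application following Slava's procedure would loop on the `pool_drained()` condition (possibly with a sleep) after notifying its components, and only then free the memory area and let the MR be unregistered.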
> > > > > > > > > > > > For the other side, the cache flush has a negative effect - the local cache is getting empty and can't provide translation for other valid (not being removed) MRs, and the translation has to look up in the global cache, which is locked now for rebuilding; this causes delays in the datapath > > > > > > > > > > on acquiring the global cache lock. > > > > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > > > > > > > If the above assumption is true, we can go to your second point. > > > > > > > > > > I think this is a problem of the tradeoff between cache coherence and > > > > > > > > > performance. > > > > > > > > > > > > > > > > > > > > I can understand your meaning that though the global cache has been changed, we should keep the valid MRs in the local cache as long as possible to ensure the fast searching speed. In the meanwhile, the local cache can be rebuilt later to reduce its waiting time for acquiring the global cache lock. > > > > > > > > > > > > > > > > > > > > However, this mechanism just keeps the performance unchanged for the first few mbufs. During the lkey searching for the next mbufs after 'dev_gen' is updated, it is still necessary to update the local cache. So the performance will first drop and then recover. Thus, no matter whether there is this patch or not, the performance will jitter for a certain period of > > > > > > > time. > > > > > > > > > > > > > > > > > > Local cache should be updated to remove MRs no longer valid. > > > > > > > > > But we just flush the entire cache. 
> > > > > > > > > Let's suppose we have valid MR0, MR1, and not valid MRX > > > > > > > > > in local > > > > > cache. > > > > > > > > > And there are traffic in the datapath for MR0 and MR1, > > > > > > > > > and no traffic for MRX anymore. > > > > > > > > > > > > > > > > > > 1) If we do as you propose: > > > > > > > > > a) take a lock > > > > > > > > > b) request flush local cache first - all MR0, MR1, MRX > > > > > > > > > will be removed on translation in datapath > > > > > > > > > c) update global cache, > > > > > > > > > d) free lock > > > > > > > > > All the traffic for valid MR0, MR1 ALWAYS will be > > > > > > > > > blocked on lock taken for cache update since point b) till point > d). > > > > > > > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > > > > a) take a lock > > > > > > > > > b) update global cache > > > > > > > > > c) request flush local cache > > > > > > > > > d) free lock > > > > > > > > > The traffic MIGHT be locked ONLY for MRs non-existing in > > > > > > > > > local cache (not happens for MR0 and MR1, must not > > > > > > > > > happen for MRX), and probability should be minor. And > > > > > > > > > lock might happen since > > > > > > > > > c) till > > > > > > > > > d) > > > > > > > > > - quite short period of time > > > > > > > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > > > > > > > Lock probability: > > > > > > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > > > > > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > > > > > > > > > > > > > > > Lock duration: > > > > > > > > > - 1) lock since b) till d), > > > > > > > > > 2) lock since c) till d), that seems to be much shorter. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the > > > > > > > > > > bottom layer can do more things to ensure the correct > > > > > > > > > > execution of the program, which may have a negative > > > > > > > > > > impact on the performance in a short time, but in the > > > > > > > > > > long run, the performance will eventually > > > > > > come back. > > > > > > > > > > Furthermore, maybe we should pay attention to the > > > > > > > > > > performance in the stable period, and try our best to > > > > > > > > > > ensure the correctness of the program in case of > > > > > > > > > emergencies. > > > > > > > > > > > > > > > > > > If we have some mbufs still allocated in memory being > > > > > > > > > freed > > > > > > > > > - there is nothing to say about correctness, it is > > > > > > > > > totally incorrect. In my opinion, we should not think > > > > > > > > > how to mitigate this incorrect behavior, we should not > > > > > > > > > encourage application developers to follow the wrong > > > > > > > > approaches. > > > > > > > > > > > > > > > > > > With best regards, > > > > > > > > > Slava > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > With best regards, > > > > > > > > > > > Slava ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-05-08 3:13 ` [dpdk-dev] Re: " Feifei Wang @ 2021-05-11 8:18 ` Slava Ovsiienko 2021-05-12 5:34 ` [dpdk-dev] Re: " Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Slava Ovsiienko @ 2021-05-11 8:18 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd, nd Hi, Feifei Please, see below. > -----Original Message----- > From: Feifei Wang <Feifei.Wang2@arm.com> > Sent: Saturday, May 8, 2021 6:13 > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com> > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > Hi, Slava > > Thanks for your explanation. > > Thus we can ignore the order between updating the global cache and updating dev_gen, due to the R/W lock. Yes, exactly. > > Furthermore, it is unnecessary to keep the wmb, and the last wmb (1d), I think, can be removed. > Two reasons for this: > 1. wmb has only one function: it is for the local thread to keep the write-write order. It cannot ensure that the write operations above it can be seen by other threads. > > 2. rwunlock (1e) has an atomic_release operation in it; I think this release operation is the same as the last wmb: it keeps the order. Mmmm... In my understanding, wmb ensures all memory writes are committed and visible to other agents. Without committing, some writes might be visible and others not, and we would get some inconsistent state. In other words - wmb here is rather for consistency, not for order. 
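The disagreement here - "order" versus "commit/visibility" - maps onto the standard C11 message-passing pattern. A release fence (and `rte_smp_wmb()`, which on Arm64 is a store-store barrier, `dmb ishst`) does not flush anything to memory by itself; it guarantees that if another thread observes the flag store, it also observes the stores before the fence, provided the reader pairs it with an acquire. A minimal sketch in plain C11, not driver code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static int payload;      /* plain data, playing the role of the rebuilt cache */
static atomic_int flag;  /* playing the role of dev_gen                       */

static void *writer(void *arg)
{
	(void)arg;
	payload = 42;                               /* "rebuild the cache"        */
	atomic_thread_fence(memory_order_release);  /* order: payload before...   */
	atomic_store_explicit(&flag, 1, memory_order_relaxed); /* ..."dev_gen"    */
	return NULL;
}

static void *reader(void *arg)
{
	(void)arg;
	while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
		;                                       /* wait for the new "dev_gen" */
	atomic_thread_fence(memory_order_acquire);
	/* The fences pair up: once the flag is seen, the payload store is
	 * guaranteed visible too. Without the release fence, a weakly
	 * ordered CPU (e.g. Arm) may make the flag store visible before
	 * the payload store. */
	return (void *)(intptr_t)payload;
}
```

So both positions in the thread are compatible: the fence only constrains ordering, but combined with the reader's acquire side, that ordering is exactly what makes the earlier writes reliably visible once the flag is.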
With best regards, Slava > > Best Regards > Feifei > > > -----Original Message----- > > From: Slava Ovsiienko <viacheslavo@nvidia.com> > > Sent: May 7, 2021 18:15 > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > > > Hi, Feifei > > > > We should consider the locks in your scenario - it is crucial for the complete model description:
> >
> > How agent_1 (in your terms) rebuilds the global cache:
> >
> > 1a) lock()
> > 1b) rebuild(global cache)
> > 1c) update(dev_gen)
> > 1d) wmb()
> > 1e) unlock()
> >
> > How agent_2 checks:
> >
> > 2a) check(dev_gen) (assume positive - changed)
> > 2b) clear(local_cache)
> > 2c) miss(on empty local_cache) -> eventually it goes to mr_lookup_caches()
> > 2d) lock()
> > 2e) get(new MR)
> > 2f) unlock()
> > 2g) update(local cache with obtained new MR)
> >
> > Hence, even if 1c) becomes visible in 2a) before 1b) is committed (say, due to an out-of-order Arch) - agent_2 would be blocked on 2d), and the scenario depicted in your Fig2 would not happen (agent_2 will wait before its step 3 till agent_1 unlocks after its step 5). > > > > With best regards, > > Slava > > > > > -----Original Message----- > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > Sent: Friday, May 7, 2021 9:36 > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > > > > > Hi, Slava > > > > > > Thanks very much for your reply. 
> > > > -----Original Message----- > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > Sent: May 6, 2021 19:22 > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > > > > > > > Hi, Feifei > > > > > > > > Sorry, I do not follow why we should get rid of the last (after dev_gen update) wmb. > > > > We've rebuilt the global cache; we should notify other agents that it happened, and they should flush their local caches. So, the dev_gen change should be made visible to other agents to trigger this activity, and the second wmb is here to ensure this. > > > > > > 1. For the first problem - why we should get rid of the last wmb and move it before dev_gen is updated - I think our attention is on how the wmb implements the synchronization between multiple agents.
> > > Fig1
> > > ----------------------------------------------------------------------
> > > Timeslot        agent_1                        agent_2
> > > 1          rebuild global cache
> > > 2          wmb
> > > 3          update dev_gen ----------------- load changed dev_gen
> > > 4                                           rebuild local cache
> > > ----------------------------------------------------------------------
> > >
> > > First, wmb is only for the local thread, to keep the local write-write order: > > > Based on the picture above, for agent_1, wmb keeps the order that rebuilding the global cache is always before updating dev_gen. 
> > > Second, agent_1 communicates with agent_2 through the global variable "dev_gen": > > > if agent_1 updates dev_gen, agent_2 will load it and then it knows it should rebuild its local cache. > > > > > > Finally, agent_2 rebuilds its local cache according to whether agent_1 has rebuilt the global cache, and agent_2 learns this from the variable "dev_gen".
> > > Fig2
> > > ----------------------------------------------------------------------
> > > Timeslot        agent_1                        agent_2
> > > 1          update dev_gen
> > > 2                                           load changed dev_gen
> > > 3                                           rebuild local cache
> > > 4          rebuild global cache
> > > 5          wmb
> > > ----------------------------------------------------------------------
> > >
> > > However, on the Arm platform, if the wmb is after dev_gen is updated, "dev_gen" may be updated before agent_1 rebuilds the global cache; then agent_2 may receive the wrong message and rebuild its local cache in advance. > > > > > > To summarize, it is not important at which time other agents can see the changed global variable "dev_gen". > > > (Actually, a wmb after "dev_gen" cannot ensure the changed "dev_gen" is committed to the global.) > > > It is more important that if other agents see the changed "dev_gen", they also can know that the global cache has been updated. > > > > > > > One more point: since registering new/destroying existing MRs involves FW (via kernel) calls, it takes so many CPU cycles that we could neglect the wmb overhead at all. > > > > > > We just move the last wmb into the right place; we do not delete it, for performance reasons. > > > > > > > > Also, regarding this: > > > > > > Another question suddenly occurred to me: in order to keep the order that the global cache is rebuilt before "dev_gen" is updated, the wmb should be before updating "dev_gen" rather than after it. 
> > > > > > Otherwise, in the out-of-order platforms, current order > > > > cannot be > > > kept. > > > > > > > > it is not clear why ordering is important - global cache update > > > > and dev_gen change happen under spinlock protection, so only the > > > > last wmb is meaningful. > > > > > > > > > > 2. The second function of wmb before "dev_gen" updated is for > > > performance according to our previous discussion. > > > According to Fig2, if there is no wmb between "global cache updated" > > > and "dev_gen updated", "dev_gen" may update before global cache > > updated. > > > > > > Then agent_2 may see the changed "dev_gen" and flush entire local > > > cache in advance. > > > > > > This entire flush can degrade the performance: > > > "the local cache is getting empty and can't provide translation for > > > other valid (not being removed) MRs, and the translation has to look > > > up in the global cache, that is locked now for rebuilding, this > > > causes the delays in data path on acquiring global cache lock." > > > > > > Furthermore, spinlock is just for global cache, not for dev_gen and > > > local cache. > > > > > > > To summarize, in my opinion: > > > > - if you see some issue with ordering of global cache > > > > update/dev_gen signalling, > > > > could you, please, elaborate? I'm not sure we should maintain an > > > > order (due to spinlock protection) > > > > - the last rte_smp_wmb() after dev_gen incrementing should be kept > > > > intact > > > > > > > > > > At last, for my view, there are two functions that moving wmb before > > > "dev_gen" > > > for the write-write order: > > > -------------------------------- > > > a) rebuild global cache; > > > b) rte_smp_wmb(); > > > c) updating dev_gen > > > -------------------------------- > > > 1. Achieve synchronization between multiple threads in the right way 2. 
> > > Prevent other agents from flushing local cache early to ensure > > > performance > > > > > > Best Regards > > > Feifei > > > > > > > With best regards, > > > > Slava > > > > > > > > > -----Original Message----- > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > Sent: Thursday, May 6, 2021 5:52 > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > Wang > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > Region cache > > > > > > > > > > Hi, Slava > > > > > > > > > > Would you have more comments about this patch? > > > > > For my sight, only one wmb before "dev_gen" updating is enough > > > > > to synchronize. > > > > > > > > > > Thanks very much for your attention. > > > > > > > > > > > > > > > Best Regards > > > > > Feifei > > > > > > > > > > > -----邮件原件----- > > > > > > 发件人: Feifei Wang > > > > > > 发送时间: 2021年4月20日 16:42 > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > Ruifeng > > > > Wang > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > Region > > > > > > cache > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > I think the second wmb can be removed. > > > > > > As I know, wmb is just a barrier to keep the order between > > > > > > write and > > > > write. > > > > > > and it cannot tell the CPU when it should commit the changes. > > > > > > > > > > > > It is usually used before guard variable to keep the order > > > > > > that updating guard variable after some changes, which you > > > > > > want to release, > > > > > have been done. 
> > > > > > > > > > > > For example, for the wmb after global cache update/before > > > > > > altering dev_gen, it can ensure the order that updating global > > > > > > cache before altering > > > > > > dev_gen: > > > > > > 1)If other agent load the changed "dev_gen", it can know the > > > > > > global cache has been updated. > > > > > > 2)If other agents load the unchanged, "dev_gen", it means the > > > > > > global cache has not been updated, and the local cache will > > > > > > not be > > > flushed. > > > > > > > > > > > > As a result, we use wmb and guard variable "dev_gen" to > > > > > > ensure the global cache updating is "visible". > > > > > > The "visible" means when updating guard variable "dev_gen" is > > > > > > known by other agents, they also can confirm global cache has > > > > > > been updated in the meanwhile. Thus, just one wmb before > > > > > > altering dev_gen can ensure > > > > > this. > > > > > > > > > > > > Best Regards > > > > > > Feifei > > > > > > > > > > > > > -----邮件原件----- > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > 发送时间: 2021年4月20日 15:54 > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > Ruifeng > > > > > Wang > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > <nd@arm.com>; > > > > nd > > > > > > > <nd@arm.com> > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > > > Region cache > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > In my opinion, there should be 2 barriers: > > > > > > > - after global cache update/before altering dev_gen, to > > > > > > > ensure the correct order > > > > > > > - after altering dev_gen to make this change visible for > > > > > > > other agents and to trigger local cache update > > > > > > > > > > > > > > With best regards, > > > > > > > Slava > > > > > > > > > > > > 
> > > -----Original Message----- > > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > > Sent: Tuesday, April 20, 2021 10:30 > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > > > > Ruifeng > > > > > Wang > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > <nd@arm.com>; > > > > > nd > > > > > > > > <nd@arm.com> > > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > Another question suddenly occurred to me, in order to keep > > > > > > > > the order that rebuilding global cache before updating > > > > > > > > ”dev_gen“, the wmb should be before updating "dev_gen" > > > > > > > > rather > > than after it. > > > > > > > > Otherwise, in the out-of-order platforms, current order > > > > > > > > cannot be > > > > kept. 
> > > > > > > > > > > > > > > > Thus, we should change the code as: > > > > > > > > a) rebuild global cache; > > > > > > > > b) rte_smp_wmb(); > > > > > > > > c) updating dev_gen > > > > > > > > > > > > > > > > Best Regards > > > > > > > > Feifei > > > > > > > > > -----邮件原件----- > > > > > > > > > 发件人: Feifei Wang > > > > > > > > > 发送时间: 2021年4月20日 13:54 > > > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > > Azrad > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > Ruifeng > > > > > > > Wang > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > <nd@arm.com> > > > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > Memory > > > > > > Region > > > > > > > > > cache > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > Thanks very much for your explanation. > > > > > > > > > > > > > > > > > > I can understand the app can wait all mbufs are returned > > > > > > > > > to the memory pool, and then it can free this mbufs, I > > > > > > > > > agree with > > > this. > > > > > > > > > > > > > > > > > > As a result, I will remove the bug fix patch from this > > > > > > > > > series and just replace the smp barrier with C11 thread > > > > > > > > > fence. Thanks very much for your patient explanation again. 
> > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > > > 发送时间: 2021年4月20日 2:51 > > > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan > Azrad > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > Ruifeng > > > > > > > > Wang > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > Please, see below > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. Do > > > > > > > > > > > > we have some issue/bug with MR cache in practice? > > > > > > > > > > > > > > > > > > > > > > This patch fixes the bug which is based on logical > > > > > > > > > > > deduction, and it doesn't actually happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache > > > > > > > > > > > > for MRs to convert buffer address in mbufs being > > > > > > > > > > > > transmitted to LKeys (HW-related entity > > > > > > > > > > > > handle) and the "global" cache for all MR > > > > > > > > > > > > registered on the > > > > > device. 
> > > > > > > > > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > > > > > > - check the local queue cache flush request > > > > > > > > > > > > - lookup in local cache > > > > > > > > > > > > - if not found: > > > > > > > > > > > > - acquire lock for global cache read access > > > > > > > > > > > > - lookup in global cache > > > > > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > > > > > > > > > How cache update on memory freeing/unregistering > > > happens: > > > > > > > > > > > > - acquire lock for global cache write access > > > > > > > > > > > > - [a] remove relevant MRs from the global cache > > > > > > > > > > > > - [b] set local caches flush request > > > > > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps [a] > > > > > > > > > > > > and [b], and local caches flush is requested earlier. > > > > > > > > > > > > What problem does it > > > > > solve? > > > > > > > > > > > > It is not supposed there are in datapath some > > > > > > > > > > > > mbufs referencing to the memory being freed. > > > > > > > > > > > > Application must ensure this and must not allocate > > > > > > > > > > > > new mbufs from this memory regions > > > > > > > being freed. > > > > > > > > > > > > Hence, the lookups for these MRs in caches should > > > > > > > > > > > > not > > > occur. > > > > > > > > > > > > > > > > > > > > > > For your first point that, application can take > > > > > > > > > > > charge of preventing MR freed memory being allocated to > data path. > > > > > > > > > > > > > > > > > > > > > > Does it means that If there is an emergency of MR > > > > > > > > > > > fragment, such as hotplug, the application must > > > > > > > > > > > inform thedata path in advance, and this memory will > > > > > > > > > > > not be allocated, and then the control path will > > > > > > > > > > > free this memory? 
If application can do like this, > > > > > > > > > > > I agree that this bug > > > > > > > > > cannot happen. > > > > > > > > > > > > > > > > > > > > Actually, this is the only correct way for application to > operate. > > > > > > > > > > Let's suppose we have some memory area that > > > > > > > > > > application wants to > > > > > > > free. > > > > > > > > > > ALL references to this area must be removed. If we > > > > > > > > > > have some mbufs allocated from this area, it means > > > > > > > > > > that we have memory pool created > > > > > > > > there. > > > > > > > > > > > > > > > > > > > > What application should do: > > > > > > > > > > - notify all its components/agents the memory area is > > > > > > > > > > going to be freed > > > > > > > > > > - all components/agents free the mbufs they might own > > > > > > > > > > - PMD might not support freeing for some mbufs (for > > > > > > > > > > example being sent and awaiting for completion), so > > > > > > > > > > app should just wait > > > > > > > > > > - wait till all mbufs are returned to the memory pool > > > > > > > > > > (by monitoring available obj == pool size) > > > > > > > > > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. There > > > > > > > > > > are just some mbufs still allocated, it is regardless > > > > > > > > > > to buf address to MR translation. We just can't free > > > > > > > > > > the memory - the mapping will be destroyed and might > > > > > > > > > > cause the segmentation fault by SW or some HW issues > > > > > > > > > > on DMA access to unmapped memory. It is very generic > > > > > > > > > > safety approach - do not free the memory that is still > > > > > > > > > > in > > > > > use. > > > > > > > > > > Hence, at the moment of freeing and unregistering the > > > > > > > > > > MR, there MUST BE NO any > > > > > > > > > mbufs in flight referencing to the addresses being freed. > > > > > > > > > > No translation to MR being invalidated can happen. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For other side, the cache flush has negative > > > > > > > > > > > > effect > > > > > > > > > > > > - the local cache is getting empty and can't > > > > > > > > > > > > provide translation for other valid (not being > > > > > > > > > > > > removed) MRs, and the translation has to look up > > > > > > > > > > > > in the global cache, that is locked now for > > > > > > > > > > > > rebuilding, this causes the delays in datapatch > > > > > > > > > > > on acquiring global cache lock. > > > > > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > > > > > > > > > If above assumption is true, we can go to your second > point. > > > > > > > > > > > I think this is a problem of the tradeoff between > > > > > > > > > > > cache coherence and > > > > > > > > > > performance. > > > > > > > > > > > > > > > > > > > > > > I can understand your meaning that though global > > > > > > > > > > > cache has been changed, we should keep the valid MR > > > > > > > > > > > in local cache as long as possible to ensure the fast > searching speed. > > > > > > > > > > > In the meanwhile, the local cache can be rebuilt > > > > > > > > > > > later to reduce its waiting time for acquiring the global > cache lock. > > > > > > > > > > > > > > > > > > > > > > However, this mechanism just ensures the > > > > > > > > > > > performance unchanged for the first few mbufs. > > > > > > > > > > > During the next mbufs lkey searching after 'dev_gen' > > > > > > > > > > > updated, it is still necessary to update the local cache. > > > > > > > > > > > And the performance can firstly reduce and then returns. > > > > > > > > > > > Thus, no matter whether there is this patch or not, > > > > > > > > > > > the performance will jitter in a certain period of > > > > > > > > time. > > > > > > > > > > > > > > > > > > > > Local cache should be updated to remove MRs no longer > valid. 
> > > > > > > > > > But we just flush the entire cache. > > > > > > > > > > Let's suppose we have valid MR0, MR1, and not valid > > > > > > > > > > MRX in local > > > > > > cache. > > > > > > > > > > And there are traffic in the datapath for MR0 and MR1, > > > > > > > > > > and no traffic for MRX anymore. > > > > > > > > > > > > > > > > > > > > 1) If we do as you propose: > > > > > > > > > > a) take a lock > > > > > > > > > > b) request flush local cache first - all MR0, MR1, MRX > > > > > > > > > > will be removed on translation in datapath > > > > > > > > > > c) update global cache, > > > > > > > > > > d) free lock > > > > > > > > > > All the traffic for valid MR0, MR1 ALWAYS will be > > > > > > > > > > blocked on lock taken for cache update since point b) > > > > > > > > > > till point > > d). > > > > > > > > > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > > > > > a) take a lock > > > > > > > > > > b) update global cache > > > > > > > > > > c) request flush local cache > > > > > > > > > > d) free lock > > > > > > > > > > The traffic MIGHT be locked ONLY for MRs non-existing > > > > > > > > > > in local cache (not happens for MR0 and MR1, must not > > > > > > > > > > happen for MRX), and probability should be minor. And > > > > > > > > > > lock might happen since > > > > > > > > > > c) till > > > > > > > > > > d) > > > > > > > > > > - quite short period of time > > > > > > > > > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > > > > > > > > > Lock probability: > > > > > > > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > > > > > > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > > > > > > > > > > > > > > > > > Lock duration: > > > > > > > > > > - 1) lock since b) till d), > > > > > > > > > > 2) lock since c) till d), that seems to be much shorter. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the > > > > > > > > > > > bottom layer can do more things to ensure the > > > > > > > > > > > correct execution of the program, which may have a > > > > > > > > > > > negative impact on the performance in a short time, > > > > > > > > > > > but in the long run, the performance will eventually > > > > > > > come back. > > > > > > > > > > > Furthermore, maybe we should pay attention to the > > > > > > > > > > > performance in the stable period, and try our best > > > > > > > > > > > to ensure the correctness of the program in case of > > > > > > > > > > emergencies. > > > > > > > > > > > > > > > > > > > > If we have some mbufs still allocated in memory being > > > > > > > > > > freed > > > > > > > > > > - there is nothing to say about correctness, it is > > > > > > > > > > totally incorrect. In my opinion, we should not think > > > > > > > > > > how to mitigate this incorrect behavior, we should not > > > > > > > > > > encourage application developers to follow the wrong > > > > > > > > > approaches. > > > > > > > > > > > > > > > > > > > > With best regards, > > > > > > > > > > Slava > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > > > With best regards, Slava ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-05-11 8:18 ` [dpdk-dev] " Slava Ovsiienko @ 2021-05-12 5:34 ` Feifei Wang 2021-05-12 11:07 ` [dpdk-dev] " Slava Ovsiienko 0 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-05-12 5:34 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd

Hi, Slava

Please see below.

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: May 11, 2021 16:19
> To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
>
> Hi, Feifei
>
> Please, see below.
>
> > -----Original Message-----
> > From: Feifei Wang <Feifei.Wang2@arm.com>
> > Sent: Saturday, May 8, 2021 6:13
> > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad
> > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > Region cache
> >
> > Hi, Slava
> >
> > Thanks for your explanation.
> >
> > Thus we can ignore the ordering between updating the global cache and
> > updating dev_gen, due to the R/W lock.
> Yes, exactly.
>
> >
> > Furthermore, it is unnecessary to keep the wmb, and the last wmb (1d) I
> > think can be removed.
> > Two reasons for this:
> > 1. wmb has only one function: it keeps the write-write order for the
> > local thread. It cannot ensure that the write operations before it become
> > visible to other threads.
> >
> > 2. rwunlock (1e) has an atomic release operation in it; I think this
> > release operation serves the same purpose as the last wmb: keeping order.
>
> Mmmm...
In my understanding wmb ensures all memory writes are
> committed and visible to other agents. Without committing, some writes
> might be visible and others not, and we would get some inconsistent state. In
> other words - wmb here is rather for consistency, not for order.
>

If I understand correctly, your meaning is that without wmb, other agents may observe the changed "dev_gen" but still observe the unchanged "global" cache. This can be defined as a memory-inconsistent state.

Fig1
-----------------------------------------------------------------------------------------------------
Timeslot        agent_1                     agent_2
1               take_lock
2               update dev_gen
3                                           observe changed dev_gen
4                                           clear local cache
5               rebuild global cache        wait_lock
6               free_lock
7               wmb                         take_lock
8                                           get(new MR)
9                                           free_lock
-----------------------------------------------------------------------------------------------------

1. However, on an out-of-order platform, even with a 'wmb at last', 'dev_gen' may be updated before the global cache rebuild, and then other agents can observe the changed 'dev_gen' before the global cache is rebuilt. Thus, even with a 'wmb at last', it is still unable to prevent other agents from observing some inconsistent state. As a result, 'wmb at last' fails to maintain consistency.

2. On the other hand, due to the lock, agent_2 will wait to take the lock until the global cache has been rebuilt by agent_1, and this ensures agent_2 can get a correct new MR and update its local cache correctly.

In summary, 'wmb at last' cannot guarantee that other agents observe a consistent state, but the lock can fix this error. So the wmb at last is redundant and we can remove it.
Best Regards Feifei > With best regards, > Slava > > > > > > Best Regards > > Feifei > > > > > -----邮件原件----- > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > 发送时间: 2021年5月7日 18:15 > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > Wang > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > cache > > > > > > Hi, Feifei > > > > > > We should consider the locks in your scenario - it is crucial for > > > the complete model description: > > > > > > How agent_1 (in your terms) rebuilds global cache: > > > > > > 1a) lock() > > > 1b) rebuild(global cache) > > > 1c) update(dev_gen) > > > 1d) wmb() > > > 1e) unlock() > > > > > > How agent_2 checks: > > > > > > 2a) check(dev_gen) (assume positive - changed) > > > 2b) clear(local_cache) > > > 2c) miss(on empty local_cache) - > eventually it goes to > > > mr_lookup_caches() > > > 2d) lock() > > > 2e) get(new MR) > > > 2f) unlock() > > > 2g) update(local cache with obtained new MR) > > > > > > Hence, even if 1c) becomes visible in 2a) before 1b) committed (say, > > > due to out-of-order Arch) - the agent 2 would be blocked on 2d) and > > > scenario depicted on your Fig2 would not happen (agent_2 will wait > > > before step 3 till agent 1 unlocks after its step 5). 
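The lock/step model above (1a-1e for agent_1, 2a-2g for agent_2) can be sketched in C. This is an illustrative toy, not the mlx5 driver code: the MR cache is a fixed array, the R/W lock is modeled by a plain mutex, dev_gen is a C11 atomic, and all names (mr cache, lkey, slots) are hypothetical:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 4

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t global_cache[CACHE_SLOTS];   /* "global" MR cache */
static atomic_uint dev_gen;                  /* generation counter */

struct local_cache {
    unsigned int gen;            /* generation this cache was built at */
    uint32_t lkey[CACHE_SLOTS];
    int valid;
};

/* agent_1: 1a) lock, 1b) rebuild, 1c) update dev_gen, 1d) wmb, 1e) unlock */
void rebuild_global_cache(const uint32_t *new_lkeys)
{
    pthread_mutex_lock(&cache_lock);                        /* 1a */
    memcpy(global_cache, new_lkeys, sizeof(global_cache));  /* 1b */
    atomic_fetch_add_explicit(&dev_gen, 1,
                              memory_order_relaxed);        /* 1c */
    atomic_thread_fence(memory_order_release);              /* 1d: wmb */
    pthread_mutex_unlock(&cache_lock);                      /* 1e */
}

/* agent_2: 2a) check dev_gen, 2b) flush local, 2c-2g) miss -> global lookup */
uint32_t lookup_lkey(struct local_cache *lc, int slot)
{
    unsigned int gen = atomic_load_explicit(&dev_gen,
                                            memory_order_acquire);  /* 2a */
    if (!lc->valid || lc->gen != gen) {
        memset(lc->lkey, 0, sizeof(lc->lkey));            /* 2b */
        pthread_mutex_lock(&cache_lock);                  /* 2d */
        memcpy(lc->lkey, global_cache, sizeof(lc->lkey)); /* 2e */
        pthread_mutex_unlock(&cache_lock);                /* 2f */
        lc->gen = gen;                                    /* 2g */
        lc->valid = 1;
    }
    return lc->lkey[slot];
}
```

Because 2d) blocks until agent_1's 1e), a lookup that saw the new generation early still obtains the rebuilt cache, which is the point Slava makes above.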
> > > > > > With best regards, > > > Slava > > > > > > > -----Original Message----- > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > Sent: Friday, May 7, 2021 9:36> To: Slava Ovsiienko > > > > <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf > > > > Shuler <shahafs@nvidia.com> > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > Wang > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > Region cache > > > > > > > > Hi, Slava > > > > > > > > Thanks very much for your reply. > > > > > > > > > -----邮件原件----- > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > 发送时间: 2021年5月6日 19:22 > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > Wang > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > Region cache > > > > > > > > > > Hi, Feifei > > > > > > > > > > Sorry, I do not follow why we should get rid of the last (after > > > > > dev_gen update) wmb. > > > > > We've rebuilt the global cache, we should notify other agents > > > > > it's happened and they should flush local caches. So, dev_gen > > > > > change should be made visible to other agents to trigger this > > > > > activity and the second wmb is here to ensure this. > > > > > > > > 1. For the first problem why we should get rid of the last wmb and > > > > move it before dev_gen updated, I think our attention is how the > > > > wmb implements the synchronization between multiple agents. 
> > > > Fig1
> > > > -----------------------------------------------------------------------------------------
> > > > Timeslot        agent_1                        agent_2
> > > > 1               rebuild global cache
> > > > 2               wmb
> > > > 3               update dev_gen --------------> load changed dev_gen
> > > > 4                                              rebuild local cache
> > > > -----------------------------------------------------------------------------------------
> > > >
> > > > First, wmb is only for the local thread, to keep the order between
> > > > local write-write:
> > > > Based on the picture above, for agent_1, wmb keeps the order that
> > > > rebuilding the global cache always happens before updating dev_gen.
> > > >
> > > > Second, agent_1 communicates with agent_2 through the global variable
> > > > "dev_gen":
> > > > If agent_1 updates dev_gen, agent_2 will load it and then it knows
> > > > it should rebuild its local cache.
> > > >
> > > > Finally, agent_2 rebuilds its local cache according to whether agent_1
> > > > has rebuilt the global cache, and agent_2 learns this from the variable
> > > > "dev_gen".
> > > >
> > > > Fig2
> > > > -----------------------------------------------------------------------------------------
> > > > Timeslot        agent_1                        agent_2
> > > > 1               update dev_gen
> > > > 2                                              load changed dev_gen
> > > > 3                                              rebuild local cache
> > > > 4               rebuild global cache
> > > > 5               wmb
> > > > -----------------------------------------------------------------------------------------
> > > >
> > > > However, on the Arm platform, if the wmb comes after dev_gen is updated,
> > > > "dev_gen" may be updated before agent_1 rebuilds the global cache; then
> > > > agent_2 may receive the wrong message and rebuild its local cache in
> > > > advance.
> > > >
> > > > To summarize, it is not important at which time other agents can see
> > > > the changed global variable "dev_gen".
> > > > (Actually, wmb after "dev_gen" cannot ensure changed "dev_gen" is > > > > committed to the global). > > > > It is more important that if other agents see the changed > > > > "dev_gen", they also can know global cache has been updated. > > > > > > > > > One more point, due to registering new/destroying existing MR > > > > > involves FW (via kernel) calls, it takes so many CPU cycles that > > > > > we could neglect wmb overhead at all. > > > > > > > > We just move the last wmb into the right place, and not delete it > > > > for performance. > > > > > > > > > > > > > > Also, regarding this: > > > > > > > > > > > > Another question suddenly occurred to me, in order to keep > > > > > the > > > > > > > > > > > > order that rebuilding global cache before updating ”dev_gen“, > > > > > > the > > > > > > > wmb should be before updating "dev_gen" rather than after it. > > > > > > > Otherwise, in the out-of-order platforms, current order > > > > > cannot be > > > > kept. > > > > > > > > > > it is not clear why ordering is important - global cache update > > > > > and dev_gen change happen under spinlock protection, so only the > > > > > last wmb is meaningful. > > > > > > > > > > > > > 2. The second function of wmb before "dev_gen" updated is for > > > > performance according to our previous discussion. > > > > According to Fig2, if there is no wmb between "global cache updated" > > > > and "dev_gen updated", "dev_gen" may update before global cache > > > updated. > > > > > > > > Then agent_2 may see the changed "dev_gen" and flush entire local > > > > cache in advance. > > > > > > > > This entire flush can degrade the performance: > > > > "the local cache is getting empty and can't provide translation > > > > for other valid (not being removed) MRs, and the translation has > > > > to look up in the global cache, that is locked now for rebuilding, > > > > this causes the delays in data path on acquiring global cache lock." 
> > > > > > > > Furthermore, spinlock is just for global cache, not for dev_gen > > > > and local cache. > > > > > > > > > To summarize, in my opinion: > > > > > - if you see some issue with ordering of global cache > > > > > update/dev_gen signalling, > > > > > could you, please, elaborate? I'm not sure we should maintain > > > > > an order (due to spinlock protection) > > > > > - the last rte_smp_wmb() after dev_gen incrementing should be > > > > > kept intact > > > > > > > > > > > > > At last, for my view, there are two functions that moving wmb > > > > before "dev_gen" > > > > for the write-write order: > > > > -------------------------------- > > > > a) rebuild global cache; > > > > b) rte_smp_wmb(); > > > > c) updating dev_gen > > > > -------------------------------- > > > > 1. Achieve synchronization between multiple threads in the right way 2. > > > > Prevent other agents from flushing local cache early to ensure > > > > performance > > > > > > > > Best Regards > > > > Feifei > > > > > > > > > With best regards, > > > > > Slava > > > > > > > > > > > -----Original Message----- > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > Sent: Thursday, May 6, 2021 5:52 > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > Wang > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > Memory Region cache > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > Would you have more comments about this patch? > > > > > > For my sight, only one wmb before "dev_gen" updating is enough > > > > > > to synchronize. > > > > > > > > > > > > Thanks very much for your attention. 
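The writer-side order a)-c) proposed above can be sketched with a C11 release fence standing in for rte_smp_wmb() (in line with the C11-barrier patches in this series). The 4-entry cache and the function name are illustrative assumptions, not the mlx5 code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static uint32_t global_cache[4];
static atomic_uint dev_gen;

void publish_cache(const uint32_t *lkeys)
{
    for (int i = 0; i < 4; i++)
        global_cache[i] = lkeys[i];                /* a) rebuild global cache */
    atomic_thread_fence(memory_order_release);     /* b) wmb: order a) before c) */
    atomic_store_explicit(&dev_gen,
        atomic_load_explicit(&dev_gen, memory_order_relaxed) + 1,
        memory_order_relaxed);                     /* c) update dev_gen */
}
```

The release fence guarantees that, to an agent that observes the new dev_gen with acquire semantics, the cache stores in a) are visible no later than the store in c).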
> > > > > > > > > > > > > > > > > > Best Regards > > > > > > Feifei > > > > > > > > > > > > > -----邮件原件----- > > > > > > > 发件人: Feifei Wang > > > > > > > 发送时间: 2021年4月20日 16:42 > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > Ruifeng > > > > > Wang > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > Region > > > > > > > cache > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > I think the second wmb can be removed. > > > > > > > As I know, wmb is just a barrier to keep the order between > > > > > > > write and > > > > > write. > > > > > > > and it cannot tell the CPU when it should commit the changes. > > > > > > > > > > > > > > It is usually used before guard variable to keep the order > > > > > > > that updating guard variable after some changes, which you > > > > > > > want to release, > > > > > > have been done. > > > > > > > > > > > > > > For example, for the wmb after global cache update/before > > > > > > > altering dev_gen, it can ensure the order that updating > > > > > > > global cache before altering > > > > > > > dev_gen: > > > > > > > 1)If other agent load the changed "dev_gen", it can know the > > > > > > > global cache has been updated. > > > > > > > 2)If other agents load the unchanged, "dev_gen", it means > > > > > > > the global cache has not been updated, and the local cache > > > > > > > will not be > > > > flushed. > > > > > > > > > > > > > > As a result, we use wmb and guard variable "dev_gen" to > > > > > > > ensure the global cache updating is "visible". > > > > > > > The "visible" means when updating guard variable "dev_gen" > > > > > > > is known by other agents, they also can confirm global cache > > > > > > > has been updated in the meanwhile. 
Thus, just one wmb before > > > > > > > altering dev_gen can ensure > > > > > > this. > > > > > > > > > > > > > > Best Regards > > > > > > > Feifei > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > 发送时间: 2021年4月20日 15:54 > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > Ruifeng > > > > > > Wang > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > <nd@arm.com>; > > > > > nd > > > > > > > > <nd@arm.com> > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > In my opinion, there should be 2 barriers: > > > > > > > > - after global cache update/before altering dev_gen, to > > > > > > > > ensure the correct order > > > > > > > > - after altering dev_gen to make this change visible for > > > > > > > > other agents and to trigger local cache update > > > > > > > > > > > > > > > > With best regards, > > > > > > > > Slava > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > > > Sent: Tuesday, April 20, 2021 10:30 > > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > > > > > > > > > Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > > > > > Ruifeng > > > > > > Wang > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > <nd@arm.com>; > > > > > > nd > > > > > > > > > <nd@arm.com> > > > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug > > > > > > > > > for Memory Region cache > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > Another question 
suddenly occurred to me, in order to > > > > > > > > > keep the order that rebuilding global cache before > > > > > > > > > updating ”dev_gen“, the wmb should be before updating > "dev_gen" > > > > > > > > > rather > > > than after it. > > > > > > > > > Otherwise, in the out-of-order platforms, current order > > > > > > > > > cannot be > > > > > kept. > > > > > > > > > > > > > > > > > > Thus, we should change the code as: > > > > > > > > > a) rebuild global cache; > > > > > > > > > b) rte_smp_wmb(); > > > > > > > > > c) updating dev_gen > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > Feifei > > > > > > > > > > -----邮件原件----- > > > > > > > > > > 发件人: Feifei Wang > > > > > > > > > > 发送时间: 2021年4月20日 13:54 > > > > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > > > Azrad > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > Ruifeng > > > > > > > > Wang > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > <nd@arm.com> > > > > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > > Memory > > > > > > > Region > > > > > > > > > > cache > > > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > > > Thanks very much for your explanation. > > > > > > > > > > > > > > > > > > > > I can understand the app can wait all mbufs are > > > > > > > > > > returned to the memory pool, and then it can free this > > > > > > > > > > mbufs, I agree with > > > > this. > > > > > > > > > > > > > > > > > > > > As a result, I will remove the bug fix patch from this > > > > > > > > > > series and just replace the smp barrier with C11 > > > > > > > > > > thread fence. Thanks very much for your patient explanation > again. 
> > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > > > > 发送时间: 2021年4月20日 2:51 > > > > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan > > Azrad > > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; > stable@dpdk.org; > > > > > > Ruifeng > > > > > > > > > Wang > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > Please, see below > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. Do > > > > > > > > > > > > > we have some issue/bug with MR cache in practice? > > > > > > > > > > > > > > > > > > > > > > > > This patch fixes the bug which is based on logical > > > > > > > > > > > > deduction, and it doesn't actually happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" > > > > > > > > > > > > > cache for MRs to convert buffer address in mbufs > > > > > > > > > > > > > being transmitted to LKeys (HW-related entity > > > > > > > > > > > > > handle) and the "global" cache for all MR > > > > > > > > > > > > > registered on the > > > > > > device. 
> > > > > > > > > > > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > > > > > > > - check the local queue cache flush request > > > > > > > > > > > > > - lookup in local cache > > > > > > > > > > > > > - if not found: > > > > > > > > > > > > > - acquire lock for global cache read access > > > > > > > > > > > > > - lookup in global cache > > > > > > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > > > > > > > > > > > How cache update on memory freeing/unregistering > > > > happens: > > > > > > > > > > > > > - acquire lock for global cache write access > > > > > > > > > > > > > - [a] remove relevant MRs from the global cache > > > > > > > > > > > > > - [b] set local caches flush request > > > > > > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps [a] > > > > > > > > > > > > > and [b], and local caches flush is requested earlier. > > > > > > > > > > > > > What problem does it > > > > > > solve? > > > > > > > > > > > > > It is not supposed there are in datapath some > > > > > > > > > > > > > mbufs referencing to the memory being freed. > > > > > > > > > > > > > Application must ensure this and must not > > > > > > > > > > > > > allocate new mbufs from this memory regions > > > > > > > > being freed. > > > > > > > > > > > > > Hence, the lookups for these MRs in caches > > > > > > > > > > > > > should not > > > > occur. > > > > > > > > > > > > > > > > > > > > > > > > For your first point that, application can take > > > > > > > > > > > > charge of preventing MR freed memory being > > > > > > > > > > > > allocated to > > data path. 
> > > > > > > > > > > > > > > > > > > > > > > > Does it means that If there is an emergency of MR > > > > > > > > > > > > fragment, such as hotplug, the application must > > > > > > > > > > > > inform thedata path in advance, and this memory > > > > > > > > > > > > will not be allocated, and then the control path > > > > > > > > > > > > will free this memory? If application can do like > > > > > > > > > > > > this, I agree that this bug > > > > > > > > > > cannot happen. > > > > > > > > > > > > > > > > > > > > > > Actually, this is the only correct way for > > > > > > > > > > > application to > > operate. > > > > > > > > > > > Let's suppose we have some memory area that > > > > > > > > > > > application wants to > > > > > > > > free. > > > > > > > > > > > ALL references to this area must be removed. If we > > > > > > > > > > > have some mbufs allocated from this area, it means > > > > > > > > > > > that we have memory pool created > > > > > > > > > there. > > > > > > > > > > > > > > > > > > > > > > What application should do: > > > > > > > > > > > - notify all its components/agents the memory area > > > > > > > > > > > is going to be freed > > > > > > > > > > > - all components/agents free the mbufs they might > > > > > > > > > > > own > > > > > > > > > > > - PMD might not support freeing for some mbufs (for > > > > > > > > > > > example being sent and awaiting for completion), so > > > > > > > > > > > app should just wait > > > > > > > > > > > - wait till all mbufs are returned to the memory > > > > > > > > > > > pool (by monitoring available obj == pool size) > > > > > > > > > > > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. > > > > > > > > > > > There are just some mbufs still allocated, it is > > > > > > > > > > > regardless to buf address to MR translation. 
We just > > > > > > > > > > > can't free the memory - the mapping will be > > > > > > > > > > > destroyed and might cause the segmentation fault by > > > > > > > > > > > SW or some HW issues on DMA access to unmapped > > > > > > > > > > > memory. It is very generic safety approach - do not > > > > > > > > > > > free the memory that is still in > > > > > > use. > > > > > > > > > > > Hence, at the moment of freeing and unregistering > > > > > > > > > > > the MR, there MUST BE NO any > > > > > > > > > > mbufs in flight referencing to the addresses being freed. > > > > > > > > > > > No translation to MR being invalidated can happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For other side, the cache flush has negative > > > > > > > > > > > > > effect > > > > > > > > > > > > > - the local cache is getting empty and can't > > > > > > > > > > > > > provide translation for other valid (not being > > > > > > > > > > > > > removed) MRs, and the translation has to look up > > > > > > > > > > > > > in the global cache, that is locked now for > > > > > > > > > > > > > rebuilding, this causes the delays in datapatch > > > > > > > > > > > > on acquiring global cache lock. > > > > > > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > > > > > > > > > > > If above assumption is true, we can go to your > > > > > > > > > > > > second > > point. > > > > > > > > > > > > I think this is a problem of the tradeoff between > > > > > > > > > > > > cache coherence and > > > > > > > > > > > performance. > > > > > > > > > > > > > > > > > > > > > > > > I can understand your meaning that though global > > > > > > > > > > > > cache has been changed, we should keep the valid > > > > > > > > > > > > MR in local cache as long as possible to ensure > > > > > > > > > > > > the fast > > searching speed. 
> > > > > > > > > > > > In the meanwhile, the local cache can be rebuilt > > > > > > > > > > > > later to reduce its waiting time for acquiring the > > > > > > > > > > > > global > > cache lock. > > > > > > > > > > > > > > > > > > > > > > > > However, this mechanism just ensures the > > > > > > > > > > > > performance unchanged for the first few mbufs. > > > > > > > > > > > > During the next mbufs lkey searching after 'dev_gen' > > > > > > > > > > > > updated, it is still necessary to update the local cache. > > > > > > > > > > > > And the performance can firstly reduce and then returns. > > > > > > > > > > > > Thus, no matter whether there is this patch or > > > > > > > > > > > > not, the performance will jitter in a certain > > > > > > > > > > > > period of > > > > > > > > > time. > > > > > > > > > > > > > > > > > > > > > > Local cache should be updated to remove MRs no > > > > > > > > > > > longer > > valid. > > > > > > > > > > > But we just flush the entire cache. > > > > > > > > > > > Let's suppose we have valid MR0, MR1, and not valid > > > > > > > > > > > MRX in local > > > > > > > cache. > > > > > > > > > > > And there are traffic in the datapath for MR0 and > > > > > > > > > > > MR1, and no traffic for MRX anymore. > > > > > > > > > > > > > > > > > > > > > > 1) If we do as you propose: > > > > > > > > > > > a) take a lock > > > > > > > > > > > b) request flush local cache first - all MR0, MR1, > > > > > > > > > > > MRX will be removed on translation in datapath > > > > > > > > > > > c) update global cache, > > > > > > > > > > > d) free lock > > > > > > > > > > > All the traffic for valid MR0, MR1 ALWAYS will be > > > > > > > > > > > blocked on lock taken for cache update since point > > > > > > > > > > > b) till point > > > d). 
> > > > > > > > > > > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > > > > > > a) take a lock > > > > > > > > > > > b) update global cache > > > > > > > > > > > c) request flush local cache > > > > > > > > > > > d) free lock > > > > > > > > > > > The traffic MIGHT be locked ONLY for MRs > > > > > > > > > > > non-existing in local cache (not happens for MR0 and > > > > > > > > > > > MR1, must not happen for MRX), and probability > > > > > > > > > > > should be minor. And lock might happen since > > > > > > > > > > > c) till > > > > > > > > > > > d) > > > > > > > > > > > - quite short period of time > > > > > > > > > > > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > > > > > > > > > > > Lock probability: > > > > > > > > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > > > > > > > > 2) lock MIGHT happen, for cache miss ONLY, after > > > > > > > > > > > c) > > > > > > > > > > > > > > > > > > > > > > Lock duration: > > > > > > > > > > > - 1) lock since b) till d), > > > > > > > > > > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the > > > > > > > > > > > > bottom layer can do more things to ensure the > > > > > > > > > > > > correct execution of the program, which may have a > > > > > > > > > > > > negative impact on the performance in a short > > > > > > > > > > > > time, but in the long run, the performance will > > > > > > > > > > > > eventually > > > > > > > > come back. > > > > > > > > > > > > Furthermore, maybe we should pay attention to the > > > > > > > > > > > > performance in the stable period, and try our best > > > > > > > > > > > > to ensure the correctness of the program in case > > > > > > > > > > > > of > > > > > > > > > > > emergencies. 
> > > > > > > > > > > > > > > > > > > > > > If we have some mbufs still allocated in memory > > > > > > > > > > > being freed > > > > > > > > > > > - there is nothing to say about correctness, it is > > > > > > > > > > > totally incorrect. In my opinion, we should not > > > > > > > > > > > think how to mitigate this incorrect behavior, we > > > > > > > > > > > should not encourage application developers to > > > > > > > > > > > follow the wrong > > > > > > > > > > approaches. > > > > > > > > > > > > > > > > > > > > > > With best regards, > > > > > > > > > > > Slava > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > > > > > With best regards, Slava ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-05-12 5:34 ` [dpdk-dev] Re: " Feifei Wang @ 2021-05-12 11:07 ` Slava Ovsiienko 2021-05-13 5:49 ` [dpdk-dev] Re: " Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Slava Ovsiienko @ 2021-05-12 11:07 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler; +Cc: dev, nd, stable, Ruifeng Wang, nd

Hi, Feifei

..snip..

> If I understand correctly, your meaning is that without wmb, other agents
> may observe the changed "dev_gen", but they also observe the unchanged
> "global" cache.
> This can be defined as a memory-inconsistent state.
>
> Fig1
> -----------------------------------------------------------------------------------------
> Timeslot        agent_1                     agent_2
> 1               take_lock
> 2               update dev_gen
> 3                                           observe changed dev_gen
> 4                                           clear local cache
> 5               rebuild global cache        wait_lock
> 6               free_lock
> 7               wmb                         take_lock
> 8                                           get(new MR)
> 9                                           free_lock
> -----------------------------------------------------------------------------------------

Yes, something like that.

> 1. However, on an out-of-order platform, even with a 'wmb at last',
> 'dev_gen' may be updated before the global cache rebuild, and then other
> agents can observe the changed 'dev_gen'
> before the global cache is rebuilt.
>
> Thus, even with a 'wmb at last', it is still unable to prevent other agents
> from observing some inconsistent state. As a result, 'wmb at last' fails to
> maintain consistency.
>
> 2. On the other hand, due to the lock, agent_2 will wait to take the lock
> until the global cache has been rebuilt by agent_1, and this ensures agent_2
> can get a correct new MR and update its local cache correctly.
>
> In summary, 'wmb at last' cannot guarantee that other agents observe a
> consistent state.
> But the lock can fix this error. So the wmb at last is redundant and
> we can remove it.
If the dev_gen change is committed and the cache's change is not yet - agent_2 might see an inconsistent state even inside the lock-protected section. Hence, we must commit all writes before leaving the locked section in agent_1.

Let’s suppose there is no wmb in agent_1 at all, and dev_gen is arbitrarily committed by the CPU while the MR cache data change is not. We leave the locked section in agent_1, agent_2 sees dev_gen changed, takes the lock and sees an inconsistent MR-cache state because not all the changes made in agent_1 have been committed. With the wmb we have now in the existing code, there is no such issue.

With best regards,
Slava

> > Best Regards
> > Feifei
>
> > > With best regards,
> > > Slava
>
> > > > Best Regards
> > > > Feifei

> > > > -----Original Message-----
> > > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > > Sent: May 7, 2021 18:15
> > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > >
> > > > Hi, Feifei
> > > >
> > > > We should consider the locks in your scenario - it is crucial for
> > > > the complete model description:
> > > >
> > > > How agent_1 (in your terms) rebuilds global cache:
> > > >
> > > > 1a) lock()
> > > > 1b) rebuild(global cache)
> > > > 1c) update(dev_gen)
> > > > 1d) wmb()
> > > > 1e) unlock()
> > > >
> > > > How agent_2 checks:
> > > >
> > > > 2a) check(dev_gen) (assume positive - changed)
> > > > 2b) clear(local_cache)
> > > > 2c) miss(on empty local_cache) -> eventually it goes to mr_lookup_caches()
> > > > 2d) lock()
> > > > 2e) get(new MR)
> > > > 2f) unlock()
> > > > 2g) update(local cache with obtained new MR)
> > > >
> > > > Hence, even if 1c) becomes visible in 2a) before 1b) committed
> > > > (say, due to an out-of-order Arch) - the agent_2 would
be blocked on > > > > 2d) and scenario depicted on your Fig2 would not happen (agent_2 > > > > will wait before step 3 till agent 1 unlocks after its step 5). > > > > > > > > With best regards, > > > > Slava > > > > > > > > > -----Original Message----- > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > Sent: Friday, May 7, 2021 9:36> To: Slava Ovsiienko > > > > > <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; > Shahaf > > > > > Shuler <shahafs@nvidia.com> > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > Wang > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > Region cache > > > > > > > > > > Hi, Slava > > > > > > > > > > Thanks very much for your reply. > > > > > > > > > > > -----邮件原件----- > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > 发送时间: 2021年5月6日 19:22 > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > Ruifeng > > > > Wang > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > > Region cache > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > Sorry, I do not follow why we should get rid of the last > > > > > > (after dev_gen update) wmb. > > > > > > We've rebuilt the global cache, we should notify other agents > > > > > > it's happened and they should flush local caches. So, dev_gen > > > > > > change should be made visible to other agents to trigger this > > > > > > activity and the second wmb is here to ensure this. > > > > > > > > > > 1. For the first problem why we should get rid of the last wmb > > > > > and move it before dev_gen updated, I think our attention is how > > > > > the wmb implements the synchronization between multiple agents. 
> > > > > Fig1 > > > > > ---------------------------------------------------------------- > > > > > -- > > > > > -- > > > > > -- > > > > > ------------------------ > > > > > ------- > > > > > Timeslot agent_1 agent_2 > > > > > 1 rebuild global cache > > > > > 2 wmb > > > > > 3 update dev_gen ----------------------- load changed > > > > > dev_gen > > > > > 4 rebuild local > cache > > > > > ---------------------------------------------------------------- > > > > > -- > > > > > -- > > > > > -- > > > > > ------------------------ > > > > > ------- > > > > > > > > > > First, wmb is only for local thread to keep the order between > > > > > local > > > > > write- write : > > > > > Based on the picture above, for agent_1, wmb keeps the order > > > > > that rebuilding global cache is always before updating dev_gen. > > > > > > > > > > Second, agent_1 communicates with agent_2 by the global variable > > > > > "dev_gen" : > > > > > If agent_1 updates dev_gen, agent_2 will load it and then it > > > > > knows it should rebuild local cache > > > > > > > > > > Finally, agent_2 rebuilds local cache according to whether > > > > > agent_1 has rebuilt global cache, and agent_2 knows this > > > > > information by the variable > > > > "dev_gen". 
> > > > > Fig2 > > > > > ---------------------------------------------------------------- > > > > > -- > > > > > -- > > > > > -- > > > > > ------------------------ > > > > > ------- > > > > > Timeslot agent_1 agent_2 > > > > > 1 update dev_gen > > > > > 2 load changed > dev_gen > > > > > 3 rebuild local > cache > > > > > 4 rebuild global cache > > > > > 5 wmb > > > > > ---------------------------------------------------------------- > > > > > -- > > > > > -- > > > > > -- > > > > > ------------------------ > > > > > ------- > > > > > > > > > > However, in arm platform, if wmb is after dev_gen updated, > "dev_gen" > > > > > may be updated before agent_1 rebuilding global cache, then > > > > > agent_2 maybe receive error message and rebuild its local cache > > > > > in > > advance. > > > > > > > > > > To summarize, it is not important which time other agents can > > > > > see the changed global variable "dev_gen". > > > > > (Actually, wmb after "dev_gen" cannot ensure changed "dev_gen" > > > > > is committed to the global). > > > > > It is more important that if other agents see the changed > > > > > "dev_gen", they also can know global cache has been updated. > > > > > > > > > > > One more point, due to registering new/destroying existing MR > > > > > > involves FW (via kernel) calls, it takes so many CPU cycles > > > > > > that we could neglect wmb overhead at all. > > > > > > > > > > We just move the last wmb into the right place, and not delete > > > > > it for performance. > > > > > > > > > > > > > > > > > Also, regarding this: > > > > > > > > > > > > > > Another question suddenly occurred to me, in order to > > > > > > keep the > > > > > > > > > > > > > > order that rebuilding global cache before updating > > > > > > > ”dev_gen“, the > > > > > > > > wmb should be before updating "dev_gen" rather than after it. > > > > > > > > Otherwise, in the out-of-order platforms, current order > > > > > > cannot be > > > > > kept. 
> > > > > > > > > > > > it is not clear why ordering is important - global cache > > > > > > update and dev_gen change happen under spinlock protection, so > > > > > > only the last wmb is meaningful. > > > > > > > > > > > > > > > > 2. The second function of wmb before "dev_gen" updated is for > > > > > performance according to our previous discussion. > > > > > According to Fig2, if there is no wmb between "global cache updated" > > > > > and "dev_gen updated", "dev_gen" may update before global cache > > > > updated. > > > > > > > > > > Then agent_2 may see the changed "dev_gen" and flush entire > > > > > local cache in advance. > > > > > > > > > > This entire flush can degrade the performance: > > > > > "the local cache is getting empty and can't provide translation > > > > > for other valid (not being removed) MRs, and the translation has > > > > > to look up in the global cache, that is locked now for > > > > > rebuilding, this causes the delays in data path on acquiring global > cache lock." > > > > > > > > > > Furthermore, spinlock is just for global cache, not for dev_gen > > > > > and local cache. > > > > > > > > > > > To summarize, in my opinion: > > > > > > - if you see some issue with ordering of global cache > > > > > > update/dev_gen signalling, > > > > > > could you, please, elaborate? I'm not sure we should > > > > > > maintain an order (due to spinlock protection) > > > > > > - the last rte_smp_wmb() after dev_gen incrementing should be > > > > > > kept intact > > > > > > > > > > > > > > > > At last, for my view, there are two functions that moving wmb > > > > > before "dev_gen" > > > > > for the write-write order: > > > > > -------------------------------- > > > > > a) rebuild global cache; > > > > > b) rte_smp_wmb(); > > > > > c) updating dev_gen > > > > > -------------------------------- 1. Achieve synchronization > > > > > between multiple threads in the right way 2. 
> > > > > Prevent other agents from flushing local cache early to ensure > > > > > performance > > > > > > > > > > Best Regards > > > > > Feifei > > > > > > > > > > > With best regards, > > > > > > Slava > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > Sent: Thursday, May 6, 2021 5:52 > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > > Wang > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > Memory Region cache > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > Would you have more comments about this patch? > > > > > > > For my sight, only one wmb before "dev_gen" updating is > > > > > > > enough to synchronize. > > > > > > > > > > > > > > Thanks very much for your attention. > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > Feifei > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > 发件人: Feifei Wang > > > > > > > > 发送时间: 2021年4月20日 16:42 > > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > Azrad > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > Ruifeng > > > > > > Wang > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > Memory > > > > > Region > > > > > > > > cache > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > I think the second wmb can be removed. > > > > > > > > As I know, wmb is just a barrier to keep the order between > > > > > > > > write and > > > > > > write. > > > > > > > > and it cannot tell the CPU when it should commit the changes. 
> > > > > > > > > > > > > > > > It is usually used before guard variable to keep the order > > > > > > > > that updating guard variable after some changes, which you > > > > > > > > want to release, > > > > > > > have been done. > > > > > > > > > > > > > > > > For example, for the wmb after global cache update/before > > > > > > > > altering dev_gen, it can ensure the order that updating > > > > > > > > global cache before altering > > > > > > > > dev_gen: > > > > > > > > 1)If other agent load the changed "dev_gen", it can know > > > > > > > > the global cache has been updated. > > > > > > > > 2)If other agents load the unchanged, "dev_gen", it means > > > > > > > > the global cache has not been updated, and the local cache > > > > > > > > will not be > > > > > flushed. > > > > > > > > > > > > > > > > As a result, we use wmb and guard variable "dev_gen" to > > > > > > > > ensure the global cache updating is "visible". > > > > > > > > The "visible" means when updating guard variable "dev_gen" > > > > > > > > is known by other agents, they also can confirm global > > > > > > > > cache has been updated in the meanwhile. Thus, just one > > > > > > > > wmb before altering dev_gen can ensure > > > > > > > this. 
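[Editor's illustration] The guard-variable pattern described in the quoted mail maps directly onto C11 release/acquire ordering. A minimal sketch (variable and function names are illustrative, not the driver's API): the release fence plays the role of the wmb before altering dev_gen, and the two reader outcomes correspond to cases 1) and 2) above.

```c
#include <stdatomic.h>

static int global_cache;                 /* data being published        */
static _Atomic unsigned dev_gen;         /* guard variable              */

/* Writer: a) rebuild, b) wmb, c) bump the guard. The release fence
 * guarantees the cache write is committed before the dev_gen store can
 * become visible to other agents. */
static void publish(int new_cache)
{
    global_cache = new_cache;                        /* a) rebuild      */
    atomic_thread_fence(memory_order_release);       /* b) wmb          */
    atomic_fetch_add_explicit(&dev_gen, 1,
                              memory_order_relaxed); /* c) update guard */
}

/* Reader: if the guard is seen changed, the acquire load guarantees the
 * new cache contents are also visible; if the guard is unchanged, the
 * local cache is simply not flushed. */
static int check(unsigned *last_gen, int *out)
{
    unsigned g = atomic_load_explicit(&dev_gen, memory_order_acquire);

    if (g == *last_gen)
        return 0;               /* guard unchanged: keep local cache    */
    *last_gen = g;
    *out = global_cache;        /* guard changed: updated cache visible */
    return 1;
}
```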
> > > > > > > > > > > > > > > > Best Regards > > > > > > > > Feifei > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > > 发送时间: 2021年4月20日 15:54 > > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > Ruifeng > > > > > > > Wang > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > <nd@arm.com>; > > > > > > nd > > > > > > > > > <nd@arm.com> > > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > In my opinion, there should be 2 barriers: > > > > > > > > > - after global cache update/before altering dev_gen, to > > > > > > > > > ensure the correct order > > > > > > > > > - after altering dev_gen to make this change visible > > > > > > > > > for other agents and to trigger local cache update > > > > > > > > > > > > > > > > > > With best regards, > > > > > > > > > Slava > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > > > > Sent: Tuesday, April 20, 2021 10:30 > > > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > > > > > > > > > > Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > > > > > > Ruifeng > > > > > > > Wang > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > > <nd@arm.com>; > > > > > > > nd > > > > > > > > > > <nd@arm.com> > > > > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug > > > > > > > > > > for Memory Region cache > > > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > > > Another 
question suddenly occurred to me, in order to > > > > > > > > > > keep the order that rebuilding global cache before > > > > > > > > > > updating ”dev_gen“, the wmb should be before updating > > "dev_gen" > > > > > > > > > > rather > > > > than after it. > > > > > > > > > > Otherwise, in the out-of-order platforms, current > > > > > > > > > > order cannot be > > > > > > kept. > > > > > > > > > > > > > > > > > > > > Thus, we should change the code as: > > > > > > > > > > a) rebuild global cache; > > > > > > > > > > b) rte_smp_wmb(); > > > > > > > > > > c) updating dev_gen > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > Feifei > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > > 发件人: Feifei Wang > > > > > > > > > > > 发送时间: 2021年4月20日 13:54 > > > > > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; > Matan > > > > Azrad > > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; > stable@dpdk.org; > > > > > > Ruifeng > > > > > > > > > Wang > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > > <nd@arm.com> > > > > > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > > > Memory > > > > > > > > Region > > > > > > > > > > > cache > > > > > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > > > > > Thanks very much for your explanation. > > > > > > > > > > > > > > > > > > > > > > I can understand the app can wait all mbufs are > > > > > > > > > > > returned to the memory pool, and then it can free > > > > > > > > > > > this mbufs, I agree with > > > > > this. > > > > > > > > > > > > > > > > > > > > > > As a result, I will remove the bug fix patch from > > > > > > > > > > > this series and just replace the smp barrier with > > > > > > > > > > > C11 thread fence. Thanks very much for your patient > > > > > > > > > > > explanation > > again. 
> > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > > > > > 发送时间: 2021年4月20日 2:51 > > > > > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan > > > Azrad > > > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; > > stable@dpdk.org; > > > > > > > Ruifeng > > > > > > > > > > Wang > > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug > > > > > > > > > > > > for Memory Region cache > > > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > > > Please, see below > > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. > > > > > > > > > > > > > > Do we have some issue/bug with MR cache in > practice? > > > > > > > > > > > > > > > > > > > > > > > > > > This patch fixes the bug which is based on > > > > > > > > > > > > > logical deduction, and it doesn't actually happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" > > > > > > > > > > > > > > cache for MRs to convert buffer address in > > > > > > > > > > > > > > mbufs being transmitted to LKeys (HW-related > > > > > > > > > > > > > > entity > > > > > > > > > > > > > > handle) and the "global" cache for all MR > > > > > > > > > > > > > > registered on the > > > > > > > device. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > > > > > > > > - check the local queue cache flush request > > > > > > > > > > > > > > - lookup in local cache > > > > > > > > > > > > > > - if not found: > > > > > > > > > > > > > > - acquire lock for global cache read access > > > > > > > > > > > > > > - lookup in global cache > > > > > > > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > > > > > > > > > > > > > How cache update on memory > > > > > > > > > > > > > > freeing/unregistering > > > > > happens: > > > > > > > > > > > > > > - acquire lock for global cache write access > > > > > > > > > > > > > > - [a] remove relevant MRs from the global > > > > > > > > > > > > > > cache > > > > > > > > > > > > > > - [b] set local caches flush request > > > > > > > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps > > > > > > > > > > > > > > [a] and [b], and local caches flush is requested earlier. > > > > > > > > > > > > > > What problem does it > > > > > > > solve? > > > > > > > > > > > > > > It is not supposed there are in datapath some > > > > > > > > > > > > > > mbufs referencing to the memory being freed. > > > > > > > > > > > > > > Application must ensure this and must not > > > > > > > > > > > > > > allocate new mbufs from this memory regions > > > > > > > > > being freed. > > > > > > > > > > > > > > Hence, the lookups for these MRs in caches > > > > > > > > > > > > > > should not > > > > > occur. > > > > > > > > > > > > > > > > > > > > > > > > > > For your first point that, application can take > > > > > > > > > > > > > charge of preventing MR freed memory being > > > > > > > > > > > > > allocated to > > > data path. 
> > > > > > > > > > > > > > > > > > > > > > > > > > Does it means that If there is an emergency of > > > > > > > > > > > > > MR fragment, such as hotplug, the application > > > > > > > > > > > > > must inform thedata path in advance, and this > > > > > > > > > > > > > memory will not be allocated, and then the > > > > > > > > > > > > > control path will free this memory? If > > > > > > > > > > > > > application can do like this, I agree that this > > > > > > > > > > > > > bug > > > > > > > > > > > cannot happen. > > > > > > > > > > > > > > > > > > > > > > > > Actually, this is the only correct way for > > > > > > > > > > > > application to > > > operate. > > > > > > > > > > > > Let's suppose we have some memory area that > > > > > > > > > > > > application wants to > > > > > > > > > free. > > > > > > > > > > > > ALL references to this area must be removed. If we > > > > > > > > > > > > have some mbufs allocated from this area, it means > > > > > > > > > > > > that we have memory pool created > > > > > > > > > > there. > > > > > > > > > > > > > > > > > > > > > > > > What application should do: > > > > > > > > > > > > - notify all its components/agents the memory area > > > > > > > > > > > > is going to be freed > > > > > > > > > > > > - all components/agents free the mbufs they might > > > > > > > > > > > > own > > > > > > > > > > > > - PMD might not support freeing for some mbufs > > > > > > > > > > > > (for example being sent and awaiting for > > > > > > > > > > > > completion), so app should just wait > > > > > > > > > > > > - wait till all mbufs are returned to the memory > > > > > > > > > > > > pool (by monitoring available obj == pool size) > > > > > > > > > > > > > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. > > > > > > > > > > > > There are just some mbufs still allocated, it is > > > > > > > > > > > > regardless to buf address to MR translation. 
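[Editor's illustration] The final "wait till all mbufs are returned" step amounts to polling the pool's available-object count until it equals the pool size; in DPDK this would compare rte_mempool_avail_count() against the pool's size. A sketch with a toy pool structure standing in for a real rte_mempool:

```c
#include <stdatomic.h>
#include <sched.h>

/* Toy stand-in for a mempool: 'size' objects total, 'avail' objects
 * currently sitting in the pool (i.e. not allocated as mbufs). */
struct toy_pool {
    unsigned size;
    _Atomic unsigned avail;
};

/* True once every mbuf has been returned to the pool. */
static int pool_drained(struct toy_pool *p)
{
    return atomic_load(&p->avail) == p->size;
}

/* App-side wait loop, run before freeing the memory area backing the
 * pool and unregistering its MRs. */
static void wait_pool_drained(struct toy_pool *p)
{
    while (!pool_drained(p))
        sched_yield();      /* let other agents return their mbufs */
}
```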
We > > > > > > > > > > > > just can't free the memory - the mapping will be > > > > > > > > > > > > destroyed and might cause the segmentation fault > > > > > > > > > > > > by SW or some HW issues on DMA access to unmapped > > > > > > > > > > > > memory. It is very generic safety approach - do > > > > > > > > > > > > not free the memory that is still in > > > > > > > use. > > > > > > > > > > > > Hence, at the moment of freeing and unregistering > > > > > > > > > > > > the MR, there MUST BE NO any > > > > > > > > > > > mbufs in flight referencing to the addresses being freed. > > > > > > > > > > > > No translation to MR being invalidated can happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For other side, the cache flush has negative > > > > > > > > > > > > > > effect > > > > > > > > > > > > > > - the local cache is getting empty and can't > > > > > > > > > > > > > > provide translation for other valid (not being > > > > > > > > > > > > > > removed) MRs, and the translation has to look > > > > > > > > > > > > > > up in the global cache, that is locked now for > > > > > > > > > > > > > > rebuilding, this causes the delays in > > > > > > > > > > > > > > datapatch > > > > > > > > > > > > > on acquiring global cache lock. > > > > > > > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > > > > > > > > > > > > > If above assumption is true, we can go to your > > > > > > > > > > > > > second > > > point. > > > > > > > > > > > > > I think this is a problem of the tradeoff > > > > > > > > > > > > > between cache coherence and > > > > > > > > > > > > performance. > > > > > > > > > > > > > > > > > > > > > > > > > > I can understand your meaning that though global > > > > > > > > > > > > > cache has been changed, we should keep the valid > > > > > > > > > > > > > MR in local cache as long as possible to ensure > > > > > > > > > > > > > the fast > > > searching speed. 
> > > > > > > > > > > > > In the meanwhile, the local cache can be rebuilt > > > > > > > > > > > > > later to reduce its waiting time for acquiring > > > > > > > > > > > > > the global > > > cache lock. > > > > > > > > > > > > > > > > > > > > > > > > > > However, this mechanism just ensures the > > > > > > > > > > > > > performance unchanged for the first few mbufs. > > > > > > > > > > > > > During the next mbufs lkey searching after 'dev_gen' > > > > > > > > > > > > > updated, it is still necessary to update the local cache. > > > > > > > > > > > > > And the performance can firstly reduce and then > returns. > > > > > > > > > > > > > Thus, no matter whether there is this patch or > > > > > > > > > > > > > not, the performance will jitter in a certain > > > > > > > > > > > > > period of > > > > > > > > > > time. > > > > > > > > > > > > > > > > > > > > > > > > Local cache should be updated to remove MRs no > > > > > > > > > > > > longer > > > valid. > > > > > > > > > > > > But we just flush the entire cache. > > > > > > > > > > > > Let's suppose we have valid MR0, MR1, and not > > > > > > > > > > > > valid MRX in local > > > > > > > > cache. > > > > > > > > > > > > And there are traffic in the datapath for MR0 and > > > > > > > > > > > > MR1, and no traffic for MRX anymore. > > > > > > > > > > > > > > > > > > > > > > > > 1) If we do as you propose: > > > > > > > > > > > > a) take a lock > > > > > > > > > > > > b) request flush local cache first - all MR0, MR1, > > > > > > > > > > > > MRX will be removed on translation in datapath > > > > > > > > > > > > c) update global cache, > > > > > > > > > > > > d) free lock > > > > > > > > > > > > All the traffic for valid MR0, MR1 ALWAYS will be > > > > > > > > > > > > blocked on lock taken for cache update since point > > > > > > > > > > > > b) till point > > > > d). 
> > > > > > > > > > > > > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > > > > > > > a) take a lock > > > > > > > > > > > > b) update global cache > > > > > > > > > > > > c) request flush local cache > > > > > > > > > > > > d) free lock > > > > > > > > > > > > The traffic MIGHT be locked ONLY for MRs > > > > > > > > > > > > non-existing in local cache (not happens for MR0 > > > > > > > > > > > > and MR1, must not happen for MRX), and probability > > > > > > > > > > > > should be minor. And lock might happen since > > > > > > > > > > > > c) till > > > > > > > > > > > > d) > > > > > > > > > > > > - quite short period of time > > > > > > > > > > > > > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > > > > > > > > > > > > > Lock probability: > > > > > > > > > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > > > > > > > > > 2) lock MIGHT happen, for cache miss ONLY, after > > > > > > > > > > > > c) > > > > > > > > > > > > > > > > > > > > > > > > Lock duration: > > > > > > > > > > > > - 1) lock since b) till d), > > > > > > > > > > > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the > > > > > > > > > > > > > bottom layer can do more things to ensure the > > > > > > > > > > > > > correct execution of the program, which may have > > > > > > > > > > > > > a negative impact on the performance in a short > > > > > > > > > > > > > time, but in the long run, the performance will > > > > > > > > > > > > > eventually > > > > > > > > > come back. > > > > > > > > > > > > > Furthermore, maybe we should pay attention to > > > > > > > > > > > > > the performance in the stable period, and try > > > > > > > > > > > > > our best to ensure the correctness of the > > > > > > > > > > > > > program in case of > > > > > > > > > > > > emergencies. 
> > > > > > > > > > > > > > > > > > > > > > > > If we have some mbufs still allocated in memory > > > > > > > > > > > > being freed > > > > > > > > > > > > - there is nothing to say about correctness, it is > > > > > > > > > > > > totally incorrect. In my opinion, we should not > > > > > > > > > > > > think how to mitigate this incorrect behavior, we > > > > > > > > > > > > should not encourage application developers to > > > > > > > > > > > > follow the wrong > > > > > > > > > > > approaches. > > > > > > > > > > > > > > > > > > > > > > > > With best regards, Slava > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > > With best regards, Slava ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-05-12 11:07 ` [dpdk-dev] " Slava Ovsiienko @ 2021-05-13 5:49 ` Feifei Wang 2021-05-13 10:49 ` [dpdk-dev] " Slava Ovsiienko 0 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-05-13 5:49 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd, nd Hi, Slava Please see below. > -----邮件原件----- > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > 发送时间: 2021年5月12日 19:08 > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache > > Hi, Feifei > > ..ship.. > > > > If I understand correctly, your meaning is that if without wmb, maybe > > other agents observe changed "dev_gen", but they also observe > > unchanged "global" cache. > > This can be defined as memory inconsistent state. > > > > Fig1 > > ---------------------------------------------------------------------- > > ------------------------ > > ------- > > Timeslot agent_1 agent_2 > > 1 take_lock > > 2 update dev_gen > > 3 observe changed dev_gen > > 4 clear local cache > > > > 5 rebuild global cache wait_lock > > 6 free_lock > > 7 wmb take_lock > > 8 get(new MR) > > 9 free_lock > > ---------------------------------------------------------------------- > > ------------------------ > > ------- > > Yes, something like that. > > > > 1. However, in out-of-order platform, though adding a 'wmb at last', > > 'dev_gen' maybe updated before global cache rebuild, and then other > > agents can observe changed 'dev_ge' > > before rebuilding global cache. > > > > Thus, though add a 'wmb at last', It is still unable to prevent other > > agents from observing some inconsistent state. As a result, 'wmb at > > last' fails to keep consistence. 
> > 2. On the other hand, due to the lock, agent_2 will wait to take a
> > lock until the global cache is rebuilt by agent_1, and this ensures
> > agent_2 can get a correct new MR and update its local cache correctly.
> >
> > In summary, 'wmb at last' cannot guarantee that other agents observe
> > a consistent state.
> > But the lock can fix this error. So, the existence of wmb at last is
> > redundant and we can remove it.
>
> If dev_gen change is committed and cache's one is not yet - the agent_2
> might see inconsistent state even inside the lock-protected section. Hence,
> we must commit all writes before leaving the locked section in agent_1.
>
> Let’s suppose there is no wmb in agent_1 at all, and dev_gen is arbitrary
> committed by CPU and MR cache data change is not. We leave the locked
> section in agent_1, agent_2 sees dev_gen changed, takes the lock and sees
> inconsistent MR-cache state due to not all changes made in agent_1 are
> committed. With wmb we have now in the existing code - there is no issue
> like that.

I can understand your worry that without the 'wmb at last', when agent_1 leaves the locked section, agent_2 may still observe the unchanged global cache.

However, when agent_2 takes the lock and does get(new MR) in time slots 7-8 (Fig. 1), it means agent_1 has finished updating the global cache and the lock has been freed. Besides, if agent_2 can take the lock, it also shows that agent_2 has observed the changed global cache. This is because there is a store-release in rte_rwlock_read_unlock: store-release ensures that all store operations before the release are committed if a store operation after the release has been observed by other agents.

Thus, in our case, if agent_2 can observe the updated 'rwl->cnt' (the lock word) and take the lock, it can also observe the updated 'dev_gen' and global cache. As a result, the wmb can be removed thanks to the store-release in the R/W unlock.
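[Editor's illustration] The store-release argument can be demonstrated with a minimal test-and-set lock whose unlock is a plain store-release (a simplified model, not DPDK's actual rte_rwlock implementation): once a second thread manages to acquire the lock, every write the first thread made inside its critical section is already visible, with no trailing wmb required.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Minimal test-and-set lock: acquire on lock, release on unlock. */
static _Atomic int lock_word;
static int global_cache;        /* protected data                        */
static int dev_gen;             /* here also written only under the lock */

static void toy_lock(void)
{
    while (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire))
        ;                                   /* spin until free */
}

static void toy_unlock(void)
{
    /* store-release: commits ALL prior writes (cache and dev_gen alike)
     * before the cleared lock word can be observed - which is why a
     * trailing wmb after dev_gen adds nothing for locked readers. */
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}

static void *agent_1(void *arg)
{
    (void)arg;
    toy_lock();
    dev_gen = 1;                /* order vs. the rebuild does not matter */
    global_cache = 42;          /* rebuild */
    toy_unlock();               /* no wmb needed after this */
    return NULL;
}

static void *agent_2(void *arg)
{
    int g, c;

    do {                        /* poll under the lock until the rebuild
                                 * becomes visible */
        toy_lock();
        g = dev_gen;
        c = global_cache;
        toy_unlock();
    } while (g != 1);
    *(int *)arg = c;            /* 42 whenever dev_gen was seen as 1 */
    return NULL;
}

static int run_once(void)
{
    pthread_t t1, t2;
    int seen = 0;

    pthread_create(&t2, NULL, agent_2, &seen);
    pthread_create(&t1, NULL, agent_1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return seen;
}
```

Because both stores happen inside the critical section and the unlock is a release, agent_2 can never see dev_gen == 1 together with the stale cache value.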
Best Regards
Feifei

> With best regards,
> Slava
>
> > > > Best Regards
> > > > Feifei
> > > >
> > > > > -----Original Message-----
> > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > > > Sent: May 7, 2021 18:15
> > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > >
> > > > > Hi, Feifei
> > > > >
> > > > > We should consider the locks in your scenario - it is crucial for
> > > > > the complete model description:
> > > > >
> > > > > How agent_1 (in your terms) rebuilds the global cache:
> > > > >
> > > > > 1a) lock()
> > > > > 1b) rebuild(global cache)
> > > > > 1c) update(dev_gen)
> > > > > 1d) wmb()
> > > > > 1e) unlock()
> > > > >
> > > > > How agent_2 checks:
> > > > >
> > > > > 2a) check(dev_gen) (assume positive - changed)
> > > > > 2b) clear(local_cache)
> > > > > 2c) miss (on empty local_cache) -> eventually it goes to mr_lookup_caches()
> > > > > 2d) lock()
> > > > > 2e) get(new MR)
> > > > > 2f) unlock()
> > > > > 2g) update(local cache with obtained new MR)
> > > > >
> > > > > Hence, even if 1c) becomes visible in 2a) before 1b) is committed
> > > > > (say, due to an out-of-order Arch), agent_2 would be blocked on
> > > > > 2d), and the scenario depicted in your Fig2 would not happen
> > > > > (agent_2 will wait before its step 3 till agent_1 unlocks after
> > > > > its step 5).
> > > > > > > > > > With best regards, > > > > > Slava > > > > > > > > > > > -----Original Message----- > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > Sent: Friday, May 7, 2021 9:36> To: Slava Ovsiienko > > > > > > <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; > > Shahaf > > > > > > Shuler <shahafs@nvidia.com> > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > Wang > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > Memory Region cache > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > Thanks very much for your reply. > > > > > > > > > > > > > -----邮件原件----- > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > 发送时间: 2021年5月6日 19:22 > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > Ruifeng > > > > > Wang > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > > > Region cache > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > Sorry, I do not follow why we should get rid of the last > > > > > > > (after dev_gen update) wmb. > > > > > > > We've rebuilt the global cache, we should notify other > > > > > > > agents it's happened and they should flush local caches. So, > > > > > > > dev_gen change should be made visible to other agents to > > > > > > > trigger this activity and the second wmb is here to ensure this. > > > > > > > > > > > > 1. For the first problem why we should get rid of the last wmb > > > > > > and move it before dev_gen updated, I think our attention is > > > > > > how the wmb implements the synchronization between multiple > agents. 
> > > > > > Fig1
> > > > > > ----------------------------------------------------------------------
> > > > > > Timeslot        agent_1                   agent_2
> > > > > > 1               rebuild global cache
> > > > > > 2               wmb
> > > > > > 3               update dev_gen ---------- load changed dev_gen
> > > > > > 4                                         rebuild local cache
> > > > > > ----------------------------------------------------------------------
> > > > > >
> > > > > > First, the wmb is only for the local thread, to keep the order
> > > > > > between local write-write operations:
> > > > > > Based on the picture above, for agent_1, the wmb keeps the order
> > > > > > that rebuilding the global cache always happens before updating
> > > > > > dev_gen.
> > > > > >
> > > > > > Second, agent_1 communicates with agent_2 through the global
> > > > > > variable "dev_gen":
> > > > > > If agent_1 updates dev_gen, agent_2 will load it and then know
> > > > > > that it should rebuild its local cache.
> > > > > >
> > > > > > Finally, agent_2 rebuilds its local cache according to whether
> > > > > > agent_1 has rebuilt the global cache, and agent_2 learns this
> > > > > > from the variable "dev_gen".
> > > > > > Fig2
> > > > > > ----------------------------------------------------------------------
> > > > > > Timeslot        agent_1                   agent_2
> > > > > > 1               update dev_gen
> > > > > > 2                                         load changed dev_gen
> > > > > > 3                                         rebuild local cache
> > > > > > 4               rebuild global cache
> > > > > > 5               wmb
> > > > > > ----------------------------------------------------------------------
> > > > > >
> > > > > > However, on an Arm platform, if the wmb is placed after the
> > > > > > dev_gen update, "dev_gen" may be updated before agent_1 rebuilds
> > > > > > the global cache; agent_2 may then receive the wrong message and
> > > > > > rebuild its local cache in advance.
> > > > > >
> > > > > > To summarize, it is not important at which time other agents see
> > > > > > the changed global variable "dev_gen".
> > > > > > (Actually, a wmb after "dev_gen" cannot ensure the changed
> > > > > > "dev_gen" is committed to the global.)
> > > > > > It is more important that, if other agents see the changed
> > > > > > "dev_gen", they can also know the global cache has been updated.
> > > > > >
> > > > > > > One more point: since registering a new/destroying an existing
> > > > > > > MR involves FW (via kernel) calls, it takes so many CPU cycles
> > > > > > > that we could neglect the wmb overhead at all.
> > > > > >
> > > > > > We just move the last wmb into the right place; we do not delete
> > > > > > it for performance.
> > > > > >
> > > > > > > Also, regarding this:
> > > > > > >
> > > > > > > > Another question suddenly occurred to me: in order to keep
> > > > > > > > the order that the global cache is rebuilt before "dev_gen"
> > > > > > > > is updated, the wmb should be before updating "dev_gen"
> > > > > > > > rather than after it.
> > > > > > > > > Otherwise, in the out-of-order platforms, current order > > > > > > > cannot be > > > > > > kept. > > > > > > > > > > > > > > it is not clear why ordering is important - global cache > > > > > > > update and dev_gen change happen under spinlock protection, > > > > > > > so only the last wmb is meaningful. > > > > > > > > > > > > > > > > > > > 2. The second function of wmb before "dev_gen" updated is for > > > > > > performance according to our previous discussion. > > > > > > According to Fig2, if there is no wmb between "global cache > updated" > > > > > > and "dev_gen updated", "dev_gen" may update before global > > > > > > cache > > > > > updated. > > > > > > > > > > > > Then agent_2 may see the changed "dev_gen" and flush entire > > > > > > local cache in advance. > > > > > > > > > > > > This entire flush can degrade the performance: > > > > > > "the local cache is getting empty and can't provide > > > > > > translation for other valid (not being removed) MRs, and the > > > > > > translation has to look up in the global cache, that is locked > > > > > > now for rebuilding, this causes the delays in data path on > > > > > > acquiring global > > cache lock." > > > > > > > > > > > > Furthermore, spinlock is just for global cache, not for > > > > > > dev_gen and local cache. > > > > > > > > > > > > > To summarize, in my opinion: > > > > > > > - if you see some issue with ordering of global cache > > > > > > > update/dev_gen signalling, > > > > > > > could you, please, elaborate? 
I'm not sure we should > > > > > > > maintain an order (due to spinlock protection) > > > > > > > - the last rte_smp_wmb() after dev_gen incrementing should > > > > > > > be kept intact > > > > > > > > > > > > > > > > > > > At last, for my view, there are two functions that moving wmb > > > > > > before "dev_gen" > > > > > > for the write-write order: > > > > > > -------------------------------- > > > > > > a) rebuild global cache; > > > > > > b) rte_smp_wmb(); > > > > > > c) updating dev_gen > > > > > > -------------------------------- 1. Achieve synchronization > > > > > > between multiple threads in the right way 2. > > > > > > Prevent other agents from flushing local cache early to ensure > > > > > > performance > > > > > > > > > > > > Best Regards > > > > > > Feifei > > > > > > > > > > > > > With best regards, > > > > > > > Slava > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > > Sent: Thursday, May 6, 2021 5:52 > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > > > > Ruifeng > > > > > Wang > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > Would you have more comments about this patch? > > > > > > > > For my sight, only one wmb before "dev_gen" updating is > > > > > > > > enough to synchronize. > > > > > > > > > > > > > > > > Thanks very much for your attention. 
> > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > Feifei > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > 发件人: Feifei Wang > > > > > > > > > 发送时间: 2021年4月20日 16:42 > > > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > > Azrad > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > Ruifeng > > > > > > > Wang > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > Memory > > > > > > Region > > > > > > > > > cache > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > I think the second wmb can be removed. > > > > > > > > > As I know, wmb is just a barrier to keep the order > > > > > > > > > between write and > > > > > > > write. > > > > > > > > > and it cannot tell the CPU when it should commit the changes. > > > > > > > > > > > > > > > > > > It is usually used before guard variable to keep the > > > > > > > > > order that updating guard variable after some changes, > > > > > > > > > which you want to release, > > > > > > > > have been done. > > > > > > > > > > > > > > > > > > For example, for the wmb after global cache > > > > > > > > > update/before altering dev_gen, it can ensure the order > > > > > > > > > that updating global cache before altering > > > > > > > > > dev_gen: > > > > > > > > > 1)If other agent load the changed "dev_gen", it can know > > > > > > > > > the global cache has been updated. > > > > > > > > > 2)If other agents load the unchanged, "dev_gen", it > > > > > > > > > means the global cache has not been updated, and the > > > > > > > > > local cache will not be > > > > > > flushed. > > > > > > > > > > > > > > > > > > As a result, we use wmb and guard variable "dev_gen" to > > > > > > > > > ensure the global cache updating is "visible". 
> > > > > > > > > The "visible" means when updating guard variable "dev_gen" > > > > > > > > > is known by other agents, they also can confirm global > > > > > > > > > cache has been updated in the meanwhile. Thus, just one > > > > > > > > > wmb before altering dev_gen can ensure > > > > > > > > this. > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > > > 发送时间: 2021年4月20日 15:54 > > > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan > Azrad > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > Ruifeng > > > > > > > > Wang > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > > <nd@arm.com>; > > > > > > > nd > > > > > > > > > > <nd@arm.com> > > > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > In my opinion, there should be 2 barriers: > > > > > > > > > > - after global cache update/before altering dev_gen, > > > > > > > > > > to ensure the correct order > > > > > > > > > > - after altering dev_gen to make this change visible > > > > > > > > > > for other agents and to trigger local cache update > > > > > > > > > > > > > > > > > > > > With best regards, > > > > > > > > > > Slava > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > > > > > Sent: Tuesday, April 20, 2021 10:30 > > > > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > > > > > > > > > > > Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > > > > > > > 
Ruifeng > > > > > > > > Wang > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > > > <nd@arm.com>; > > > > > > > > nd > > > > > > > > > > > <nd@arm.com> > > > > > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild > > > > > > > > > > > bug for Memory Region cache > > > > > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > > > > > Another question suddenly occurred to me, in order > > > > > > > > > > > to keep the order that rebuilding global cache > > > > > > > > > > > before updating ”dev_gen“, the wmb should be before > > > > > > > > > > > updating > > > "dev_gen" > > > > > > > > > > > rather > > > > > than after it. > > > > > > > > > > > Otherwise, in the out-of-order platforms, current > > > > > > > > > > > order cannot be > > > > > > > kept. > > > > > > > > > > > > > > > > > > > > > > Thus, we should change the code as: > > > > > > > > > > > a) rebuild global cache; > > > > > > > > > > > b) rte_smp_wmb(); > > > > > > > > > > > c) updating dev_gen > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > > Feifei > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > > > 发件人: Feifei Wang > > > > > > > > > > > > 发送时间: 2021年4月20日 13:54 > > > > > > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; > > Matan > > > > > Azrad > > > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; > > stable@dpdk.org; > > > > > > > Ruifeng > > > > > > > > > > Wang > > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > > > <nd@arm.com> > > > > > > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug > > > > > > > > > > > > for Memory > > > > > > > > > Region > > > > > > > > > > > > cache > > > > > > > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > > > > > > > Thanks very much for your explanation. 
> > > > > > > > > > > > > > > > > > > > > > > > I can understand the app can wait all mbufs are > > > > > > > > > > > > returned to the memory pool, and then it can free > > > > > > > > > > > > this mbufs, I agree with > > > > > > this. > > > > > > > > > > > > > > > > > > > > > > > > As a result, I will remove the bug fix patch from > > > > > > > > > > > > this series and just replace the smp barrier with > > > > > > > > > > > > C11 thread fence. Thanks very much for your > > > > > > > > > > > > patient explanation > > > again. > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > > > > > > 发送时间: 2021年4月20日 2:51 > > > > > > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan > > > > Azrad > > > > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; > > > stable@dpdk.org; > > > > > > > > Ruifeng > > > > > > > > > > > Wang > > > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug > > > > > > > > > > > > > for Memory Region cache > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > Please, see below > > > > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. > > > > > > > > > > > > > > > Do we have some issue/bug with MR cache in > > practice? > > > > > > > > > > > > > > > > > > > > > > > > > > > > This patch fixes the bug which is based on > > > > > > > > > > > > > > logical deduction, and it doesn't actually happen. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" > > > > > > > > > > > > > > > cache for MRs to convert buffer address in > > > > > > > > > > > > > > > mbufs being transmitted to LKeys (HW-related > > > > > > > > > > > > > > > entity > > > > > > > > > > > > > > > handle) and the "global" cache for all MR > > > > > > > > > > > > > > > registered on the > > > > > > > > device. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > > > > > > > > > - check the local queue cache flush request > > > > > > > > > > > > > > > - lookup in local cache > > > > > > > > > > > > > > > - if not found: > > > > > > > > > > > > > > > - acquire lock for global cache read access > > > > > > > > > > > > > > > - lookup in global cache > > > > > > > > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > How cache update on memory > > > > > > > > > > > > > > > freeing/unregistering > > > > > > happens: > > > > > > > > > > > > > > > - acquire lock for global cache write access > > > > > > > > > > > > > > > - [a] remove relevant MRs from the global > > > > > > > > > > > > > > > cache > > > > > > > > > > > > > > > - [b] set local caches flush request > > > > > > > > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps > > > > > > > > > > > > > > > [a] and [b], and local caches flush is requested > earlier. > > > > > > > > > > > > > > > What problem does it > > > > > > > > solve? > > > > > > > > > > > > > > > It is not supposed there are in datapath > > > > > > > > > > > > > > > some mbufs referencing to the memory being > freed. > > > > > > > > > > > > > > > Application must ensure this and must not > > > > > > > > > > > > > > > allocate new mbufs from this memory regions > > > > > > > > > > being freed. 
> > > > > > > > > > > > > > > Hence, the lookups for these MRs in caches > > > > > > > > > > > > > > > should not > > > > > > occur. > > > > > > > > > > > > > > > > > > > > > > > > > > > > For your first point that, application can > > > > > > > > > > > > > > take charge of preventing MR freed memory > > > > > > > > > > > > > > being allocated to > > > > data path. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Does it means that If there is an emergency of > > > > > > > > > > > > > > MR fragment, such as hotplug, the application > > > > > > > > > > > > > > must inform thedata path in advance, and this > > > > > > > > > > > > > > memory will not be allocated, and then the > > > > > > > > > > > > > > control path will free this memory? If > > > > > > > > > > > > > > application can do like this, I agree that > > > > > > > > > > > > > > this bug > > > > > > > > > > > > cannot happen. > > > > > > > > > > > > > > > > > > > > > > > > > > Actually, this is the only correct way for > > > > > > > > > > > > > application to > > > > operate. > > > > > > > > > > > > > Let's suppose we have some memory area that > > > > > > > > > > > > > application wants to > > > > > > > > > > free. > > > > > > > > > > > > > ALL references to this area must be removed. If > > > > > > > > > > > > > we have some mbufs allocated from this area, it > > > > > > > > > > > > > means that we have memory pool created > > > > > > > > > > > there. 
> > > > > > > > > > > > > > > > > > > > > > > > > > What application should do: > > > > > > > > > > > > > - notify all its components/agents the memory > > > > > > > > > > > > > area is going to be freed > > > > > > > > > > > > > - all components/agents free the mbufs they > > > > > > > > > > > > > might own > > > > > > > > > > > > > - PMD might not support freeing for some mbufs > > > > > > > > > > > > > (for example being sent and awaiting for > > > > > > > > > > > > > completion), so app should just wait > > > > > > > > > > > > > - wait till all mbufs are returned to the memory > > > > > > > > > > > > > pool (by monitoring available obj == pool size) > > > > > > > > > > > > > > > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. > > > > > > > > > > > > > There are just some mbufs still allocated, it is > > > > > > > > > > > > > regardless to buf address to MR translation. We > > > > > > > > > > > > > just can't free the memory - the mapping will be > > > > > > > > > > > > > destroyed and might cause the segmentation fault > > > > > > > > > > > > > by SW or some HW issues on DMA access to > > > > > > > > > > > > > unmapped memory. It is very generic safety > > > > > > > > > > > > > approach - do not free the memory that is still > > > > > > > > > > > > > in > > > > > > > > use. > > > > > > > > > > > > > Hence, at the moment of freeing and > > > > > > > > > > > > > unregistering the MR, there MUST BE NO any > > > > > > > > > > > > mbufs in flight referencing to the addresses being freed. > > > > > > > > > > > > > No translation to MR being invalidated can happen. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For other side, the cache flush has negative > > > > > > > > > > > > > > > effect > > > > > > > > > > > > > > > - the local cache is getting empty and can't > > > > > > > > > > > > > > > provide translation for other valid (not > > > > > > > > > > > > > > > being > > > > > > > > > > > > > > > removed) MRs, and the translation has to > > > > > > > > > > > > > > > look up in the global cache, that is locked > > > > > > > > > > > > > > > now for rebuilding, this causes the delays > > > > > > > > > > > > > > > in datapatch > > > > > > > > > > > > > > on acquiring global cache lock. > > > > > > > > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > > > > > > > > > > > > > > > If above assumption is true, we can go to your > > > > > > > > > > > > > > second > > > > point. > > > > > > > > > > > > > > I think this is a problem of the tradeoff > > > > > > > > > > > > > > between cache coherence and > > > > > > > > > > > > > performance. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I can understand your meaning that though > > > > > > > > > > > > > > global cache has been changed, we should keep > > > > > > > > > > > > > > the valid MR in local cache as long as > > > > > > > > > > > > > > possible to ensure the fast > > > > searching speed. > > > > > > > > > > > > > > In the meanwhile, the local cache can be > > > > > > > > > > > > > > rebuilt later to reduce its waiting time for > > > > > > > > > > > > > > acquiring the global > > > > cache lock. > > > > > > > > > > > > > > > > > > > > > > > > > > > > However, this mechanism just ensures the > > > > > > > > > > > > > > performance unchanged for the first few mbufs. > > > > > > > > > > > > > > During the next mbufs lkey searching after 'dev_gen' > > > > > > > > > > > > > > updated, it is still necessary to update the local cache. 
> > > > > > > > > > > > > > And the performance can firstly reduce and > > > > > > > > > > > > > > then > > returns. > > > > > > > > > > > > > > Thus, no matter whether there is this patch or > > > > > > > > > > > > > > not, the performance will jitter in a certain > > > > > > > > > > > > > > period of > > > > > > > > > > > time. > > > > > > > > > > > > > > > > > > > > > > > > > > Local cache should be updated to remove MRs no > > > > > > > > > > > > > longer > > > > valid. > > > > > > > > > > > > > But we just flush the entire cache. > > > > > > > > > > > > > Let's suppose we have valid MR0, MR1, and not > > > > > > > > > > > > > valid MRX in local > > > > > > > > > cache. > > > > > > > > > > > > > And there are traffic in the datapath for MR0 > > > > > > > > > > > > > and MR1, and no traffic for MRX anymore. > > > > > > > > > > > > > > > > > > > > > > > > > > 1) If we do as you propose: > > > > > > > > > > > > > a) take a lock > > > > > > > > > > > > > b) request flush local cache first - all MR0, > > > > > > > > > > > > > MR1, MRX will be removed on translation in > > > > > > > > > > > > > datapath > > > > > > > > > > > > > c) update global cache, > > > > > > > > > > > > > d) free lock > > > > > > > > > > > > > All the traffic for valid MR0, MR1 ALWAYS will > > > > > > > > > > > > > be blocked on lock taken for cache update since > > > > > > > > > > > > > point > > > > > > > > > > > > > b) till point > > > > > d). > > > > > > > > > > > > > > > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > > > > > > > > a) take a lock > > > > > > > > > > > > > b) update global cache > > > > > > > > > > > > > c) request flush local cache > > > > > > > > > > > > > d) free lock > > > > > > > > > > > > > The traffic MIGHT be locked ONLY for MRs > > > > > > > > > > > > > non-existing in local cache (not happens for MR0 > > > > > > > > > > > > > and MR1, must not happen for MRX), and > > > > > > > > > > > > > probability should be minor. 
And lock might > > > > > > > > > > > > > happen since > > > > > > > > > > > > > c) till > > > > > > > > > > > > > d) > > > > > > > > > > > > > - quite short period of time > > > > > > > > > > > > > > > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > > > > > > > > > > > > > > > Lock probability: > > > > > > > > > > > > > - 1) lock ALWAYS happen for ANY MR translation after > b), > > > > > > > > > > > > > 2) lock MIGHT happen, for cache miss ONLY, > > > > > > > > > > > > > after > > > > > > > > > > > > > c) > > > > > > > > > > > > > > > > > > > > > > > > > > Lock duration: > > > > > > > > > > > > > - 1) lock since b) till d), > > > > > > > > > > > > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that > > > > > > > > > > > > > > the bottom layer can do more things to ensure > > > > > > > > > > > > > > the correct execution of the program, which > > > > > > > > > > > > > > may have a negative impact on the performance > > > > > > > > > > > > > > in a short time, but in the long run, the > > > > > > > > > > > > > > performance will eventually > > > > > > > > > > come back. > > > > > > > > > > > > > > Furthermore, maybe we should pay attention to > > > > > > > > > > > > > > the performance in the stable period, and try > > > > > > > > > > > > > > our best to ensure the correctness of the > > > > > > > > > > > > > > program in case of > > > > > > > > > > > > > emergencies. > > > > > > > > > > > > > > > > > > > > > > > > > > If we have some mbufs still allocated in memory > > > > > > > > > > > > > being freed > > > > > > > > > > > > > - there is nothing to say about correctness, it > > > > > > > > > > > > > is totally incorrect. 
In my opinion, we should > > > > > > > > > > > > > not think how to mitigate this incorrect > > > > > > > > > > > > > behavior, we should not encourage application > > > > > > > > > > > > > developers to follow the wrong > > > > > > > > > > > > approaches. > > > > > > > > > > > > > > > > > > > > > > > > > > With best regards, Slava > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > > > > With best regards, Slava ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-05-13 5:49 ` [dpdk-dev] Re: " Feifei Wang @ 2021-05-13 10:49 ` Slava Ovsiienko 2021-05-14 5:18 ` [dpdk-dev] Re: " Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Slava Ovsiienko @ 2021-05-13 10:49 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler Cc: dev, nd, stable, Ruifeng Wang, nd, nd

Hi, Feifei

..snip..

> I can understand your worry that, if there is no 'wmb at last', agent_2
> may still observe the unchanged global cache after agent_1 leaves the
> locked section.
>
> However, when agent_2 takes the lock and does get(new MR) in time slots 7
> and 8 (Fig. 1), it means agent_1 has finished updating the global cache
> and has freed the lock.
> Besides, if agent_2 can take the lock, it also shows that agent_2 has
> observed the changed global cache.
>
> This is because there is a store-release in rte_rwlock_read_unlock: the
> store-release ensures that all store operations before the 'release' are
> committed if a store operation after the 'release' is observed by other
> agents.

OK, I missed this implicit ordering hidden in the lock/unlock(); thank you for pointing that out to me.

I checked the listings for rte_rwlock_write_unlock(); it is implemented with __atomic_store_n(&rwl->cnt, 0, __ATOMIC_RELEASE) and:
- on x86 there is nothing special in the code, due to the strict x86-64 ordering model
- on ARMv8 there is an stlxr instruction, which implies the write barrier

Now you have convinced me 😊 - we can get rid of the explicit "wmb_at_last" in the code.
Please provide the patch with a clear commit message including the details above; it is important to mention:
- lock protection is involved; the ordering of the dev_gen and cache updates inside the protected section does not matter
- unlock() provides an implicit write barrier due to the atomic operation with __ATOMIC_RELEASE

Also, in my opinion, there should be a small comment in place of the wmb being removed, reminding that we are in the protected section and that unlock provides the implicit write barrier at the level visible by software.

Thank you for the meaningful discussion; I refreshed my knowledge of memory ordering models for different architectures 😊

With best regards,
Slava

> > > Best Regards
> > > Feifei
> > >
> > > > With best regards,
> > > > Slava
> > > >
> > > > > Best Regards
> > > > > Feifei
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > > > > Sent: May 7, 2021 18:15
> > > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > >
> > > > > > Hi, Feifei
> > > > > >
> > > > > > We should consider the locks in your scenario - it is crucial for
> > > > > > the complete model description:
> > > > > >
> > > > > > How agent_1 (in your terms) rebuilds the global cache:
> > > > > >
> > > > > > 1a) lock()
> > > > > > 1b) rebuild(global cache)
> > > > > > 1c) update(dev_gen)
> > > > > > 1d) wmb()
> > > > > > 1e) unlock()
> > > > > >
> > > > > > How agent_2 checks:
> > > > > >
> > > > > > 2a) check(dev_gen) (assume positive - changed)
> > > > > > 2b) clear(local_cache)
> > > > > > 2c) miss (on empty local_cache) -> eventually it goes to mr_lookup_caches()
> > > > > > 2d) lock()
> > > > > > 2e) get(new MR)
> > > 2f) unlock() > > > > > > 2g) update(local cache with obtained new MR) > > > > > > > > > > > > Hence, even if 1c) becomes visible in 2a) before 1b) committed > > > > > > (say, due to out-of-order Arch) - the agent 2 would be blocked > > > > > > on > > > > > > 2d) and scenario depicted on your Fig2 would not happen > > > > > > (agent_2 will wait before step 3 till agent 1 unlocks after its step 5). > > > > > > > > > > > > With best regards, > > > > > > Slava > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > Sent: Friday, May 7, 2021 9:36> To: Slava Ovsiienko > > > > > > > <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; > > > Shahaf > > > > > > > Shuler <shahafs@nvidia.com> > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng > > > > Wang > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > Memory Region cache > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > Thanks very much for your reply. > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > 发送时间: 2021年5月6日 19:22 > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > Ruifeng > > > > > > Wang > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > Sorry, I do not follow why we should get rid of the last > > > > > > > > (after dev_gen update) wmb. > > > > > > > > We've rebuilt the global cache, we should notify other > > > > > > > > agents it's happened and they should flush local caches. 
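The locked scenario quoted above can be condensed into a small, self-contained C11/pthreads model. This is purely illustrative — `global_cache`, `dev_gen`, `agent_1`, `agent_2`, and `run_model` are invented stand-ins, not the actual mlx5 code. It shows Slava's point: even if the dev_gen bump becomes visible before the rebuild is committed, agent_2 still serializes on the lock at step 2d before it can fetch the new MR.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative model only: names do not come from the mlx5 sources. */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static int global_cache;     /* stands in for the global MR cache      */
static atomic_uint dev_gen;  /* generation counter checked by datapath */

static void *agent_1(void *arg) /* control path: rebuilds the cache */
{
    (void)arg;
    pthread_mutex_lock(&cache_lock);                  /* 1a            */
    global_cache = 42;                                /* 1b rebuild    */
    atomic_fetch_add_explicit(&dev_gen, 1,
                              memory_order_relaxed);  /* 1c update     */
    pthread_mutex_unlock(&cache_lock);                /* 1e: release   */
    return NULL;
}

static int agent_2(void) /* datapath: returns the MR value it obtains */
{
    while (atomic_load_explicit(&dev_gen, memory_order_relaxed) == 0)
        ;                      /* 2a: wait until dev_gen changes       */
    /* 2b: local cache cleared; 2c: miss -> fall back to global lookup */
    pthread_mutex_lock(&cache_lock);                  /* 2d: blocks    */
    int mr = global_cache;                            /* 2e: new MR    */
    pthread_mutex_unlock(&cache_lock);                /* 2f            */
    return mr;                                        /* 2g            */
}

int run_model(void)
{
    pthread_t t;
    pthread_create(&t, NULL, agent_1, NULL);
    int got = agent_2();
    pthread_join(t, NULL);
    return got;
}
```

Because dev_gen is only bumped while the lock is held, any agent that has seen the bump can only enter the critical section after agent_1's unlock, so it always reads the rebuilt cache.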
> > > > > > > > So, dev_gen change should be made visible to other agents
> > > > > > > > to trigger this activity and the second wmb is here to ensure this.
> > > > > > >
> > > > > > > 1. On the first problem - why we should get rid of the last wmb and
> > > > > > > move it before dev_gen is updated - I think the point is how the wmb
> > > > > > > implements the synchronization between multiple agents.
> > > > > > >
> > > > > > > Fig1
> > > > > > > ---------------------------------------------------------------------
> > > > > > > Timeslot        agent_1                          agent_2
> > > > > > > 1          rebuild global cache
> > > > > > > 2          wmb
> > > > > > > 3          update dev_gen  -------------->  load changed dev_gen
> > > > > > > 4                                           rebuild local cache
> > > > > > > ---------------------------------------------------------------------
> > > > > > >
> > > > > > > First, the wmb is only for the local thread, to keep the order
> > > > > > > between local write-write:
> > > > > > > based on the picture above, for agent_1, the wmb keeps the order that
> > > > > > > rebuilding the global cache always comes before updating dev_gen.
> > > > > > >
> > > > > > > Second, agent_1 communicates with agent_2 through the global variable
> > > > > > > "dev_gen":
> > > > > > > if agent_1 updates dev_gen, agent_2 will load it and then knows it
> > > > > > > should rebuild its local cache.
> > > > > > >
> > > > > > > Finally, agent_2 rebuilds its local cache according to whether
> > > > > > > agent_1 has rebuilt the global cache, and agent_2 learns this from
> > > > > > > the variable "dev_gen".
> > > > > > > Fig2
> > > > > > > ---------------------------------------------------------------------
> > > > > > > Timeslot        agent_1                          agent_2
> > > > > > > 1          update dev_gen
> > > > > > > 2                                           load changed dev_gen
> > > > > > > 3                                           rebuild local cache
> > > > > > > 4          rebuild global cache
> > > > > > > 5          wmb
> > > > > > > ---------------------------------------------------------------------
> > > > > > >
> > > > > > > However, on an Arm platform, if the wmb comes after dev_gen is
> > > > > > > updated, "dev_gen" may be updated before agent_1 rebuilds the global
> > > > > > > cache; agent_2 may then get the wrong message and rebuild its local
> > > > > > > cache prematurely.
> > > > > > >
> > > > > > > To summarize, it is not important at what time other agents can see
> > > > > > > the changed global variable "dev_gen".
> > > > > > > (Actually, a wmb after "dev_gen" cannot ensure the changed "dev_gen"
> > > > > > > is committed to the global.)
> > > > > > > It is more important that if other agents see the changed "dev_gen",
> > > > > > > they also know the global cache has been updated.
> > > > > > >
> > > > > > > > One more point, due to registering new/destroying existing
> > > > > > > > MR involves FW (via kernel) calls, it takes so many CPU
> > > > > > > > cycles that we could neglect wmb overhead at all.
> > > > > > >
> > > > > > > We just move the last wmb into the right place; we do not delete it,
> > > > > > > for performance.
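The "wmb before the guard variable" ordering that Fig2 argues for is the standard release-publication pattern. In the C11 atomics that this series moves to (patches 2/4 and 4/4), it can be sketched as follows — a hedged, self-contained model, with `cache_payload` and `gen` as invented stand-ins for the global cache and dev_gen:

```c
#include <pthread.h>
#include <stdatomic.h>

static int cache_payload;  /* stands in for the rebuilt global cache */
static atomic_uint gen;    /* stands in for the dev_gen guard        */

static void *publisher(void *arg)
{
    (void)arg;
    cache_payload = 7;                                    /* a) rebuild */
    atomic_thread_fence(memory_order_release);            /* b) the wmb */
    atomic_store_explicit(&gen, 1, memory_order_relaxed); /* c) publish */
    return NULL;
}

static int consumer(void)
{
    /* An agent that observes the changed guard ...                     */
    while (atomic_load_explicit(&gen, memory_order_relaxed) == 0)
        ;
    atomic_thread_fence(memory_order_acquire);  /* pairs with fence b)  */
    return cache_payload; /* ... is guaranteed to see the rebuilt cache */
}

int run_publish(void)
{
    pthread_t t;
    pthread_create(&t, NULL, publisher, NULL);
    int v = consumer();
    pthread_join(t, NULL);
    return v;
}
```

The release fence at b) is what gives point 1) above: if another agent loads the changed guard, the cache update is visible to it. A fence placed only after c) provides no such guarantee on weakly ordered machines.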
> > > > > > > > > > > > > > > > > > > > > > > Also, regarding this: > > > > > > > > > > > > > > > > > > Another question suddenly occurred to me, in order to > > > > > > > > keep the > > > > > > > > > > > > > > > > > > order that rebuilding global cache before updating > > > > > > > > > ”dev_gen“, the > > > > > > > > > > wmb should be before updating "dev_gen" rather than after > it. > > > > > > > > > > Otherwise, in the out-of-order platforms, current > > > > > > > > order cannot be > > > > > > > kept. > > > > > > > > > > > > > > > > it is not clear why ordering is important - global cache > > > > > > > > update and dev_gen change happen under spinlock > > > > > > > > protection, so only the last wmb is meaningful. > > > > > > > > > > > > > > > > > > > > > > 2. The second function of wmb before "dev_gen" updated is > > > > > > > for performance according to our previous discussion. > > > > > > > According to Fig2, if there is no wmb between "global cache > > updated" > > > > > > > and "dev_gen updated", "dev_gen" may update before global > > > > > > > cache > > > > > > updated. > > > > > > > > > > > > > > Then agent_2 may see the changed "dev_gen" and flush entire > > > > > > > local cache in advance. > > > > > > > > > > > > > > This entire flush can degrade the performance: > > > > > > > "the local cache is getting empty and can't provide > > > > > > > translation for other valid (not being removed) MRs, and the > > > > > > > translation has to look up in the global cache, that is > > > > > > > locked now for rebuilding, this causes the delays in data > > > > > > > path on acquiring global > > > cache lock." > > > > > > > > > > > > > > Furthermore, spinlock is just for global cache, not for > > > > > > > dev_gen and local cache. > > > > > > > > > > > > > > > To summarize, in my opinion: > > > > > > > > - if you see some issue with ordering of global cache > > > > > > > > update/dev_gen signalling, > > > > > > > > could you, please, elaborate? 
I'm not sure we should > > > > > > > > maintain an order (due to spinlock protection) > > > > > > > > - the last rte_smp_wmb() after dev_gen incrementing should > > > > > > > > be kept intact > > > > > > > > > > > > > > > > > > > > > > At last, for my view, there are two functions that moving > > > > > > > wmb before "dev_gen" > > > > > > > for the write-write order: > > > > > > > -------------------------------- > > > > > > > a) rebuild global cache; > > > > > > > b) rte_smp_wmb(); > > > > > > > c) updating dev_gen > > > > > > > -------------------------------- 1. Achieve synchronization > > > > > > > between multiple threads in the right way 2. > > > > > > > Prevent other agents from flushing local cache early to > > > > > > > ensure performance > > > > > > > > > > > > > > Best Regards > > > > > > > Feifei > > > > > > > > > > > > > > > With best regards, > > > > > > > > Slava > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > > > Sent: Thursday, May 6, 2021 5:52 > > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > > > > > > > > > Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > > > > > Ruifeng > > > > > > Wang > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug > > > > > > > > > for Memory Region cache > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > Would you have more comments about this patch? > > > > > > > > > For my sight, only one wmb before "dev_gen" updating is > > > > > > > > > enough to synchronize. > > > > > > > > > > > > > > > > > > Thanks very much for your attention. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > 发件人: Feifei Wang > > > > > > > > > > 发送时间: 2021年4月20日 16:42 > > > > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan > > > Azrad > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; > > > > > Ruifeng > > > > > > > > Wang > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > > Memory > > > > > > > Region > > > > > > > > > > cache > > > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > > > I think the second wmb can be removed. > > > > > > > > > > As I know, wmb is just a barrier to keep the order > > > > > > > > > > between write and > > > > > > > > write. > > > > > > > > > > and it cannot tell the CPU when it should commit the > changes. > > > > > > > > > > > > > > > > > > > > It is usually used before guard variable to keep the > > > > > > > > > > order that updating guard variable after some changes, > > > > > > > > > > which you want to release, > > > > > > > > > have been done. > > > > > > > > > > > > > > > > > > > > For example, for the wmb after global cache > > > > > > > > > > update/before altering dev_gen, it can ensure the > > > > > > > > > > order that updating global cache before altering > > > > > > > > > > dev_gen: > > > > > > > > > > 1)If other agent load the changed "dev_gen", it can > > > > > > > > > > know the global cache has been updated. > > > > > > > > > > 2)If other agents load the unchanged, "dev_gen", it > > > > > > > > > > means the global cache has not been updated, and the > > > > > > > > > > local cache will not be > > > > > > > flushed. 
> > > > > > > > > > > > > > > > > > > > As a result, we use wmb and guard variable "dev_gen" > > > > > > > > > > to ensure the global cache updating is "visible". > > > > > > > > > > The "visible" means when updating guard variable > "dev_gen" > > > > > > > > > > is known by other agents, they also can confirm global > > > > > > > > > > cache has been updated in the meanwhile. Thus, just > > > > > > > > > > one wmb before altering dev_gen can ensure > > > > > > > > > this. > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > > > > 发送时间: 2021年4月20日 15:54 > > > > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; Matan > > Azrad > > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; > stable@dpdk.org; > > > > > > Ruifeng > > > > > > > > > Wang > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > > > <nd@arm.com>; > > > > > > > > nd > > > > > > > > > > > <nd@arm.com> > > > > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > In my opinion, there should be 2 barriers: > > > > > > > > > > > - after global cache update/before altering > > > > > > > > > > > dev_gen, to ensure the correct order > > > > > > > > > > > - after altering dev_gen to make this change > > > > > > > > > > > visible for other agents and to trigger local cache > > > > > > > > > > > update > > > > > > > > > > > > > > > > > > > > > > With best regards, > > > > > > > > > > > Slava > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com> > > > > > > > > > > > > Sent: Tuesday, April 
20, 2021 10:30 > > > > > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; > > > > > > > > > > > > Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; > > > > > > > > > > > > stable@dpdk.org; Ruifeng > > > > > > > > > Wang > > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > > > > <nd@arm.com>; > > > > > > > > > nd > > > > > > > > > > > > <nd@arm.com> > > > > > > > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild > > > > > > > > > > > > bug for Memory Region cache > > > > > > > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > > > > > > > Another question suddenly occurred to me, in order > > > > > > > > > > > > to keep the order that rebuilding global cache > > > > > > > > > > > > before updating ”dev_gen“, the wmb should be > > > > > > > > > > > > before updating > > > > "dev_gen" > > > > > > > > > > > > rather > > > > > > than after it. > > > > > > > > > > > > Otherwise, in the out-of-order platforms, current > > > > > > > > > > > > order cannot be > > > > > > > > kept. 
> > > > > > > > > > > > > > > > > > > > > > > > Thus, we should change the code as: > > > > > > > > > > > > a) rebuild global cache; > > > > > > > > > > > > b) rte_smp_wmb(); > > > > > > > > > > > > c) updating dev_gen > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > > > Feifei > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > > > > 发件人: Feifei Wang > > > > > > > > > > > > > 发送时间: 2021年4月20日 13:54 > > > > > > > > > > > > > 收件人: Slava Ovsiienko <viacheslavo@nvidia.com>; > > > Matan > > > > > > Azrad > > > > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; > > > stable@dpdk.org; > > > > > > > > Ruifeng > > > > > > > > > > > Wang > > > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd > > > > > > > <nd@arm.com> > > > > > > > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug > > > > > > > > > > > > > for Memory > > > > > > > > > > Region > > > > > > > > > > > > > cache > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks very much for your explanation. > > > > > > > > > > > > > > > > > > > > > > > > > > I can understand the app can wait all mbufs are > > > > > > > > > > > > > returned to the memory pool, and then it can > > > > > > > > > > > > > free this mbufs, I agree with > > > > > > > this. > > > > > > > > > > > > > > > > > > > > > > > > > > As a result, I will remove the bug fix patch > > > > > > > > > > > > > from this series and just replace the smp > > > > > > > > > > > > > barrier with > > > > > > > > > > > > > C11 thread fence. Thanks very much for your > > > > > > > > > > > > > patient explanation > > > > again. 
> > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > > > > > > > 发件人: Slava Ovsiienko <viacheslavo@nvidia.com> > > > > > > > > > > > > > > 发送时间: 2021年4月20日 2:51 > > > > > > > > > > > > > > 收件人: Feifei Wang <Feifei.Wang2@arm.com>; > Matan > > > > > Azrad > > > > > > > > > > > > > > <matan@nvidia.com>; Shahaf Shuler > > > > > > > > > > > > > > <shahafs@nvidia.com> > > > > > > > > > > > > > > 抄送: dev@dpdk.org; nd <nd@arm.com>; > > > > stable@dpdk.org; > > > > > > > > > Ruifeng > > > > > > > > > > > > Wang > > > > > > > > > > > > > > <Ruifeng.Wang@arm.com>; nd <nd@arm.com> > > > > > > > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild > > > > > > > > > > > > > > bug for Memory Region cache > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > > > Please, see below > > > > > > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. > > > > > > > > > > > > > > > > Do we have some issue/bug with MR cache in > > > practice? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This patch fixes the bug which is based on > > > > > > > > > > > > > > > logical deduction, and it doesn't actually happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" > > > > > > > > > > > > > > > > cache for MRs to convert buffer address in > > > > > > > > > > > > > > > > mbufs being transmitted to LKeys > > > > > > > > > > > > > > > > (HW-related entity > > > > > > > > > > > > > > > > handle) and the "global" cache for all MR > > > > > > > > > > > > > > > > registered on the > > > > > > > > > device. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > > > > > > > > > > - check the local queue cache flush > > > > > > > > > > > > > > > > request > > > > > > > > > > > > > > > > - lookup in local cache > > > > > > > > > > > > > > > > - if not found: > > > > > > > > > > > > > > > > - acquire lock for global cache read > > > > > > > > > > > > > > > > access > > > > > > > > > > > > > > > > - lookup in global cache > > > > > > > > > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > How cache update on memory > > > > > > > > > > > > > > > > freeing/unregistering > > > > > > > happens: > > > > > > > > > > > > > > > > - acquire lock for global cache write > > > > > > > > > > > > > > > > access > > > > > > > > > > > > > > > > - [a] remove relevant MRs from the global > > > > > > > > > > > > > > > > cache > > > > > > > > > > > > > > > > - [b] set local caches flush request > > > > > > > > > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If I understand correctly, your patch > > > > > > > > > > > > > > > > swaps [a] and [b], and local caches flush > > > > > > > > > > > > > > > > is requested > > earlier. > > > > > > > > > > > > > > > > What problem does it > > > > > > > > > solve? > > > > > > > > > > > > > > > > It is not supposed there are in datapath > > > > > > > > > > > > > > > > some mbufs referencing to the memory being > > freed. > > > > > > > > > > > > > > > > Application must ensure this and must not > > > > > > > > > > > > > > > > allocate new mbufs from this memory > > > > > > > > > > > > > > > > regions > > > > > > > > > > > being freed. > > > > > > > > > > > > > > > > Hence, the lookups for these MRs in caches > > > > > > > > > > > > > > > > should not > > > > > > > occur. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For your first point that, application can > > > > > > > > > > > > > > > take charge of preventing MR freed memory > > > > > > > > > > > > > > > being allocated to > > > > > data path. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Does it means that If there is an emergency > > > > > > > > > > > > > > > of MR fragment, such as hotplug, the > > > > > > > > > > > > > > > application must inform thedata path in > > > > > > > > > > > > > > > advance, and this memory will not be > > > > > > > > > > > > > > > allocated, and then the control path will > > > > > > > > > > > > > > > free this memory? If application can do > > > > > > > > > > > > > > > like this, I agree that this bug > > > > > > > > > > > > > cannot happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Actually, this is the only correct way for > > > > > > > > > > > > > > application to > > > > > operate. > > > > > > > > > > > > > > Let's suppose we have some memory area that > > > > > > > > > > > > > > application wants to > > > > > > > > > > > free. > > > > > > > > > > > > > > ALL references to this area must be removed. > > > > > > > > > > > > > > If we have some mbufs allocated from this > > > > > > > > > > > > > > area, it means that we have memory pool > > > > > > > > > > > > > > created > > > > > > > > > > > > there. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > What application should do: > > > > > > > > > > > > > > - notify all its components/agents the memory > > > > > > > > > > > > > > area is going to be freed > > > > > > > > > > > > > > - all components/agents free the mbufs they > > > > > > > > > > > > > > might own > > > > > > > > > > > > > > - PMD might not support freeing for some mbufs > > > > > > > > > > > > > > (for example being sent and awaiting for > > > > > > > > > > > > > > completion), so app should just wait > > > > > > > > > > > > > > - wait till all mbufs are returned to the > > > > > > > > > > > > > > memory pool (by monitoring available obj == > > > > > > > > > > > > > > pool size) > > > > > > > > > > > > > > > > > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. > > > > > > > > > > > > > > There are just some mbufs still allocated, it > > > > > > > > > > > > > > is regardless to buf address to MR > > > > > > > > > > > > > > translation. We just can't free the memory - > > > > > > > > > > > > > > the mapping will be destroyed and might cause > > > > > > > > > > > > > > the segmentation fault by SW or some HW issues > > > > > > > > > > > > > > on DMA access to unmapped memory. It is very > > > > > > > > > > > > > > generic safety approach - do not free the > > > > > > > > > > > > > > memory that is still in > > > > > > > > > use. > > > > > > > > > > > > > > Hence, at the moment of freeing and > > > > > > > > > > > > > > unregistering the MR, there MUST BE NO any > > > > > > > > > > > > > mbufs in flight referencing to the addresses being > freed. > > > > > > > > > > > > > > No translation to MR being invalidated can happen. 
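The final "wait till all mbufs are returned" step amounts to polling the pool's available-object count until it equals the pool size. Below is a trivial stand-in model; in real DPDK code the check would presumably compare `rte_mempool_avail_count()` against the pool size, but that mapping is our assumption, not something stated above:

```c
#include <stdatomic.h>

/* Toy stand-in for a mempool: 'avail' objects currently free of 'size'. */
struct toy_pool {
    atomic_uint avail;
    unsigned int size;
};

/* All mbufs are back in the pool once avail == size. */
int pool_drained(struct toy_pool *p)
{
    return atomic_load(&p->avail) == p->size;
}

void wait_pool_drained(struct toy_pool *p)
{
    while (!pool_drained(p))
        ;  /* a real application would sleep/yield between polls */
}
```

Only after this wait succeeds is it safe to free and unregister the memory, since no mbuf can still reference it.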
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For other side, the cache flush has > > > > > > > > > > > > > > > > negative effect > > > > > > > > > > > > > > > > - the local cache is getting empty and > > > > > > > > > > > > > > > > can't provide translation for other valid > > > > > > > > > > > > > > > > (not being > > > > > > > > > > > > > > > > removed) MRs, and the translation has to > > > > > > > > > > > > > > > > look up in the global cache, that is > > > > > > > > > > > > > > > > locked now for rebuilding, this causes the > > > > > > > > > > > > > > > > delays in datapatch > > > > > > > > > > > > > > > on acquiring global cache lock. > > > > > > > > > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If above assumption is true, we can go to > > > > > > > > > > > > > > > your second > > > > > point. > > > > > > > > > > > > > > > I think this is a problem of the tradeoff > > > > > > > > > > > > > > > between cache coherence and > > > > > > > > > > > > > > performance. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I can understand your meaning that though > > > > > > > > > > > > > > > global cache has been changed, we should > > > > > > > > > > > > > > > keep the valid MR in local cache as long as > > > > > > > > > > > > > > > possible to ensure the fast > > > > > searching speed. > > > > > > > > > > > > > > > In the meanwhile, the local cache can be > > > > > > > > > > > > > > > rebuilt later to reduce its waiting time for > > > > > > > > > > > > > > > acquiring the global > > > > > cache lock. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > However, this mechanism just ensures the > > > > > > > > > > > > > > > performance unchanged for the first few mbufs. 
> > > > > > > > > > > > > > > During the next mbufs lkey searching after > 'dev_gen' > > > > > > > > > > > > > > > updated, it is still necessary to update the local > cache. > > > > > > > > > > > > > > > And the performance can firstly reduce and > > > > > > > > > > > > > > > then > > > returns. > > > > > > > > > > > > > > > Thus, no matter whether there is this patch > > > > > > > > > > > > > > > or not, the performance will jitter in a > > > > > > > > > > > > > > > certain period of > > > > > > > > > > > > time. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Local cache should be updated to remove MRs no > > > > > > > > > > > > > > longer > > > > > valid. > > > > > > > > > > > > > > But we just flush the entire cache. > > > > > > > > > > > > > > Let's suppose we have valid MR0, MR1, and not > > > > > > > > > > > > > > valid MRX in local > > > > > > > > > > cache. > > > > > > > > > > > > > > And there are traffic in the datapath for MR0 > > > > > > > > > > > > > > and MR1, and no traffic for MRX anymore. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) If we do as you propose: > > > > > > > > > > > > > > a) take a lock > > > > > > > > > > > > > > b) request flush local cache first - all MR0, > > > > > > > > > > > > > > MR1, MRX will be removed on translation in > > > > > > > > > > > > > > datapath > > > > > > > > > > > > > > c) update global cache, > > > > > > > > > > > > > > d) free lock > > > > > > > > > > > > > > All the traffic for valid MR0, MR1 ALWAYS will > > > > > > > > > > > > > > be blocked on lock taken for cache update > > > > > > > > > > > > > > since point > > > > > > > > > > > > > > b) till point > > > > > > d). 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > > > > > > > > > a) take a lock > > > > > > > > > > > > > > b) update global cache > > > > > > > > > > > > > > c) request flush local cache > > > > > > > > > > > > > > d) free lock > > > > > > > > > > > > > > The traffic MIGHT be locked ONLY for MRs > > > > > > > > > > > > > > non-existing in local cache (not happens for > > > > > > > > > > > > > > MR0 and MR1, must not happen for MRX), and > > > > > > > > > > > > > > probability should be minor. And lock might > > > > > > > > > > > > > > happen since > > > > > > > > > > > > > > c) till > > > > > > > > > > > > > > d) > > > > > > > > > > > > > > - quite short period of time > > > > > > > > > > > > > > > > > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > > > > > > > > > > > > > > > > > Lock probability: > > > > > > > > > > > > > > - 1) lock ALWAYS happen for ANY MR translation > > > > > > > > > > > > > > after > > b), > > > > > > > > > > > > > > 2) lock MIGHT happen, for cache miss ONLY, > > > > > > > > > > > > > > after > > > > > > > > > > > > > > c) > > > > > > > > > > > > > > > > > > > > > > > > > > > > Lock duration: > > > > > > > > > > > > > > - 1) lock since b) till d), > > > > > > > > > > > > > > 2) lock since c) till d), that seems to be much > shorter. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that > > > > > > > > > > > > > > > the bottom layer can do more things to > > > > > > > > > > > > > > > ensure the correct execution of the program, > > > > > > > > > > > > > > > which may have a negative impact on the > > > > > > > > > > > > > > > performance in a short time, but in the long > > > > > > > > > > > > > > > run, the performance will eventually > > > > > > > > > > > come back. 
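The two orderings being compared can be written side by side. The stubs below only record the event sequence; every name is invented for illustration:

```c
#include <string.h>

/* Event trace used to compare the two orderings; illustrative stubs. */
const char *trace[4];
int trace_len;

static void ev(const char *s)          { trace[trace_len++] = s; }
static void take_lock(void)            { ev("lock");    }
static void free_lock(void)            { ev("unlock");  }
static void rebuild_global_cache(void) { ev("rebuild"); }
static void request_local_flush(void)  { ev("flush");   }

/* 1) proposed order: every translation misses from b) to d),
 *    so even valid MRs block on the lock for the whole rebuild. */
void rebuild_flush_first(void)
{
    take_lock();              /* a */
    request_local_flush();    /* b */
    rebuild_global_cache();   /* c */
    free_lock();              /* d */
}

/* 2) current order: only a cache miss between c) and d) can block,
 *    a much shorter window. */
void rebuild_flush_last(void)
{
    take_lock();              /* a */
    rebuild_global_cache();   /* b */
    request_local_flush();    /* c */
    free_lock();              /* d */
}
```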
> > > > > > > > > > > > > > > Furthermore, maybe we should pay attention > > > > > > > > > > > > > > > to the performance in the stable period, and > > > > > > > > > > > > > > > try our best to ensure the correctness of > > > > > > > > > > > > > > > the program in case of > > > > > > > > > > > > > > emergencies. > > > > > > > > > > > > > > > > > > > > > > > > > > > > If we have some mbufs still allocated in > > > > > > > > > > > > > > memory being freed > > > > > > > > > > > > > > - there is nothing to say about correctness, > > > > > > > > > > > > > > it is totally incorrect. In my opinion, we > > > > > > > > > > > > > > should not think how to mitigate this > > > > > > > > > > > > > > incorrect behavior, we should not encourage > > > > > > > > > > > > > > application developers to follow the wrong > > > > > > > > > > > > > approaches. > > > > > > > > > > > > > > > > > > > > > > > > > > > > With best regards, Slava > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards Feifei > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > With best regards, Slava ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache 2021-05-13 10:49 ` [dpdk-dev] " Slava Ovsiienko @ 2021-05-14 5:18 ` Feifei Wang 0 siblings, 0 replies; 36+ messages in thread
From: Feifei Wang @ 2021-05-14 5:18 UTC (permalink / raw)
To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler
Cc: dev, nd, stable, Ruifeng Wang, nd, nd, nd

Hi, Slava

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: May 13, 2021 18:49
> To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
>
> Hi, Feifei
>
> .. snip..
>
> > I can understand your worry that if there is no 'wmb at last', when
> > agent_1 leaves the locked section, agent_2 may still observe the
> > unchanged global cache.
> >
> > However, when agent_2 takes the lock and get(new MR) in time slot 7
> > (Fig.1), it means agent_1 has finished updating the global cache and
> > the lock is freed.
> > Besides, if agent_2 can take the lock, it also shows agent_2 has
> > observed the changed global cache.
> >
> > This is because there is a store-release in rte_rwlock_read_unlock;
> > the store-release ensures all store operations before the 'release'
> > are committed if a store operation after the 'release' is observed by
> > other agents.
>
> OK, I missed this implicit ordering hidden in the lock/unlock(), thank you for
> pointing that out to me.
>
> I checked the listings for rte_rwlock_write_unlock(), it is implemented with
> __atomic_store_n(&rwl->cnt, 0, __ATOMIC_RELEASE) and:
> - on x86 there is nothing special in the code due to the strict x86-64
> ordering model
> - on ARMv8 there is the stlxr instruction, which implies the write barrier
>
> Now you convinced me 😊, we can get rid of the explicit "wmb_at_last" in
> the code.
> Please, provide the patch with clear commit message with details above, it is
> important to mention:
> - lock protection is involved, dev_gen and cache update ordering inside the
> protected section does not matter
> - unlock() provides implicit write barrier due to atomic operation
> _ATOMIC_RELEASE

Ok, I will be careful with the commit message and include the above details in it.

> Also, in my opinion, there should be small comment in place of wmb being
> removed, reminding we are in the protected section and unlock provides the
> implicit write barrier at the level visible by software.

Yes, it is necessary to add a small comment in place of the wmb to explain
why we do this.

> Thank you for the meaningful discussion, I refreshed my knowledge of
> memory ordering models for different architectures 😊

Also thanks very much for this discussion; I learned a lot from it and enjoyed
exploring these problems. Looking forward to your review of the next version,
thanks a lot 😊.
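For reference, the release-store unlock the discussion settles on can be reduced to a minimal lock model: the `memory_order_release` store in the unlock (mirroring the `__atomic_store_n(&rwl->cnt, 0, __ATOMIC_RELEASE)` quoted above) is what makes every write inside the protected section visible to the next acquirer, so no separate wmb is needed. A hedged sketch with invented names, not the real rte_rwlock:

```c
#include <stdatomic.h>

typedef struct { atomic_int cnt; } toy_lock; /* illustrative only */

static inline void toy_lock_acquire(toy_lock *l)
{
    int expected = 0;
    /* acquire semantics pair with the release store in toy_lock_release() */
    while (!atomic_compare_exchange_weak_explicit(&l->cnt, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;
}

static inline void toy_lock_release(toy_lock *l)
{
    /* mirrors __atomic_store_n(&rwl->cnt, 0, __ATOMIC_RELEASE):
     * all writes made while holding the lock are ordered before this
     * store, which is the "implicit write barrier" of unlock() */
    atomic_store_explicit(&l->cnt, 0, memory_order_release);
}
```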
Best Regards
Feifei

> With best regards,
> Slava
>
> > > > Best Regards
> > > > Feifei
> > > >
> > > > > With best regards,
> > > > > Slava
> > > > >
> > > > > > Best Regards
> > > > > > Feifei
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > > > > > Sent: Friday, May 7, 2021 18:15
> > > > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > > >
> > > > > > > Hi, Feifei
> > > > > > >
> > > > > > > We should consider the locks in your scenario - it is crucial
> > > > > > > for the complete model description:
> > > > > > >
> > > > > > > How agent_1 (in your terms) rebuilds the global cache:
> > > > > > >
> > > > > > > 1a) lock()
> > > > > > > 1b) rebuild(global cache)
> > > > > > > 1c) update(dev_gen)
> > > > > > > 1d) wmb()
> > > > > > > 1e) unlock()
> > > > > > >
> > > > > > > How agent_2 checks:
> > > > > > >
> > > > > > > 2a) check(dev_gen) (assume positive - changed)
> > > > > > > 2b) clear(local_cache)
> > > > > > > 2c) miss(on empty local_cache) -> eventually it goes to mr_lookup_caches()
> > > > > > > 2d) lock()
> > > > > > > 2e) get(new MR)
> > > > > > > 2f) unlock()
> > > > > > > 2g) update(local cache with obtained new MR)
> > > > > > >
> > > > > > > Hence, even if 1c) becomes visible in 2a) before 1b) is
> > > > > > > committed (say, due to an out-of-order arch), agent_2 would be
> > > > > > > blocked on 2d) and the scenario depicted in your Fig2 would not
> > > > > > > happen (agent_2 will wait before step 3 till agent_1 unlocks
> > > > > > > after its step 5).
> > > > > > > With best regards,
> > > > > > > Slava
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com>
> > > > > > > > Sent: Friday, May 7, 2021 9:36
> > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > > > >
> > > > > > > > Hi, Slava
> > > > > > > >
> > > > > > > > Thanks very much for your reply.
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > > > > > > > Sent: Thursday, May 6, 2021 19:22
> > > > > > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > > > > >
> > > > > > > > > Hi, Feifei
> > > > > > > > >
> > > > > > > > > Sorry, I do not follow why we should get rid of the last
> > > > > > > > > (after dev_gen update) wmb.
> > > > > > > > > We've rebuilt the global cache; we should notify other
> > > > > > > > > agents that it happened and they should flush local caches.
> > > > > > > > > So, the dev_gen change should be made visible to other
> > > > > > > > > agents to trigger this activity, and the second wmb is here
> > > > > > > > > to ensure this.
> > > > > > > >
> > > > > > > > 1.
For the first question, why we should get rid of the last wmb and move it before dev_gen is updated: the key point is how the wmb implements synchronization between multiple agents.
> > > > > > > >
> > > > > > > > Fig1
> > > > > > > > --------------------------------------------------------------------
> > > > > > > > Timeslot        agent_1                      agent_2
> > > > > > > > 1          rebuild global cache
> > > > > > > > 2          wmb
> > > > > > > > 3          update dev_gen ----------------> load changed dev_gen
> > > > > > > > 4                                           rebuild local cache
> > > > > > > > --------------------------------------------------------------------
> > > > > > > >
> > > > > > > > First, the wmb is only for the local thread, to keep the order
> > > > > > > > between local write-write: based on the picture above, for
> > > > > > > > agent_1, the wmb keeps the order that rebuilding the global
> > > > > > > > cache always happens before updating dev_gen.
> > > > > > > >
> > > > > > > > Second, agent_1 communicates with agent_2 through the global
> > > > > > > > variable "dev_gen": if agent_1 updates dev_gen, agent_2 will
> > > > > > > > load it and then it knows it should rebuild the local cache.
> > > > > > > >
> > > > > > > > Finally, agent_2 rebuilds the local cache according to whether
> > > > > > > > agent_1 has rebuilt the global cache, and agent_2 knows this
> > > > > > > > information through the variable "dev_gen".
> > > > > > > > Fig2
> > > > > > > > --------------------------------------------------------------------
> > > > > > > > Timeslot        agent_1                      agent_2
> > > > > > > > 1          update dev_gen
> > > > > > > > 2                                           load changed dev_gen
> > > > > > > > 3                                           rebuild local cache
> > > > > > > > 4          rebuild global cache
> > > > > > > > 5          wmb
> > > > > > > > --------------------------------------------------------------------
> > > > > > > >
> > > > > > > > However, on the Arm platform, if the wmb is after dev_gen is
> > > > > > > > updated, "dev_gen" may be updated before agent_1 rebuilds the
> > > > > > > > global cache; then agent_2 may receive the wrong message and
> > > > > > > > rebuild its local cache in advance.
> > > > > > > >
> > > > > > > > To summarize, it is not important at which time other agents
> > > > > > > > can see the changed global variable "dev_gen".
> > > > > > > > (Actually, a wmb after "dev_gen" cannot ensure the changed
> > > > > > > > "dev_gen" is committed to the global.)
> > > > > > > > It is more important that if other agents see the changed
> > > > > > > > "dev_gen", they also can know the global cache has been updated.
> > > > > > > >
> > > > > > > > > One more point: since registering new/destroying existing MRs
> > > > > > > > > involves FW (via kernel) calls, it takes so many CPU cycles
> > > > > > > > > that we could neglect the wmb overhead at all.
> > > > > > > >
> > > > > > > > We just move the last wmb into the right place, and do not
> > > > > > > > delete it, for performance.
> > > > > > > > > Also, regarding this:
> > > > > > > > >
> > > > > > > > > > > Another question suddenly occurred to me: in order to keep
> > > > > > > > > > > the order that the global cache is rebuilt before "dev_gen"
> > > > > > > > > > > is updated, the wmb should be before updating "dev_gen"
> > > > > > > > > > > rather than after it.
> > > > > > > > > > > Otherwise, on out-of-order platforms, the current order
> > > > > > > > > > > cannot be kept.
> > > > > > > > >
> > > > > > > > > it is not clear why ordering is important - global cache update
> > > > > > > > > and dev_gen change happen under spinlock protection, so only
> > > > > > > > > the last wmb is meaningful.
> > > > > > > >
> > > > > > > > 2. The second function of the wmb before "dev_gen" is updated is
> > > > > > > > performance, according to our previous discussion.
> > > > > > > > According to Fig2, if there is no wmb between "global cache
> > > > > > > > updated" and "dev_gen updated", "dev_gen" may be updated before
> > > > > > > > the global cache is updated.
> > > > > > > >
> > > > > > > > Then agent_2 may see the changed "dev_gen" and flush the entire
> > > > > > > > local cache in advance.
> > > > > > > >
> > > > > > > > This entire flush can degrade the performance:
> > > > > > > > "the local cache is getting empty and can't provide translation
> > > > > > > > for other valid (not being removed) MRs, and the translation has
> > > > > > > > to look up in the global cache, that is locked now for rebuilding,
> > > > > > > > this causes the delays in data path on acquiring global cache lock."
> > > > > > > >
> > > > > > > > Furthermore, the spinlock is just for the global cache, not for
> > > > > > > > dev_gen and the local cache.
> > > > > > > >
> > > > > > > > > To summarize, in my opinion:
> > > > > > > > > - if you see some issue with ordering of global cache
> > > > > > > > > update/dev_gen signalling, could you, please, elaborate? I'm not
> > > > > > > > > sure we should maintain an order (due to spinlock protection)
> > > > > > > > > - the last rte_smp_wmb() after dev_gen incrementing should be
> > > > > > > > > kept intact
> > > > > > > >
> > > > > > > > Finally, in my view, moving the wmb before "dev_gen" for the
> > > > > > > > write-write order:
> > > > > > > > --------------------------------
> > > > > > > > a) rebuild global cache;
> > > > > > > > b) rte_smp_wmb();
> > > > > > > > c) update dev_gen
> > > > > > > > --------------------------------
> > > > > > > > serves two purposes:
> > > > > > > > 1. Achieve synchronization between multiple threads in the right way
> > > > > > > > 2. Prevent other agents from flushing their local cache early, to
> > > > > > > > ensure performance
> > > > > > > >
> > > > > > > > Best Regards
> > > > > > > > Feifei
> > > > > > > >
> > > > > > > > > With best regards,
> > > > > > > > > Slava
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com>
> > > > > > > > > > Sent: Thursday, May 6, 2021 5:52
> > > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > > > > > >
> > > > > > > > > > Hi, Slava
> > > > > > > > > >
> > > > > > > > > > Would you have more comments about this patch?
> > > > > > > > > > From my point of view, only one wmb before updating
> > > > > > > > > > "dev_gen" is enough to synchronize.
> > > > > > > > > >
> > > > > > > > > > Thanks very much for your attention.
> > > > > > > > > >
> > > > > > > > > > Best Regards
> > > > > > > > > > Feifei
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Feifei Wang
> > > > > > > > > > > Sent: Tuesday, April 20, 2021 16:42
> > > > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > > > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > > > > > > >
> > > > > > > > > > > Hi, Slava
> > > > > > > > > > >
> > > > > > > > > > > I think the second wmb can be removed.
> > > > > > > > > > > As I know, a wmb is just a barrier to keep the order
> > > > > > > > > > > between write and write, and it cannot tell the CPU when
> > > > > > > > > > > it should commit the changes.
> > > > > > > > > > >
> > > > > > > > > > > It is usually used before a guard variable, to keep the
> > > > > > > > > > > order that the guard variable is updated after some
> > > > > > > > > > > changes, which you want to release, have been done.
> > > > > > > > > > >
> > > > > > > > > > > For example, for the wmb after the global cache update /
> > > > > > > > > > > before altering dev_gen, it can ensure the order that the
> > > > > > > > > > > global cache is updated before dev_gen is altered:
> > > > > > > > > > > 1) If other agents load the changed "dev_gen", they can
> > > > > > > > > > > know the global cache has been updated.
> > > > > > > > > > > 2) If other agents load the unchanged "dev_gen", it means
> > > > > > > > > > > the global cache has not been updated, and the local cache
> > > > > > > > > > > will not be flushed.
> > > > > > > > > > >
> > > > > > > > > > > As a result, we use the wmb and the guard variable
> > > > > > > > > > > "dev_gen" to ensure the global cache update is "visible".
> > > > > > > > > > > "Visible" means that when the update of the guard variable
> > > > > > > > > > > "dev_gen" is known by other agents, they also can confirm
> > > > > > > > > > > the global cache has been updated in the meanwhile.
> > > > > > > > > > > Thus, just one wmb before altering dev_gen can ensure this.
> > > > > > > > > > >
> > > > > > > > > > > Best Regards
> > > > > > > > > > > Feifei
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > > > > > > > > > > Sent: Tuesday, April 20, 2021 15:54
> > > > > > > > > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> > > > > > > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > > > > > > > >
> > > > > > > > > > > > Hi, Feifei
> > > > > > > > > > > >
> > > > > > > > > > > > In my opinion, there should be 2 barriers:
> > > > > > > > > > > > - after global cache update/before altering dev_gen, to
> > > > > > > > > > > > ensure the correct order
> > > > > > > > > > > > - after altering dev_gen, to make this change visible to
> > > > > > > > > > > > other agents and to
trigger local cache update
> > > > > > > > > > > >
> > > > > > > > > > > > With best regards, Slava
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Feifei Wang <Feifei.Wang2@arm.com>
> > > > > > > > > > > > > Sent: Tuesday, April 20, 2021 10:30
> > > > > > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> > > > > > > > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi, Slava
> > > > > > > > > > > > >
> > > > > > > > > > > > > Another question suddenly occurred to me: in order to
> > > > > > > > > > > > > keep the order that the global cache is rebuilt before
> > > > > > > > > > > > > "dev_gen" is updated, the wmb should be before updating
> > > > > > > > > > > > > "dev_gen" rather than after it.
> > > > > > > > > > > > > Otherwise, on out-of-order platforms, the current order
> > > > > > > > > > > > > cannot be kept.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thus, we should change the code as:
> > > > > > > > > > > > > a) rebuild global cache;
> > > > > > > > > > > > > b) rte_smp_wmb();
> > > > > > > > > > > > > c) update dev_gen
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > > Feifei
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Feifei Wang
> > > > > > > > > > > > > > Sent: Tuesday, April 20, 2021 13:54
> > > > > > > > > > > > > > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> > > > > > > > > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi, Slava
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks very much for your explanation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I can understand that the app can wait until all mbufs
> > > > > > > > > > > > > > are returned to the memory pool, and then it can free
> > > > > > > > > > > > > > these mbufs; I agree with this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > As a result, I will remove the bug fix patch from this
> > > > > > > > > > > > > > series and just replace the SMP barrier with a C11
> > > > > > > > > > > > > > thread fence. Thanks very much for your patient
> > > > > > > > > > > > > > explanation again.
> > > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > > > Feifei
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > > > > > > > > > > > > > Sent: Tuesday, April 20, 2021 2:51
> > > > > > > > > > > > > > > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> > > > > > > > > > > > > > > Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> > > > > > > > > > > > > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi, Feifei
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Please, see below
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ....
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi, Feifei
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes.
> > > > > > > > > > > > > > > > > Do we have some issue/bug with the MR cache in practice?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch fixes a bug which is based on logical
> > > > > > > > > > > > > > > > deduction, and it doesn't actually happen.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache
> > > > > > > > > > > > > > > > > for MRs to convert buffer addresses in mbufs being
> > > > > > > > > > > > > > > > > transmitted to LKeys (HW-related entity handle) and
> > > > > > > > > > > > > > > > > the "global" cache for all MRs registered on the device.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath:
> > > > > > > > > > > > > > > > > - check the local queue cache flush request
> > > > > > > > > > > > > > > > > - lookup in local cache
> > > > > > > > > > > > > > > > > - if not found:
> > > > > > > > > > > > > > > > >   - acquire lock for global cache read access
> > > > > > > > > > > > > > > > >   - lookup in global cache
> > > > > > > > > > > > > > > > >   - release lock for global cache
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > How cache update on memory freeing/unregistering happens:
> > > > > > > > > > > > > > > > > - acquire lock for global cache write access
> > > > > > > > > > > > > > > > > - [a] remove relevant MRs from the global cache
> > > > > > > > > > > > > > > > > - [b] set local caches flush request
> > > > > > > > > > > > > > > > > - free global cache lock
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps [a]
> > > > > > > > > > > > > > > > > and [b], and local caches flush is requested
> > > > > > > > > > > > > > > > > earlier. What problem does it solve?
> > > > > > > > > > > > > > > > > It is not supposed that there are, in the
> > > > > > > > > > > > > > > > > datapath, some mbufs referencing the memory being freed.
> > > > > > > > > > > > > > > > > Application must ensure this and must not allocate
> > > > > > > > > > > > > > > > > new mbufs from the memory regions being freed.
> > > > > > > > > > > > > > > > > Hence, the lookups for these MRs in caches should not occur.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > For your first point, that the application can take
> > > > > > > > > > > > > > > > charge of preventing MR-freed memory from being
> > > > > > > > > > > > > > > > allocated to the data path:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Does it mean that if there is an emergency of MR
> > > > > > > > > > > > > > > > fragmentation, such as hotplug, the application must
> > > > > > > > > > > > > > > > inform the data path in advance, this memory will not
> > > > > > > > > > > > > > > > be allocated, and then the control path will free this
> > > > > > > > > > > > > > > > memory? If the application can do like this, I agree
> > > > > > > > > > > > > > > > that this bug cannot happen.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Actually, this is the only correct way for the
> > > > > > > > > > > > > > > application to operate.
> > > > > > > > > > > > > > > Let's suppose we have some memory area that the
> > > > > > > > > > > > > > > application wants to free.
> > > > > > > > > > > > > > > ALL references to this area must be removed.
> > > > > > > > > > > > > > > If we have some mbufs allocated from this area, it
> > > > > > > > > > > > > > > means that we have a memory pool created there.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > What the application should do:
> > > > > > > > > > > > > > > - notify all its components/agents that the memory
> > > > > > > > > > > > > > > area is going to be freed
> > > > > > > > > > > > > > > - all components/agents free the mbufs they might own
> > > > > > > > > > > > > > > - PMD might not support freeing for some mbufs (for
> > > > > > > > > > > > > > > example, being sent and awaiting completion), so the
> > > > > > > > > > > > > > > app should just wait
> > > > > > > > > > > > > > > - wait till all mbufs are returned to the memory pool
> > > > > > > > > > > > > > > (by monitoring available obj == pool size)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Otherwise, it is dangerous to free the memory. If
> > > > > > > > > > > > > > > there are just some mbufs still allocated, it is
> > > > > > > > > > > > > > > regardless of buf address to MR translation. We just
> > > > > > > > > > > > > > > can't free the memory - the mapping will be destroyed
> > > > > > > > > > > > > > > and might cause a segmentation fault by SW or some HW
> > > > > > > > > > > > > > > issues on DMA access to unmapped memory. It is a very
> > > > > > > > > > > > > > > generic safety approach - do not free the memory that
> > > > > > > > > > > > > > > is still in use. Hence, at the moment of freeing and
> > > > > > > > > > > > > > > unregistering the MR, there MUST BE NO mbufs in flight
> > > > > > > > > > > > > > > referencing the addresses being freed.
> > > > > > > > > > > > > > > No translation to an MR being invalidated can happen.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On the other side, the cache flush has a negative
> > > > > > > > > > > > > > > > > effect - the local cache is getting empty and can't
> > > > > > > > > > > > > > > > > provide translation for other valid (not being
> > > > > > > > > > > > > > > > > removed) MRs, and the translation has to look up in
> > > > > > > > > > > > > > > > > the global cache, which is locked now for rebuilding;
> > > > > > > > > > > > > > > > > this causes delays in the datapath on acquiring the
> > > > > > > > > > > > > > > > > global cache lock.
> > > > > > > > > > > > > > > > > So, I see some potential performance impact.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If the above assumption is true, we can go to your
> > > > > > > > > > > > > > > > second point. I think this is a problem of the
> > > > > > > > > > > > > > > > tradeoff between cache coherence and performance.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I can understand your meaning that, though the global
> > > > > > > > > > > > > > > > cache has been changed, we should keep the valid MRs
> > > > > > > > > > > > > > > > in the local cache as long as possible to ensure fast
> > > > > > > > > > > > > > > > searching speed.
> > > > > > > > > > > > > > > > In the meanwhile, the local cache can be rebuilt later
> > > > > > > > > > > > > > > > to reduce its waiting time for acquiring the global
> > > > > > > > > > > > > > > > cache lock.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > However, this mechanism just keeps the performance
> > > > > > > > > > > > > > > > unchanged for the first few mbufs.
> > > > > > > > > > > > > > > > During the lkey searching for the next mbufs after
> > > > > > > > > > > > > > > > 'dev_gen' is updated, it is still necessary to update
> > > > > > > > > > > > > > > > the local cache.
> > > > > > > > > > > > > > > > And the performance can first drop and then return.
> > > > > > > > > > > > > > > > Thus, no matter whether there is this patch or not,
> > > > > > > > > > > > > > > > the performance will jitter in a certain period of time.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The local cache should be updated to remove MRs no
> > > > > > > > > > > > > > > longer valid. But we just flush the entire cache.
> > > > > > > > > > > > > > > Let's suppose we have valid MR0, MR1, and a not valid
> > > > > > > > > > > > > > > MRX in the local cache.
> > > > > > > > > > > > > > > And there is traffic in the datapath for MR0 and MR1,
> > > > > > > > > > > > > > > and no traffic for MRX anymore.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1) If we do as you propose:
> > > > > > > > > > > > > > > a) take a lock
> > > > > > > > > > > > > > > b) request flush of local cache first - all MR0, MR1,
> > > > > > > > > > > > > > > MRX will be removed on translation in datapath
> > > > > > > > > > > > > > > c) update global cache
> > > > > > > > > > > > > > > d) free lock
> > > > > > > > > > > > > > > All the traffic for valid MR0, MR1 ALWAYS will be
> > > > > > > > > > > > > > > blocked on the lock taken for cache update since point
> > > > > > > > > > > > > > > b) till point d).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2) If we do as it is implemented now:
> > > > > > > > > > > > > > > a) take a lock
> > > > > > > > > > > > > > > b) update global cache
> > > > > > > > > > > > > > > c) request flush of local cache
> > > > > > > > > > > > > > > d) free lock
> > > > > > > > > > > > > > > The traffic MIGHT be locked ONLY for MRs non-existing
> > > > > > > > > > > > > > > in the local cache (does not happen for MR0 and MR1,
> > > > > > > > > > > > > > > must not happen for MRX), and the probability should be
> > > > > > > > > > > > > > > minor. And the lock might happen since c) till d) -
> > > > > > > > > > > > > > > quite a short period of time.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Summary, the difference between 1) and 2):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Lock probability:
> > > > > > > > > > > > > > > - 1) lock ALWAYS happens for ANY MR translation after b),
> > > > > > > > > > > > > > >   2) lock MIGHT happen, for cache miss ONLY, after c)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Lock duration:
> > > > > > > > > > > > > > > - 1) lock since b) till d),
> > > > > > > > > > > > > > >   2) lock since c) till d), which seems to be much shorter.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the
> > > > > > > > > > > > > > > > bottom layer can do more things to ensure the correct
> > > > > > > > > > > > > > > > execution of the program, which may have a negative
> > > > > > > > > > > > > > > > impact on the performance in a short time, but in the
> > > > > > > > > > > > > > > > long run, the performance will eventually come back.
> > > > > > > > > > > > > > > > Furthermore, maybe we should pay attention to the
> > > > > > > > > > > > > > > > performance in the stable period, and try our best to
> > > > > > > > > > > > > > > > ensure the correctness of the program in case of
> > > > > > > > > > > > > > > > emergencies.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If we have some mbufs still allocated in memory being
> > > > > > > > > > > > > > > freed - there is nothing to say about correctness; it
> > > > > > > > > > > > > > > is totally incorrect. In my opinion, we should not
> > > > > > > > > > > > > > > think how to mitigate this incorrect behavior; we
> > > > > > > > > > > > > > > should not encourage application developers to follow
> > > > > > > > > > > > > > > the wrong approaches.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > With best regards, Slava
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best Regards
> > > > > > > > > > > > > > > > Feifei
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > With best regards, Slava

^ permalink raw reply	[flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v1 4/4] net/mlx5: replace SMP barriers with C11 barriers 2021-03-18 7:18 [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Feifei Wang ` (2 preceding siblings ...) 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache Feifei Wang @ 2021-03-18 7:18 ` Feifei Wang 2021-04-07 1:45 ` [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Alexander Kozyrev ` (2 subsequent siblings) 6 siblings, 0 replies; 36+ messages in thread From: Feifei Wang @ 2021-03-18 7:18 UTC (permalink / raw) To: Matan Azrad, Shahaf Shuler, Viacheslav Ovsiienko Cc: dev, nd, Feifei Wang, Ruifeng Wang, Honnappa Nagarahalli Replace SMP barrier with atomic thread fence. Signed-off-by: Feifei Wang <feifei.wang2@arm.com> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> --- drivers/net/mlx5/mlx5_mr.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c index 7ce1d3e64..650fe9093 100644 --- a/drivers/net/mlx5/mlx5_mr.c +++ b/drivers/net/mlx5/mlx5_mr.c @@ -109,11 +109,11 @@ mlx5_mr_mem_event_free_cb(struct mlx5_dev_ctx_shared *sh, /* * Flush local caches by propagating invalidation across cores. - * rte_smp_wmb() is to keep the order that dev_gen updated before + * release-fence is to keep the order that dev_gen updated before * rebuilding global cache. Therefore, other core can flush their * local cache on time. */ - rte_smp_wmb(); + rte_atomic_thread_fence(__ATOMIC_RELEASE); mlx5_mr_rebuild_cache(&sh->share_cache); } rte_rwlock_write_unlock(&sh->share_cache.rwlock); @@ -412,11 +412,11 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr, /* * Flush local caches by propagating invalidation across cores. - * rte_smp_wmb() is to keep the order that dev_gen updated before + * release-fence is to keep the order that dev_gen updated before * rebuilding global cache. 
Therefore, other core can flush their * local cache on time. */ - rte_smp_wmb(); + rte_atomic_thread_fence(__ATOMIC_RELEASE); mlx5_mr_rebuild_cache(&sh->share_cache); rte_rwlock_read_unlock(&sh->share_cache.rwlock); return 0; -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx 2021-03-18 7:18 [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Feifei Wang ` (3 preceding siblings ...) 2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 4/4] net/mlx5: replace SMP barriers with C11 barriers Feifei Wang @ 2021-04-07 1:45 ` Alexander Kozyrev 2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 0/2] remove wmb " Feifei Wang 2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Feifei Wang 6 siblings, 0 replies; 36+ messages in thread From: Alexander Kozyrev @ 2021-04-07 1:45 UTC (permalink / raw) To: Feifei Wang; +Cc: dev, nd Looks good to me. Thank you for fixing this bug for us. Reviewed-by: Alexander Kozyrev <akozyrev@nvidia.com> > -----Original Message----- > From: dev <dev-bounces@dpdk.org> On Behalf Of Feifei Wang > Sent: Thursday, March 18, 2021 3:19 > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com> > Subject: [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx > > For net/mlx4 and net/mlx5, fix cache rebuild bug and replace SMP > barriers with atomic fence. > > Feifei Wang (4): > net/mlx4: fix rebuild bug for Memory Region cache > net/mlx4: replace SMP barrier with C11 barriers > net/mlx5: fix rebuild bug for Memory Region cache > net/mlx5: replace SMP barriers with C11 barriers > > drivers/net/mlx4/mlx4_mr.c | 21 +++++++++---------- > drivers/net/mlx5/mlx5_mr.c | 41 ++++++++++++++++++-------------------- > 2 files changed, 28 insertions(+), 34 deletions(-) > > -- > 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v2 0/2] remove wmb for net/mlx 2021-03-18 7:18 [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Feifei Wang ` (4 preceding siblings ...) 2021-04-07 1:45 ` [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Alexander Kozyrev @ 2021-05-17 10:00 ` Feifei Wang 2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang 2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: " Feifei Wang 2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Feifei Wang 6 siblings, 2 replies; 36+ messages in thread From: Feifei Wang @ 2021-05-17 10:00 UTC (permalink / raw) Cc: dev, nd, Feifei Wang For net/mlx4 and net/mlx5, remove unnecessary wmb for Memory Region cache. v2: 1. keep the order of dev_gen and global cache (Slava Ovsiienko) 2. remove the wmb at last instead of moving it forward 3. remove atomic_thread_fence patches Feifei Wang (2): net/mlx4: remove unnecessary wmb for Memory Region cache net/mlx5: remove unnecessary wmb for Memory Region cache drivers/net/mlx4/mlx4_mr.c | 13 +++++-------- drivers/net/mlx5/mlx5_mr.c | 26 ++++++++++---------------- 2 files changed, 15 insertions(+), 24 deletions(-) -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v2 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache 2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 0/2] remove wmb " Feifei Wang @ 2021-05-17 10:00 ` Feifei Wang 2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: " Feifei Wang 1 sibling, 0 replies; 36+ messages in thread From: Feifei Wang @ 2021-05-17 10:00 UTC (permalink / raw) To: Matan Azrad, Shahaf Shuler; +Cc: dev, nd, Feifei Wang, Ruifeng Wang 'dev_gen' is a variable to inform other cores to flush their local cache when global cache is rebuilt. It is unnecessary to add write memory barrier (wmb) before or after its updating for synchronization. This is due to MR cache's R/W lock can maintain synchronization between threads: 1. dev_gen and global cache update ordering inside the lock protected section does not matter. Because other threads cannot take the lock until global cache has been updated. Thus, in out of order platform, even if other agents firstly observed updated dev_gen but global does not update, they also needs to wait the lock. As a result, it is unnecessary to add a wmb between rebuiling global cache and updating dev_gen to keep the order of rebuilding global cache and updating dev_gen. 2. Store-Release of unlock can provide the implicit wmb at the level visible by software. This makes 'rebuiling global cache' and 'updating dev_gen' be observed before local_cache starts to be updated by other agents. Thus, wmb after 'updating dev_gen' can be removed. 
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> Signed-off-by: Feifei Wang <feifei.wang2@arm.com> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> --- drivers/net/mlx4/mlx4_mr.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/net/mlx4/mlx4_mr.c b/drivers/net/mlx4/mlx4_mr.c index 6b2f0cf187..9a396f5729 100644 --- a/drivers/net/mlx4/mlx4_mr.c +++ b/drivers/net/mlx4/mlx4_mr.c @@ -948,18 +948,15 @@ mlx4_mr_mem_event_free_cb(struct rte_eth_dev *dev, const void *addr, size_t len) if (rebuild) { mr_rebuild_dev_cache(dev); /* - * Flush local caches by propagating invalidation across cores. - * rte_smp_wmb() is enough to synchronize this event. If one of - * freed memsegs is seen by other core, that means the memseg - * has been allocated by allocator, which will come after this - * free call. Therefore, this store instruction (incrementing - * generation below) will be guaranteed to be seen by other core - * before the core sees the newly allocated memory. + * No wmb is needed after updating dev_gen due to store-release of + * unlock can provide the implicit wmb at the level visible by + * software. This makes rebuilt global cache and updated dev_gen + * be observed when local_cache starts to be updating by other + * agents. */ ++priv->mr.dev_gen; DEBUG("broadcasting local cache flush, gen=%d", priv->mr.dev_gen); - rte_smp_wmb(); } rte_rwlock_write_unlock(&priv->mr.rwlock); #ifdef RTE_LIBRTE_MLX4_DEBUG -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v2 2/2] net/mlx5: remove unnecessary wmb for Memory Region cache 2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 0/2] remove wmb " Feifei Wang 2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang @ 2021-05-17 10:00 ` Feifei Wang 2021-05-17 14:15 ` Slava Ovsiienko 1 sibling, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-05-17 10:00 UTC (permalink / raw) To: Matan Azrad, Shahaf Shuler, Viacheslav Ovsiienko Cc: dev, nd, Feifei Wang, Ruifeng Wang 'dev_gen' is a variable to inform other cores to flush their local cache when global cache is rebuilt. It is unnecessary to add write memory barrier (wmb) before or after its updating for synchronization. This is due to MR cache's R/W lock can maintain synchronization between threads: 1. dev_gen and global cache update ordering inside the lock protected section does not matter. Because other threads cannot take the lock until global cache has been updated. Thus, in out of order platform, even if other agents firstly observed updated dev_gen but global does not update, they also needs to wait the lock. As a result, it is unnecessary to add a wmb between rebuiling global cache and updating dev_gen to keep the order of rebuilding global cache and updating dev_gen. 2. Store-Release of unlock can provide the implicit wmb at the level visible by software. This makes 'rebuiling global cache' and 'updating dev_gen' be observed before local_cache starts to be updated by other agents. Thus, wmb after 'updating dev_gen' can be removed. 
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> Signed-off-by: Feifei Wang <feifei.wang2@arm.com> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> --- drivers/net/mlx5/mlx5_mr.c | 26 ++++++++++---------------- 1 file changed, 10 insertions(+), 16 deletions(-) diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c index e791b6338d..85e5865050 100644 --- a/drivers/net/mlx5/mlx5_mr.c +++ b/drivers/net/mlx5/mlx5_mr.c @@ -107,18 +107,15 @@ mlx5_mr_mem_event_free_cb(struct mlx5_dev_ctx_shared *sh, if (rebuild) { mlx5_mr_rebuild_cache(&sh->share_cache); /* - * Flush local caches by propagating invalidation across cores. - * rte_smp_wmb() is enough to synchronize this event. If one of - * freed memsegs is seen by other core, that means the memseg - * has been allocated by allocator, which will come after this - * free call. Therefore, this store instruction (incrementing - * generation below) will be guaranteed to be seen by other core - * before the core sees the newly allocated memory. + * No wmb is needed after updating dev_gen due to store-release of + * unlock can provide the implicit wmb at the level visible by + * software. This makes rebuilt global cache and updated dev_gen + * be observed when local_cache starts to be updating by other + * agents. */ ++sh->share_cache.dev_gen; DRV_LOG(DEBUG, "broadcasting local cache flush, gen=%d", sh->share_cache.dev_gen); - rte_smp_wmb(); } rte_rwlock_write_unlock(&sh->share_cache.rwlock); } @@ -411,18 +408,15 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr, (void *)mr); mlx5_mr_rebuild_cache(&sh->share_cache); /* - * Flush local caches by propagating invalidation across cores. - * rte_smp_wmb() is enough to synchronize this event. If one of - * freed memsegs is seen by other core, that means the memseg - * has been allocated by allocator, which will come after this - * free call. 
Therefore, this store instruction (incrementing - * generation below) will be guaranteed to be seen by other core - * before the core sees the newly allocated memory. + * No wmb is needed after updating dev_gen due to store-release of + * unlock can provide the implicit wmb at the level visible by + * software. This makes rebuilt global cache and updated dev_gen + * be observed when local_cache starts to be updating by other + * agents. */ ++sh->share_cache.dev_gen; DRV_LOG(DEBUG, "broadcasting local cache flush, gen=%d", sh->share_cache.dev_gen); - rte_smp_wmb(); rte_rwlock_read_unlock(&sh->share_cache.rwlock); return 0; } -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v2 2/2] net/mlx5: remove unnecessary wmb for Memory Region cache 2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: " Feifei Wang @ 2021-05-17 14:15 ` Slava Ovsiienko 2021-05-18 8:52 ` [dpdk-dev] 回复: " Feifei Wang 0 siblings, 1 reply; 36+ messages in thread From: Slava Ovsiienko @ 2021-05-17 14:15 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler; +Cc: dev, nd, Ruifeng Wang Hi, Feifei Thanks you for the patch. Please, see my notes below about typos and minor commit message rewording. > -----Original Message----- > From: Feifei Wang <feifei.wang2@arm.com> > Sent: Monday, May 17, 2021 13:00 > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com> > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>; > Ruifeng Wang <ruifeng.wang@arm.com> > Subject: [PATCH v2 2/2] net/mlx5: remove unnecessary wmb for Memory > Region cache > > 'dev_gen' is a variable to inform other cores to flush their local cache when > global cache is rebuilt. It is unnecessary to add write memory barrier (wmb) > before or after its updating for synchronization. > Would it be better "to trigger all cores to flush their local caches once the global MR cache has been rebuilt" ? > This is due to MR cache's R/W lock can maintain synchronization between > threads: I would add empty line here. > 1. dev_gen and global cache update ordering inside the lock protected > section does not matter. Because other threads cannot take the lock until > global cache has been updated. Thus, in out of order platform, even if other > agents firstly observed updated dev_gen but global does not update, they > also needs to wait the lock. As a result, it is unnecessary to add a wmb Type: "need" (no S) -> "have to" would be better ? > between rebuiling global cache and updating dev_gen to keep the order rebuiling -> rebuilDing And let's reword a little bit? 
"wmb between global cache rebuilding and updating the dev_gen to keep the memory store order." > rebuilding global cache and updating dev_gen. > > 2. Store-Release of unlock can provide the implicit wmb at the level visible by can provide -> provides > software. This makes 'rebuiling global cache' and 'updating dev_gen' be Typo: rebuiling -> rebuilDing > observed before local_cache starts to be updated by other agents. Thus, > wmb after 'updating dev_gen' can be removed. > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > --- > drivers/net/mlx5/mlx5_mr.c | 26 ++++++++++---------------- > 1 file changed, 10 insertions(+), 16 deletions(-) > > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c index > e791b6338d..85e5865050 100644 > --- a/drivers/net/mlx5/mlx5_mr.c > +++ b/drivers/net/mlx5/mlx5_mr.c > @@ -107,18 +107,15 @@ mlx5_mr_mem_event_free_cb(struct > mlx5_dev_ctx_shared *sh, > if (rebuild) { > mlx5_mr_rebuild_cache(&sh->share_cache); > /* > - * Flush local caches by propagating invalidation across cores. > - * rte_smp_wmb() is enough to synchronize this event. If > one of > - * freed memsegs is seen by other core, that means the > memseg > - * has been allocated by allocator, which will come after this > - * free call. Therefore, this store instruction (incrementing > - * generation below) will be guaranteed to be seen by other > core > - * before the core sees the newly allocated memory. > + * No wmb is needed after updating dev_gen due to store- > release of > + * unlock can provide the implicit wmb at the level visible by > + * software. This makes rebuilt global cache and updated > dev_gen > + * be observed when local_cache starts to be updating by > other > + * agents. > */ Let's make comment a less wordy (and try to keep source code concise), what about this? 
"No explicit wmb is needed after updating dev_gen due to store-release ordering in unlock that provides the implicit barrier at the software visible level." > ++sh->share_cache.dev_gen; > DRV_LOG(DEBUG, "broadcasting local cache flush, gen=%d", > sh->share_cache.dev_gen); > - rte_smp_wmb(); > } > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > } > @@ -411,18 +408,15 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, > void *addr, > (void *)mr); > mlx5_mr_rebuild_cache(&sh->share_cache); > /* > - * Flush local caches by propagating invalidation across cores. > - * rte_smp_wmb() is enough to synchronize this event. If one of > - * freed memsegs is seen by other core, that means the memseg > - * has been allocated by allocator, which will come after this > - * free call. Therefore, this store instruction (incrementing > - * generation below) will be guaranteed to be seen by other core > - * before the core sees the newly allocated memory. > + * No wmb is needed after updating dev_gen due to store-release of > + * unlock can provide the implicit wmb at the level visible by > + * software. This makes rebuilt global cache and updated dev_gen > + * be observed when local_cache starts to be updating by other > + * agents. The same as previous comment above. Please, apply the same comments to the mlx4 patch: http://patches.dpdk.org/project/dpdk/patch/20210517100002.19905-2-feifei.wang2@arm.com/ With best regards, Slava ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] Re: [PATCH v2 2/2] net/mlx5: remove unnecessary wmb for Memory Region cache 2021-05-17 14:15 ` Slava Ovsiienko @ 2021-05-18 8:52 ` Feifei Wang 0 siblings, 0 replies; 36+ messages in thread From: Feifei Wang @ 2021-05-18 8:52 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler; +Cc: dev, nd, Ruifeng Wang, nd Hi, Slava > -----Original Message----- > From: Slava Ovsiienko <viacheslavo@nvidia.com> > Sent: May 17, 2021 22:15 > To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com> > Cc: dev@dpdk.org; nd <nd@arm.com>; Ruifeng Wang > <Ruifeng.Wang@arm.com> > Subject: RE: [PATCH v2 2/2] net/mlx5: remove unnecessary wmb for Memory > Region cache > > Hi, Feifei > > Thanks you for the patch. > Please, see my notes below about typos and minor commit message > rewording. Thanks very much for your very careful reviewing. I will apply these in the next version. > > > -----Original Message----- > > From: Feifei Wang <feifei.wang2@arm.com> > > Sent: Monday, May 17, 2021 13:00 > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > > <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com> > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>; > > Ruifeng Wang <ruifeng.wang@arm.com> > > Subject: [PATCH v2 2/2] net/mlx5: remove unnecessary wmb for Memory > > Region cache > > > > 'dev_gen' is a variable to inform other cores to flush their local > > cache when global cache is rebuilt. It is unnecessary to add write > > memory barrier (wmb) before or after its updating for synchronization. > > > Would it be better "to trigger all cores to flush their local caches once the > global MR cache has been rebuilt" ? > 1.Yes, I think this can be more clear. > > This is due to MR cache's R/W lock can maintain synchronization > > between > > threads: > I would add empty line here. 2.Done. > > 1. dev_gen and global cache update ordering inside the lock protected > > section does not matter. 
Because other threads cannot take the lock > > until global cache has been updated. Thus, in out of order platform, > > even if other agents firstly observed updated dev_gen but global does > > not update, they also needs to wait the lock. As a result, it is > > unnecessary to add a wmb > Type: "need" (no S) -> "have to" would be better ? > 3.Done. > > between rebuiling global cache and updating dev_gen to keep the order > > rebuiling -> rebuilding 4.Done. > And let's reword a little bit? > "wmb between global cache rebuilding and updating the dev_gen to keep > the memory store order." > 5.Done. > > rebuilding global cache and updating dev_gen. > > > > 2. Store-Release of unlock can provide the implicit wmb at the level > > visible by > can provide -> provides > 6.Done. > > software. This makes 'rebuiling global cache' and 'updating dev_gen' > > be > Typo: rebuiling -> rebuilding 7.Done. > > > > observed before local_cache starts to be updated by other agents. > > Thus, wmb after 'updating dev_gen' can be removed. > > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> > > --- > > drivers/net/mlx5/mlx5_mr.c | 26 ++++++++++---------------- > > 1 file changed, 10 insertions(+), 16 deletions(-) > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c > > index > > e791b6338d..85e5865050 100644 > > --- a/drivers/net/mlx5/mlx5_mr.c > > +++ b/drivers/net/mlx5/mlx5_mr.c > > @@ -107,18 +107,15 @@ mlx5_mr_mem_event_free_cb(struct > > mlx5_dev_ctx_shared *sh, > > if (rebuild) { > > mlx5_mr_rebuild_cache(&sh->share_cache); > > /* > > - * Flush local caches by propagating invalidation across cores. > > - * rte_smp_wmb() is enough to synchronize this event. If > > one of > > - * freed memsegs is seen by other core, that means the > > memseg > > - * has been allocated by allocator, which will come after this > > - * free call. 
Therefore, this store instruction (incrementing > > - * generation below) will be guaranteed to be seen by other > > core > > - * before the core sees the newly allocated memory. > > + * No wmb is needed after updating dev_gen due to store- > > release of > > + * unlock can provide the implicit wmb at the level visible by > > + * software. This makes rebuilt global cache and updated > > dev_gen > > + * be observed when local_cache starts to be updating by > > other > > + * agents. > > */ > Let's make comment a less wordy (and try to keep source code concise), > what about this? > "No explicit wmb is needed after updating dev_gen due to store-release > ordering in unlock that provides the implicit barrier at the software visible > level." 8.That's better than before. A concise comment works better in the code. > > > ++sh->share_cache.dev_gen; > > DRV_LOG(DEBUG, "broadcasting local cache flush, gen=%d", > > sh->share_cache.dev_gen); > > - rte_smp_wmb(); > > } > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > } > > @@ -411,18 +408,15 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, > void > > *addr, > > (void *)mr); > > mlx5_mr_rebuild_cache(&sh->share_cache); > > /* > > - * Flush local caches by propagating invalidation across cores. > > - * rte_smp_wmb() is enough to synchronize this event. If one of > > - * freed memsegs is seen by other core, that means the memseg > > - * has been allocated by allocator, which will come after this > > - * free call. Therefore, this store instruction (incrementing > > - * generation below) will be guaranteed to be seen by other core > > - * before the core sees the newly allocated memory. > > + * No wmb is needed after updating dev_gen due to store-release of > > + * unlock can provide the implicit wmb at the level visible by > > + * software. This makes rebuilt global cache and updated dev_gen > > + * be observed when local_cache starts to be updating by other > > + * agents. > The same as previous comment above. 9.Done. 
> > Please, apply the same comments to the mlx4 patch: > http://patches.dpdk.org/project/dpdk/patch/20210517100002.19905-2- > feifei.wang2@arm.com/ > 10.Done. Best Regards Feifei > With best regards, > Slava ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx 2021-03-18 7:18 [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Feifei Wang ` (5 preceding siblings ...) 2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 0/2] remove wmb " Feifei Wang @ 2021-05-18 8:50 ` Feifei Wang 2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang ` (2 more replies) 6 siblings, 3 replies; 36+ messages in thread From: Feifei Wang @ 2021-05-18 8:50 UTC (permalink / raw) Cc: dev, nd, Feifei Wang For net/mlx4 and net/mlx5, remove unnecessary wmb for Memory Region cache. v2: 1. keep the order of dev_gen and global cache (Slava Ovsiienko) 2. remove the wmb at last instead of moving it forward 3. remove atomic_thread_fence patches v3: 1. commit message rewording (Slava Ovsiienko) Feifei Wang (2): net/mlx4: remove unnecessary wmb for Memory Region cache net/mlx5: remove unnecessary wmb for Memory Region cache drivers/net/mlx4/mlx4_mr.c | 11 +++-------- drivers/net/mlx5/mlx5_mr.c | 22 ++++++---------------- 2 files changed, 9 insertions(+), 24 deletions(-) -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v3 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache 2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Feifei Wang @ 2021-05-18 8:50 ` Feifei Wang 2021-05-18 12:13 ` Slava Ovsiienko 2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 2/2] net/mlx5: " Feifei Wang 2021-05-27 8:37 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Raslan Darawsheh 2 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-05-18 8:50 UTC (permalink / raw) To: Matan Azrad, Shahaf Shuler; +Cc: dev, nd, Feifei Wang, Ruifeng Wang 'dev_gen' is a variable to trigger all cores to flush their local caches once the global MR cache has been rebuilt. This is due to MR cache's R/W lock can maintain synchronization between threads: 1. dev_gen and global cache updating ordering inside the lock protected section does not matter. Because other threads cannot take the lock until global cache has been updated. Thus, in out of order platform, even if other agents firstly observe updated dev_gen but global does not update, they still have to wait the lock. As a result, it is unnecessary to add a wmb between global cache rebuilding and updating the dev_gen to keep the memory store order. 2. Store-Release of unlock provides the implicit wmb at the level visible by software. This makes 'rebuilding global cache' and 'updating dev_gen' be observed before local_cache starts to be updated by other agents. Thus, wmb after 'updating dev_gen' can be removed. 
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> Signed-off-by: Feifei Wang <feifei.wang2@arm.com> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> --- drivers/net/mlx4/mlx4_mr.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/drivers/net/mlx4/mlx4_mr.c b/drivers/net/mlx4/mlx4_mr.c index 6b2f0cf187..2274b5df19 100644 --- a/drivers/net/mlx4/mlx4_mr.c +++ b/drivers/net/mlx4/mlx4_mr.c @@ -948,18 +948,13 @@ mlx4_mr_mem_event_free_cb(struct rte_eth_dev *dev, const void *addr, size_t len) if (rebuild) { mr_rebuild_dev_cache(dev); /* - * Flush local caches by propagating invalidation across cores. - * rte_smp_wmb() is enough to synchronize this event. If one of - * freed memsegs is seen by other core, that means the memseg - * has been allocated by allocator, which will come after this - * free call. Therefore, this store instruction (incrementing - * generation below) will be guaranteed to be seen by other core - * before the core sees the newly allocated memory. + * No explicit wmb is needed after updating dev_gen due to + * store-release ordering in unlock that provides the + * implicit barrier at the software visible level. */ ++priv->mr.dev_gen; DEBUG("broadcasting local cache flush, gen=%d", priv->mr.dev_gen); - rte_smp_wmb(); } rte_rwlock_write_unlock(&priv->mr.rwlock); #ifdef RTE_LIBRTE_MLX4_DEBUG -- 2.25.1 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v3 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache 2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang @ 2021-05-18 12:13 ` Slava Ovsiienko 0 siblings, 0 replies; 36+ messages in thread From: Slava Ovsiienko @ 2021-05-18 12:13 UTC (permalink / raw) To: Feifei Wang, Matan Azrad, Shahaf Shuler; +Cc: dev, nd, Ruifeng Wang > -----Original Message----- > From: dev <dev-bounces@dpdk.org> On Behalf Of Feifei Wang > Sent: Tuesday, May 18, 2021 11:51 > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler > <shahafs@nvidia.com> > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>; > Ruifeng Wang <ruifeng.wang@arm.com> > Subject: [dpdk-dev] [PATCH v3 1/2] net/mlx4: remove unnecessary wmb for > Memory Region cache > > 'dev_gen' is a variable to trigger all cores to flush their local caches once the > global MR cache has been rebuilt. > > This is due to MR cache's R/W lock can maintain synchronization between > threads: > > 1. dev_gen and global cache updating ordering inside the lock protected > section does not matter. Because other threads cannot take the lock until > global cache has been updated. Thus, in out of order platform, even if other > agents firstly observe updated dev_gen but global does not update, they still > have to wait the lock. As a result, it is unnecessary to add a wmb between > global cache rebuilding and updating the dev_gen to keep the memory store > order. > > 2. Store-Release of unlock provides the implicit wmb at the level visible by > software. This makes 'rebuilding global cache' and 'updating dev_gen' be > observed before local_cache starts to be updated by other agents. Thus, > wmb after 'updating dev_gen' can be removed. 
> > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com> Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com> ^ permalink raw reply [flat|nested] 36+ messages in thread
* [dpdk-dev] [PATCH v3 2/2] net/mlx5: remove unnecessary wmb for Memory Region cache 2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Feifei Wang 2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang @ 2021-05-18 8:50 ` Feifei Wang 2021-05-18 10:17 ` Slava Ovsiienko 2021-05-27 8:37 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Raslan Darawsheh 2 siblings, 1 reply; 36+ messages in thread From: Feifei Wang @ 2021-05-18 8:50 UTC (permalink / raw) To: Matan Azrad, Shahaf Shuler, Viacheslav Ovsiienko Cc: dev, nd, Feifei Wang, Ruifeng Wang 'dev_gen' is a variable to trigger all cores to flush their local caches once the global MR cache has been rebuilt. This is due to MR cache's R/W lock can maintain synchronization between threads: 1. dev_gen and global cache updating ordering inside the lock protected section does not matter. Because other threads cannot take the lock until global cache has been updated. Thus, in out of order platform, even if other agents firstly observe updated dev_gen but global does not update, they also have to wait the lock. As a result, it is unnecessary to add a wmb between global cache rebuilding and updating the dev_gen to keep the memory store order. 2. Store-Release of unlock provides the implicit wmb at the level visible by software. This makes 'rebuilding global cache' and 'updating dev_gen' be observed before local_cache starts to be updated by other agents. Thus, wmb after 'updating dev_gen' can be removed. 
Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/mlx5/mlx5_mr.c | 22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
index e791b6338d..0c5403e493 100644
--- a/drivers/net/mlx5/mlx5_mr.c
+++ b/drivers/net/mlx5/mlx5_mr.c
@@ -107,18 +107,13 @@ mlx5_mr_mem_event_free_cb(struct mlx5_dev_ctx_shared *sh,
 	if (rebuild) {
 		mlx5_mr_rebuild_cache(&sh->share_cache);
 		/*
-		 * Flush local caches by propagating invalidation across cores.
-		 * rte_smp_wmb() is enough to synchronize this event. If one of
-		 * freed memsegs is seen by other core, that means the memseg
-		 * has been allocated by allocator, which will come after this
-		 * free call. Therefore, this store instruction (incrementing
-		 * generation below) will be guaranteed to be seen by other core
-		 * before the core sees the newly allocated memory.
+		 * No explicit wmb is needed after updating dev_gen due to
+		 * store-release ordering in unlock that provides the
+		 * implicit barrier at the software visible level.
 		 */
 		++sh->share_cache.dev_gen;
 		DRV_LOG(DEBUG, "broadcasting local cache flush, gen=%d",
 			sh->share_cache.dev_gen);
-		rte_smp_wmb();
 	}
 	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
 }
@@ -411,18 +406,13 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
 		      (void *)mr);
 	mlx5_mr_rebuild_cache(&sh->share_cache);
 	/*
-	 * Flush local caches by propagating invalidation across cores.
-	 * rte_smp_wmb() is enough to synchronize this event. If one of
-	 * freed memsegs is seen by other core, that means the memseg
-	 * has been allocated by allocator, which will come after this
-	 * free call. Therefore, this store instruction (incrementing
-	 * generation below) will be guaranteed to be seen by other core
-	 * before the core sees the newly allocated memory.
+	 * No explicit wmb is needed after updating dev_gen due to
+	 * store-release ordering in unlock that provides the
+	 * implicit barrier at the software visible level.
 	 */
 	++sh->share_cache.dev_gen;
 	DRV_LOG(DEBUG, "broadcasting local cache flush, gen=%d",
 		sh->share_cache.dev_gen);
-	rte_smp_wmb();
 	rte_rwlock_read_unlock(&sh->share_cache.rwlock);
 	return 0;
 }
-- 
2.25.1

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v3 2/2] net/mlx5: remove unnecessary wmb for Memory Region cache
  2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 2/2] net/mlx5: " Feifei Wang
@ 2021-05-18 10:17   ` Slava Ovsiienko
  2021-05-19 1:54     ` [dpdk-dev] 回复: " Feifei Wang
  0 siblings, 1 reply; 36+ messages in thread
From: Slava Ovsiienko @ 2021-05-18 10:17 UTC (permalink / raw)
To: Feifei Wang, Matan Azrad, Shahaf Shuler; +Cc: dev, nd, Ruifeng Wang

> -----Original Message-----
> From: Feifei Wang <feifei.wang2@arm.com>
> Sent: Tuesday, May 18, 2021 11:51
> To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>; Ruifeng Wang <ruifeng.wang@arm.com>
> Subject: [PATCH v3 2/2] net/mlx5: remove unnecessary wmb for Memory Region cache
>
> 'dev_gen' is a variable used to trigger all cores to flush their local caches once the global MR cache has been rebuilt.
>
> The explicit wmb can be removed because the MR cache's R/W lock already maintains synchronization between threads:
>
> 1. The ordering of the dev_gen update and the global cache update inside the lock-protected section does not matter: other threads cannot take the lock until the global cache has been updated, so even on an out-of-order platform an agent that observes the updated dev_gen before the global cache update still has to wait for the lock. As a result, no wmb is needed between rebuilding the global cache and updating dev_gen to keep the store order.
>
> 2. The store-release of the unlock provides an implicit wmb at the level visible by software. It guarantees that 'rebuilding the global cache' and 'updating dev_gen' are observed before other agents start updating their local caches, so the wmb after 'updating dev_gen' can be removed.
>
> Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

Thanks a lot for your patience and cooperation.

With best regards,
Slava

^ permalink raw reply	[flat|nested] 36+ messages in thread
* [dpdk-dev] 回复: [PATCH v3 2/2] net/mlx5: remove unnecessary wmb for Memory Region cache
  2021-05-18 10:17 ` Slava Ovsiienko
@ 2021-05-19 1:54   ` Feifei Wang
  0 siblings, 0 replies; 36+ messages in thread
From: Feifei Wang @ 2021-05-19 1:54 UTC (permalink / raw)
To: Slava Ovsiienko, Matan Azrad, Shahaf Shuler; +Cc: dev, nd, Ruifeng Wang, nd

Also, thanks for your patient communication and explanation. A wonderful discussion~

Best Regards
Feifei

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: May 18, 2021 18:18
> To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; Ruifeng Wang <Ruifeng.Wang@arm.com>
> Subject: RE: [PATCH v3 2/2] net/mlx5: remove unnecessary wmb for Memory Region cache
>
> > -----Original Message-----
> > From: Feifei Wang <feifei.wang2@arm.com>
> > Sent: Tuesday, May 18, 2021 11:51
> > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>
> > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>; Ruifeng Wang <ruifeng.wang@arm.com>
> > Subject: [PATCH v3 2/2] net/mlx5: remove unnecessary wmb for Memory Region cache
> >
> > 'dev_gen' is a variable used to trigger all cores to flush their local caches
> > once the global MR cache has been rebuilt.
> >
> > The explicit wmb can be removed because the MR cache's R/W lock already
> > maintains synchronization between threads:
> >
> > 1. The ordering of the dev_gen update and the global cache update inside
> > the lock-protected section does not matter: other threads cannot take the
> > lock until the global cache has been updated, so even on an out-of-order
> > platform an agent that observes the updated dev_gen before the global
> > cache update still has to wait for the lock. As a result, no wmb is needed
> > between rebuilding the global cache and updating dev_gen to keep the
> > store order.
> >
> > 2. The store-release of the unlock provides an implicit wmb at the level
> > visible by software. It guarantees that 'rebuilding the global cache' and
> > 'updating dev_gen' are observed before other agents start updating their
> > local caches, so the wmb after 'updating dev_gen' can be removed.
> >
> > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
>
> Thanks a lot for your patience and cooperation.
> With best regards,
> Slava

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx
  2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Feifei Wang
  2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang
  2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 2/2] net/mlx5: " Feifei Wang
@ 2021-05-27 8:37 ` Raslan Darawsheh
  2 siblings, 0 replies; 36+ messages in thread
From: Raslan Darawsheh @ 2021-05-27 8:37 UTC (permalink / raw)
To: Feifei Wang; +Cc: dev, nd

Hi,

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Feifei Wang
> Sent: Tuesday, May 18, 2021 11:51 AM
> Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>
> Subject: [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx
>
> For net/mlx4 and net/mlx5, remove unnecessary wmb for Memory Region cache.
>
> v2:
> 1. keep the order of dev_gen and global cache (Slava Ovsiienko)
> 2. remove the wmb at last instead of moving it forward
> 3. remove atomic_thread_fence patches
>
> v3:
> 1. commit message rewording (Slava Ovsiienko)
>
> Feifei Wang (2):
>   net/mlx4: remove unnecessary wmb for Memory Region cache
>   net/mlx5: remove unnecessary wmb for Memory Region cache
>
>  drivers/net/mlx4/mlx4_mr.c | 11 +++--------
>  drivers/net/mlx5/mlx5_mr.c | 22 ++++++----------------
>  2 files changed, 9 insertions(+), 24 deletions(-)
>
> --
> 2.25.1

Series applied to next-net-mlx,

Kindest regards,
Raslan Darawsheh

^ permalink raw reply	[flat|nested] 36+ messages in thread
end of thread, other threads:[~2021-05-27 8:37 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-18 7:18 [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Feifei Wang
2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 1/4] net/mlx4: fix rebuild bug for Memory Region cache Feifei Wang
2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 2/4] net/mlx4: replace SMP barrier with C11 barriers Feifei Wang
2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache Feifei Wang
2021-04-12 8:27 ` Slava Ovsiienko
2021-04-13 5:20 ` [dpdk-dev] 回复: " Feifei Wang
2021-04-19 18:50 ` [dpdk-dev] " Slava Ovsiienko
2021-04-20 5:53 ` [dpdk-dev] 回复: " Feifei Wang
2021-04-20 7:29 ` Feifei Wang
2021-04-20 7:53 ` [dpdk-dev] " Slava Ovsiienko
2021-04-20 8:42 ` [dpdk-dev] 回复: " Feifei Wang
2021-05-06 2:52 ` Feifei Wang
2021-05-06 11:21 ` [dpdk-dev] " Slava Ovsiienko
2021-05-07 6:36 ` [dpdk-dev] 回复: " Feifei Wang
2021-05-07 10:14 ` [dpdk-dev] " Slava Ovsiienko
2021-05-08 3:13 ` [dpdk-dev] 回复: " Feifei Wang
2021-05-11 8:18 ` [dpdk-dev] " Slava Ovsiienko
2021-05-12 5:34 ` [dpdk-dev] 回复: " Feifei Wang
2021-05-12 11:07 ` [dpdk-dev] " Slava Ovsiienko
2021-05-13 5:49 ` [dpdk-dev] 回复: " Feifei Wang
2021-05-13 10:49 ` [dpdk-dev] " Slava Ovsiienko
2021-05-14 5:18 ` [dpdk-dev] 回复: " Feifei Wang
2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 4/4] net/mlx5: replace SMP barriers with C11 barriers Feifei Wang
2021-04-07 1:45 ` [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Alexander Kozyrev
2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 0/2] remove wmb " Feifei Wang
2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang
2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: " Feifei Wang
2021-05-17 14:15 ` Slava Ovsiienko
2021-05-18 8:52 ` [dpdk-dev] 回复: " Feifei Wang
2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Feifei Wang
2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang
2021-05-18 12:13 ` Slava Ovsiienko
2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 2/2] net/mlx5: " Feifei Wang
2021-05-18 10:17 ` Slava Ovsiienko
2021-05-19 1:54 ` [dpdk-dev] 回复: " Feifei Wang
2021-05-27 8:37 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Raslan Darawsheh