From: Feifei Wang <Feifei.Wang2@arm.com>
To: Slava Ovsiienko <viacheslavo@nvidia.com>,
Matan Azrad <matan@nvidia.com>,
Shahaf Shuler <shahafs@nvidia.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, nd <nd@arm.com>,
"stable@dpdk.org" <stable@dpdk.org>,
Ruifeng Wang <Ruifeng.Wang@arm.com>, nd <nd@arm.com>,
nd <nd@arm.com>
Subject: [dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
Date: Tue, 20 Apr 2021 05:53:33 +0000
Message-ID: <DBBPR08MB44112C4F2F22A0178A6BE885C8489@DBBPR08MB4411.eurprd08.prod.outlook.com>
In-Reply-To: <DM6PR12MB3753AF31F27A35F0BA3C14D0DF499@DM6PR12MB3753.namprd12.prod.outlook.com>
Hi, Slava
Thanks very much for your explanation.
I understand that the app can wait until all mbufs are returned to the memory pool
and only then free them; I agree with this.
As a result, I will remove the bug-fix patch from this series and just replace the SMP barrier
with a C11 thread fence (sketched below). Thanks again for your patient explanation.
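
For reference, a minimal sketch of the kind of replacement meant here
(illustrative only, not the exact patch hunk; rte_smp_wmb() is DPDK's
legacy store-store barrier, and the C11 form uses a GCC atomic fence):

    /* Before: legacy DPDK SMP write barrier. */
    ++sh->share_cache.dev_gen;
    rte_smp_wmb();

    /* After: C11 release fence; orders the dev_gen store before any
     * subsequent stores, such as those rebuilding the global cache.
     */
    ++sh->share_cache.dev_gen;
    __atomic_thread_fence(__ATOMIC_RELEASE);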
Best Regards
Feifei
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: April 20, 2021 2:51
> To: Feifei Wang <Feifei.Wang2@arm.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>
> Cc: dev@dpdk.org; nd <nd@arm.com>; stable@dpdk.org; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
>
> Hi, Feifei
>
> Please, see below
>
> ....
>
> > > Hi, Feifei
> > >
> > > Sorry, I do not follow what this patch fixes. Do we have some
> > > issue/bug with MR cache in practice?
> >
> > This patch fixes a bug found by logical deduction; it has not actually
> > been observed in practice.
> >
> > >
> > > Each Tx queue has its own dedicated "local" cache of MRs to convert
> > > buffer addresses in mbufs being transmitted to LKeys (HW-related
> > > entity handles), and there is a "global" cache of all MRs registered
> > > on the device.
> > >
> > > AFAIK, how conversion happens in the datapath (see the sketch after this list):
> > > - check the local queue cache flush request
> > > - lookup in local cache
> > > - if not found:
> > > - acquire lock for global cache read access
> > > - lookup in global cache
> > > - release lock for global cache
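> > >
> > > As a rough sketch in C (mr_local_lookup(), mr_global_lookup() and
> > > mr_local_cache_flush() are hypothetical helpers for illustration,
> > > not the actual mlx5 functions):
> > >
> > >     /* Datapath address-to-LKey translation, sketched. */
> > >     static uint32_t
> > >     mr_addr2lkey(struct mr_local_cache *ctrl,
> > >                  struct mr_global_cache *global, uintptr_t addr)
> > >     {
> > >             uint32_t lkey;
> > >
> > >             /* Check the local queue cache flush request. */
> > >             if (unlikely(ctrl->cur_gen != global->dev_gen))
> > >                     mr_local_cache_flush(ctrl);
> > >             /* Lock-free lookup in the local cache. */
> > >             lkey = mr_local_lookup(ctrl, addr);
> > >             if (lkey != UINT32_MAX)
> > >                     return lkey;
> > >             /* Miss: take the global cache read lock and look up. */
> > >             rte_rwlock_read_lock(&global->rwlock);
> > >             lkey = mr_global_lookup(global, addr);
> > >             rte_rwlock_read_unlock(&global->rwlock);
> > >             return lkey;
> > >     }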
> > >
> > > How cache update on memory freeing/unregistering happens:
> > > - acquire lock for global cache write access
> > > - [a] remove relevant MRs from the global cache
> > > - [b] set local caches flush request
> > > - free global cache lock
> > >
> > > If I understand correctly, your patch swaps [a] and [b], so the local
> > > caches flush is requested earlier. What problem does it solve?
> > > The datapath is not supposed to contain any mbufs referencing memory
> > > being freed. The application must ensure this and must not allocate
> > > new mbufs from the memory regions being freed. Hence, lookups for
> > > these MRs in the caches should not occur.
> >
> > On your first point, that the application can take charge of preventing
> > freed MR memory from being allocated in the data path: does this mean
> > that if there is an urgent MR event, such as hotplug, the application
> > must inform the data path in advance so this memory will not be
> > allocated, and only then will the control path free it? If the
> > application can work like this, I agree that this bug cannot happen.
>
> Actually, this is the only correct way for the application to operate.
> Let's suppose we have some memory area that the application wants to free.
> ALL references to this area must be removed. If we have some mbufs
> allocated from this area, it means that a memory pool was created there.
>
> What the application should do:
> - notify all its components/agents that the memory area is going to be freed
> - all components/agents free the mbufs they might own
> - the PMD might not support freeing some mbufs (for example, ones being
> sent and awaiting completion), so the app should just wait
> - wait till all mbufs are returned to the memory pool (by monitoring
> available obj == pool size), as sketched below
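>
> A minimal sketch of that last wait step, assuming the standard
> rte_mempool API (the polling loop itself is just an illustration):
>
>     #include <rte_mempool.h>
>     #include <rte_pause.h>
>
>     /* Block until every object (mbuf) has been returned to the pool,
>      * i.e. the available count equals the pool size; only then is it
>      * safe to free the backing memory.
>      */
>     static void
>     wait_pool_drained(const struct rte_mempool *mp)
>     {
>             while (rte_mempool_avail_count(mp) < mp->size)
>                     rte_pause();
>     }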
>
> Otherwise it is dangerous to free the memory. If some mbufs are still
> allocated, that holds regardless of buf address to MR translation. We
> just can't free the memory - the mapping will be destroyed and might
> cause a segmentation fault in SW or some HW issues on DMA access to
> unmapped memory. It is a very generic safety approach - do not free
> memory that is still in use. Hence, at the moment of freeing and
> unregistering the MR, there MUST BE NO mbufs in flight referencing the
> addresses being freed. No translation to an MR being invalidated can happen.
>
> >
> > > On the other side, the cache flush has a negative effect - the local
> > > cache becomes empty and can't provide translations for other valid
> > > (not being removed) MRs, so translation has to look them up in the
> > > global cache, which is locked now for rebuilding. This causes delays
> > > in the datapath on acquiring the global cache lock.
> > > So, I see some potential performance impact.
> >
> > If the above assumption is true, we can move to your second point. I
> > think this is a tradeoff between cache coherence and performance.
> >
> > I understand your point: even though the global cache has changed, we
> > should keep the valid MRs in the local cache as long as possible to
> > preserve fast lookup.
> > Meanwhile, the local cache can be rebuilt later to reduce the time
> > spent waiting to acquire the global cache lock.
> >
> > However, this mechanism only keeps performance unchanged for the
> > first few mbufs.
> > For subsequent lkey lookups after 'dev_gen' is updated, it is still
> > necessary to update the local cache, so performance first drops and
> > then recovers. Thus, with or without this patch, the performance will
> > jitter for a certain period of time.
>
> The local cache should be updated to remove MRs that are no longer
> valid, but we just flush the entire cache.
> Let's suppose we have valid MR0 and MR1, and an invalid MRX, in the
> local cache, and there is traffic in the datapath for MR0 and MR1 but
> no traffic for MRX anymore.
>
> 1) If we do as you propose:
> a) take the lock
> b) request a local cache flush first - MR0, MR1 and MRX will all be
> removed, forcing re-translation in the datapath
> c) update the global cache
> d) free the lock
> All traffic for the valid MR0 and MR1 will ALWAYS be blocked on the lock
> taken for the cache update, from point b) till point d).
>
> 2) If we do as it is implemented now:
> a) take the lock
> b) update the global cache
> c) request a local cache flush
> d) free the lock
> Traffic MIGHT be blocked ONLY for MRs missing from the local cache
> (this does not happen for MR0 and MR1, and must not happen for MRX),
> so the probability should be minor. And blocking can only happen from
> c) till d) - quite a short period of time.
>
> In summary, the difference between 1) and 2) (sketched in code below):
>
> Lock probability:
> - 1) a lock ALWAYS happens for ANY MR translation after b);
> 2) a lock MIGHT happen, for a cache miss ONLY, after c).
>
> Lock duration:
> - 1) locked from b) till d);
> 2) locked from c) till d), which seems to be much shorter.
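>
> In code form, the two orderings (condensed from the patch hunks quoted
> below; sketch only):
>
>     /* 1) proposed: flush request first - any datapath miss hits the
>      * write-locked global cache for the whole rebuild, b) .. d).
>      */
>     rte_rwlock_write_lock(&sh->share_cache.rwlock);    /* a) */
>     ++sh->share_cache.dev_gen;                         /* b) */
>     rte_smp_wmb();
>     mlx5_mr_rebuild_cache(&sh->share_cache);           /* c) */
>     rte_rwlock_write_unlock(&sh->share_cache.rwlock);  /* d) */
>
>     /* 2) current: rebuild first - only lookups that already miss the
>      * local cache can block, and only during c) .. d).
>      */
>     rte_rwlock_write_lock(&sh->share_cache.rwlock);    /* a) */
>     mlx5_mr_rebuild_cache(&sh->share_cache);           /* b) */
>     ++sh->share_cache.dev_gen;                         /* c) */
>     rte_smp_wmb();
>     rte_rwlock_write_unlock(&sh->share_cache.rwlock);  /* d) */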
>
> >
> > In conclusion, I tend to think that the lower layer can do more to
> > ensure correct execution of the program. This may hurt performance in
> > the short term, but in the long run performance eventually comes back.
> > Furthermore, maybe we should focus on performance in the steady state,
> > and do our best to ensure the correctness of the program in corner
> > cases.
>
> If some mbufs are still allocated in memory being freed, there is
> nothing to say about correctness - it is simply incorrect. In my
> opinion, we should not think about how to mitigate this incorrect
> behavior, and we should not encourage application developers to follow
> wrong approaches.
>
> With best regards,
> Slava
>
> >
> > Best Regards
> > Feifei
> >
> > > With best regards,
> > > Slava
> > >
> > > > -----Original Message-----
> > > > From: Feifei Wang <feifei.wang2@arm.com>
> > > > Sent: Thursday, March 18, 2021 9:19
> > > > To: Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> > > > <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>;
> > > > Yongseok Koh <yskoh@mellanox.com>
> > > > Cc: dev@dpdk.org; nd@arm.com; Feifei Wang <feifei.wang2@arm.com>;
> > > > stable@dpdk.org; Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > > > Region cache
> > > >
> > > > 'dev_gen' is a variable to inform other cores to flush their local
> > > > cache when global cache is rebuilt.
> > > >
> > > > However, if 'dev_gen' is updated after the global cache is rebuilt,
> > > > other cores may load a wrong memory region lkey value from the old
> > > > local cache.
> > > >
> > > > Timeslot        main core               worker core
> > > >  1              rebuild global cache
> > > >  2                                      load unchanged dev_gen
> > > >  3              update dev_gen
> > > >  4                                      look up old local cache
> > > >
> > > > From the example above, we can see that although the global cache is
> > > > rebuilt, because dev_gen is not yet updated, the worker core may look
> > > > up the old cache table and receive a wrong memory region lkey value.
> > > >
> > > > To fix this, updating 'dev_gen' should be moved before rebuilding the
> > > > global cache, to inform worker cores to flush their local cache when
> > > > the global cache starts rebuilding. And a wmb can ensure the ordering
> > > > of this process.
> > > >
> > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region support")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Suggested-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > Signed-off-by: Feifei Wang <feifei.wang2@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > ---
> > > >  drivers/net/mlx5/mlx5_mr.c | 37 +++++++++++++++++--------------------
> > > >  1 file changed, 17 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
> > > > index da4e91fc2..7ce1d3e64 100644
> > > > --- a/drivers/net/mlx5/mlx5_mr.c
> > > > +++ b/drivers/net/mlx5/mlx5_mr.c
> > > > @@ -103,20 +103,18 @@ mlx5_mr_mem_event_free_cb(struct mlx5_dev_ctx_shared *sh,
> > > > rebuild = 1;
> > > > }
> > > > if (rebuild) {
> > > > - mlx5_mr_rebuild_cache(&sh->share_cache);
> > > > + ++sh->share_cache.dev_gen;
> > > > + DEBUG("broadcasting local cache flush, gen=%d",
> > > > + sh->share_cache.dev_gen);
> > > > +
> > > > /*
> > > > * Flush local caches by propagating invalidation across cores.
> > > > - * rte_smp_wmb() is enough to synchronize this event. If one of
> > > > - * freed memsegs is seen by other core, that means the memseg
> > > > - * has been allocated by allocator, which will come after this
> > > > - * free call. Therefore, this store instruction (incrementing
> > > > - * generation below) will be guaranteed to be seen by other core
> > > > - * before the core sees the newly allocated memory.
> > > > + * rte_smp_wmb() is to keep the order that dev_gen updated before
> > > > + * rebuilding global cache. Therefore, other core can flush their
> > > > + * local cache on time.
> > > > */
> > > > - ++sh->share_cache.dev_gen;
> > > > - DEBUG("broadcasting local cache flush, gen=%d",
> > > > - sh->share_cache.dev_gen);
> > > > rte_smp_wmb();
> > > > + mlx5_mr_rebuild_cache(&sh->share_cache);
> > > > }
> > > > rte_rwlock_write_unlock(&sh->share_cache.rwlock);
> > > > }
> > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void *addr,
> > > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb);
> > > > DEBUG("port %u remove MR(%p) from list", dev->data->port_id,
> > > > (void *)mr);
> > > > - mlx5_mr_rebuild_cache(&sh->share_cache);
> > > > +
> > > > + ++sh->share_cache.dev_gen;
> > > > + DEBUG("broadcasting local cache flush, gen=%d",
> > > > + sh->share_cache.dev_gen);
> > > > +
> > > > /*
> > > > * Flush local caches by propagating invalidation across cores.
> > > > - * rte_smp_wmb() is enough to synchronize this event. If one of
> > > > - * freed memsegs is seen by other core, that means the memseg
> > > > - * has been allocated by allocator, which will come after this
> > > > - * free call. Therefore, this store instruction (incrementing
> > > > - * generation below) will be guaranteed to be seen by other core
> > > > - * before the core sees the newly allocated memory.
> > > > + * rte_smp_wmb() is to keep the order that dev_gen updated before
> > > > + * rebuilding global cache. Therefore, other core can flush their
> > > > + * local cache on time.
> > > > */
> > > > - ++sh->share_cache.dev_gen;
> > > > - DEBUG("broadcasting local cache flush, gen=%d",
> > > > - sh->share_cache.dev_gen);
> > > > rte_smp_wmb();
> > > > + mlx5_mr_rebuild_cache(&sh->share_cache);
> > > > rte_rwlock_read_unlock(&sh->share_cache.rwlock);
> > > > return 0;
> > > > }
> > > > --
> > > > 2.25.1
Thread overview: 36+ messages
2021-03-18 7:18 [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Feifei Wang
2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 1/4] net/mlx4: fix rebuild bug for Memory Region cache Feifei Wang
2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 2/4] net/mlx4: replace SMP barrier with C11 barriers Feifei Wang
2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache Feifei Wang
2021-04-12 8:27 ` Slava Ovsiienko
2021-04-13 5:20 ` [dpdk-dev] Re: " Feifei Wang
2021-04-19 18:50 ` [dpdk-dev] " Slava Ovsiienko
2021-04-20 5:53 ` Feifei Wang [this message]
2021-04-20 7:29 ` [dpdk-dev] Re: " Feifei Wang
2021-04-20 7:53 ` [dpdk-dev] " Slava Ovsiienko
2021-04-20 8:42 ` [dpdk-dev] Re: " Feifei Wang
2021-05-06 2:52 ` Feifei Wang
2021-05-06 11:21 ` [dpdk-dev] " Slava Ovsiienko
2021-05-07 6:36 ` [dpdk-dev] Re: " Feifei Wang
2021-05-07 10:14 ` [dpdk-dev] " Slava Ovsiienko
2021-05-08 3:13 ` [dpdk-dev] Re: " Feifei Wang
2021-05-11 8:18 ` [dpdk-dev] " Slava Ovsiienko
2021-05-12 5:34 ` [dpdk-dev] Re: " Feifei Wang
2021-05-12 11:07 ` [dpdk-dev] " Slava Ovsiienko
2021-05-13 5:49 ` [dpdk-dev] Re: " Feifei Wang
2021-05-13 10:49 ` [dpdk-dev] " Slava Ovsiienko
2021-05-14 5:18 ` [dpdk-dev] Re: " Feifei Wang
2021-03-18 7:18 ` [dpdk-dev] [PATCH v1 4/4] net/mlx5: replace SMP barriers with C11 barriers Feifei Wang
2021-04-07 1:45 ` [dpdk-dev] [PATCH v1 0/4] refactor SMP barriers for net/mlx Alexander Kozyrev
2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 0/2] remove wmb " Feifei Wang
2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang
2021-05-17 10:00 ` [dpdk-dev] [PATCH v2 2/2] net/mlx5: " Feifei Wang
2021-05-17 14:15 ` Slava Ovsiienko
2021-05-18 8:52 ` [dpdk-dev] Re: " Feifei Wang
2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Feifei Wang
2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 1/2] net/mlx4: remove unnecessary wmb for Memory Region cache Feifei Wang
2021-05-18 12:13 ` Slava Ovsiienko
2021-05-18 8:50 ` [dpdk-dev] [PATCH v3 2/2] net/mlx5: " Feifei Wang
2021-05-18 10:17 ` Slava Ovsiienko
2021-05-19 1:54 ` [dpdk-dev] Re: " Feifei Wang
2021-05-27 8:37 ` [dpdk-dev] [PATCH v3 0/2] remove wmb for net/mlx Raslan Darawsheh