* Re: [dpdk-dev] dev Digest, Vol 288, Issue 27
[not found] ` <d8993cb75592406a993e3119d45e906a@intel.com>
@ 2020-03-10 2:55 ` Zhang, XiX
0 siblings, 0 replies; only message in thread
From: Zhang, XiX @ 2020-03-10 2:55 UTC (permalink / raw)
To: dev
[PATCH] vfio: map contiguous areas in one go
Tested-by: xix.zhang@intel.com
Thanks
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of dev-
> request@dpdk.org
> Sent: February 25, 2020 21:50
> To: dev@dpdk.org
> Subject: dev Digest, Vol 288, Issue 27
>
> Send dev mailing list submissions to
> dev@dpdk.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://mails.dpdk.org/listinfo/dev
> or, via email, send a message with subject or body 'help' to
> dev-request@dpdk.org
>
> You can reach the person managing the list at
> dev-owner@dpdk.org
>
> When replying, please edit your Subject line so it is more specific than "Re:
> Contents of dev digest..."
>
>
> Today's Topics:
>
> 1. Re: [PATCH] doc/mlx5: update mlx5 guide (Thomas Monjalon)
> 2. Re: [PATCH] doc: describe the pktmbuf pool with pinned
> external memory (Thomas Monjalon)
> 3. [PATCH] vfio: map contiguous areas in one go (Anatoly Burakov)
> 4. Re: [RFC 0/6] New sync modes for ring (Ananyev, Konstantin)
> 5. Re: [PATCH] vfio: map contiguous areas in one go (Ray Kinsella)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 25 Feb 2020 14:19:10 +0100
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] doc/mlx5: update mlx5 guide
> Message-ID: <2125420.IFkqi6BYcA@xps>
> Content-Type: text/plain; charset="us-ascii"
>
> 24/02/2020 18:57, Viacheslav Ovsiienko:
> > - metadata limitation is described
> > - no inline hint flag is described
> >
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
>
> Split in 2 patches and applied, thanks
>
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 25 Feb 2020 14:22:13 +0100
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> Cc: dev@dpdk.org, olivier.matz@6wind.com
> Subject: Re: [dpdk-dev] [PATCH] doc: describe the pktmbuf pool with
> pinned external memory
> Message-ID: <13189717.lVVuGzaMjS@xps>
> Content-Type: text/plain; charset="us-ascii"
>
> 24/02/2020 18:58, Viacheslav Ovsiienko:
> > Document the new mbuf pools with external pinned buffers.
> >
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> > doc/guides/prog_guide/mbuf_lib.rst | 34 ++++++++++++++++++++++++++++++++++
>
> Please submit this documentation separately for review by Olivier.
>
> > doc/guides/rel_notes/release_20_02.rst | 5 +++++
>
> Applying only the release notes in 20.02, thanks
>
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Tue, 25 Feb 2020 13:24:48 +0000
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] vfio: map contiguous areas in one go
> Message-ID: <c6550c8e66dbc7a54a0b495ecda58b0eb79c07ca.1582637079.git.anatoly.burakov@intel.com>
>
>
> Currently, when we are creating DMA mappings for memory that's either
> external or is backed by hugepages in IOVA as PA mode, we assume that
> each page is necessarily discontiguous. This may not actually be the
> case, especially for external memory, where the user is able to create
> their own IOVA table and make it contiguous. This is a problem because
> VFIO has a limited number of DMA mappings, and it does not appear to
> concatenate them and treats each mapping as separate, even when they
> cover adjacent areas.
>
> Fix this so that we always map contiguous memory in a single chunk, as
> opposed to mapping each segment separately.
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> lib/librte_eal/linux/eal/eal_vfio.c | 59 +++++++++++++++++++++++++----
> 1 file changed, 51 insertions(+), 8 deletions(-)
>
> diff --git a/lib/librte_eal/linux/eal/eal_vfio.c b/lib/librte_eal/linux/eal/eal_vfio.c
> index 01b5ef3f42..4502aefed3 100644
> --- a/lib/librte_eal/linux/eal/eal_vfio.c
> +++ b/lib/librte_eal/linux/eal/eal_vfio.c
> @@ -514,9 +514,11 @@ static void
>  vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>  		void *arg __rte_unused)
>  {
> +	rte_iova_t iova_start, iova_expected;
>  	struct rte_memseg_list *msl;
>  	struct rte_memseg *ms;
>  	size_t cur_len = 0;
> +	uint64_t va_start;
>
>  	msl = rte_mem_virt2memseg_list(addr);
>
> @@ -545,22 +547,63 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>  #endif
>  	/* memsegs are contiguous in memory */
>  	ms = rte_mem_virt2memseg(addr, msl);
> +
> +	/*
> +	 * This memory is not guaranteed to be contiguous, but it still could
> +	 * be, or it could have some small contiguous chunks. Since the number
> +	 * of VFIO mappings is limited, and VFIO appears to not concatenate
> +	 * adjacent mappings, we have to do this ourselves.
> +	 *
> +	 * So, find contiguous chunks, then map them.
> +	 */
> +	va_start = ms->addr_64;
> +	iova_start = iova_expected = ms->iova;
>  	while (cur_len < len) {
> +		bool new_contig_area = ms->iova != iova_expected;
> +		bool last_seg = (len - cur_len) == ms->len;
> +		bool skip_last = false;
> +
> +		/* only do mappings when current contiguous area ends */
> +		if (new_contig_area) {
> +			if (type == RTE_MEM_EVENT_ALLOC)
> +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> +						iova_start,
> +						iova_expected - iova_start, 1);
> +			else
> +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> +						iova_start,
> +						iova_expected - iova_start, 0);
> +			va_start = ms->addr_64;
> +			iova_start = ms->iova;
> +		}
>  		/* some memory segments may have invalid IOVA */
>  		if (ms->iova == RTE_BAD_IOVA) {
>  			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
>  					ms->addr);
> -			goto next;
> +			skip_last = true;
>  		}
> -		if (type == RTE_MEM_EVENT_ALLOC)
> -			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> -					ms->iova, ms->len, 1);
> -		else
> -			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> -					ms->iova, ms->len, 0);
> -next:
> +		iova_expected = ms->iova + ms->len;
>  		cur_len += ms->len;
>  		++ms;
> +
> +		/*
> +		 * don't count previous segment, and don't attempt to
> +		 * dereference a potentially invalid pointer.
> +		 */
> +		if (skip_last && !last_seg) {
> +			iova_expected = iova_start = ms->iova;
> +			va_start = ms->addr_64;
> +		} else if (!skip_last && last_seg) {
> +			/* this is the last segment and we're not skipping */
> +			if (type == RTE_MEM_EVENT_ALLOC)
> +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> +						iova_start,
> +						iova_expected - iova_start, 1);
> +			else
> +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> +						iova_start,
> +						iova_expected - iova_start, 0);
> +		}
>  	}
>  #ifdef RTE_ARCH_PPC_64
>  	cur_len = 0;
> --
> 2.17.1
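[Editorial note: the coalescing idea in the patch above can be illustrated with a small stand-alone sketch. The `struct seg` type and `count_merged_mappings()` helper below are hypothetical stand-ins for the rte_memseg walk, assuming segments arrive in VA order as in the callback; this is not the actual DPDK code.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hypothetical segment descriptor, mirroring rte_memseg's va/iova/len */
struct seg {
	uint64_t va;
	uint64_t iova;
	size_t len;
};

/*
 * Count how many DMA mappings are needed when adjacent IOVA ranges are
 * coalesced into a single mapping each. A new mapping starts only when
 * the current segment's IOVA does not continue the previous range --
 * the same "iova_expected" test the patch uses.
 */
static size_t
count_merged_mappings(const struct seg *segs, size_t n)
{
	uint64_t iova_expected = 0;
	size_t maps = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (i == 0 || segs[i].iova != iova_expected)
			maps++;
		iova_expected = segs[i].iova + segs[i].len;
	}
	return maps;
}
```

With three IOVA-contiguous 4K segments this yields one mapping instead of three, which is exactly the saving the patch targets given VFIO's limited mapping count.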
>
>
> ------------------------------
>
> Message: 4
> Date: Tue, 25 Feb 2020 13:41:14 +0000
> From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
> To: Stephen Hemminger <stephen@networkplumber.org>, Jerin Jacob
> <jerinjacobk@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>, Olivier Matz <olivier.matz@6wind.com>,
> "drc@linux.vnet.ibm.com" <drc@linux.vnet.ibm.com>
> Subject: Re: [dpdk-dev] [RFC 0/6] New sync modes for ring
> Message-ID: <SN6PR11MB255845FB6665C2A2D5D76D229AED0@SN6PR11MB2558.namprd11.prod.outlook.com>
>
> Content-Type: text/plain; charset="us-ascii"
>
> > > > > Upfront note - that RFC is not a complete patch.
> > > > > It introduces an ABI breakage, plus it doesn't update
> > > > > ring_elem code properly, etc.
> > > > > I plan to deal with all these things in later versions.
> > > > > Right now I seek an initial feedback about proposed ideas.
> > > > > Would also ask people to repeat performance tests (see below)
> > > > > on their platforms to confirm the impact.
> > > > >
> > > > > More and more customers use (or try to use) DPDK-based apps
> > > > > within overcommitted systems (multiple active threads over the
> > > > > same physical cores): VM, container deployments, etc.
> > > > > One quite common problem they hit: Lock-Holder-Preemption with
> > > > > rte_ring.
> > > > > LHP is quite a common problem for spin-based sync primitives
> > > > > (spin-locks, etc.) on overcommitted systems.
> > > > > The situation gets much worse when some sort of fair-locking
> > > > > technique is used (ticket-lock, etc.).
> > > > > As now not only lock-owner but also lock-waiters scheduling
> > > > > order matters a lot.
> > > > > This is a well-known problem for kernel within VMs:
> > > > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > > > The problem with rte_ring is that while head acquisition is
> > > > > sort of unfair locking, waiting on the tail is very similar to
> > > > > a ticket-lock schema - the tail has to be updated in a
> > > > > particular order.
> > > > > That makes the current rte_ring implementation perform really
> > > > > poorly in some overcommitted scenarios.
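[Editorial note: the ticket-lock analogy can be made concrete with a minimal sketch. This is illustration only, not rte_ring or DPDK code: waiters are served strictly in ticket order, so if the thread holding the next ticket is preempted, every later waiter spins fruitlessly, which is the Lock-Holder/Lock-Waiter-Preemption sensitivity described above.]

```c
#include <assert.h>
#include <stdatomic.h>

/* minimal ticket lock: strictly FIFO, hence preemption-sensitive */
struct ticket_lock {
	atomic_uint next;    /* next ticket to hand out */
	atomic_uint serving; /* ticket currently allowed to enter */
};

static void
tl_lock(struct ticket_lock *l)
{
	/* take a ticket; atomic_fetch_add returns the previous value */
	unsigned int me = atomic_fetch_add(&l->next, 1);

	/* spin until our ticket is served -- if the holder of an earlier
	 * ticket is descheduled, all later waiters burn cycles here */
	while (atomic_load(&l->serving) != me)
		;
}

static void
tl_unlock(struct ticket_lock *l)
{
	/* hand the lock to the next ticket in order */
	atomic_fetch_add(&l->serving, 1);
}
```

The rte_ring tail update behaves similarly: tails must advance in reservation order, so a preempted thread stalls everyone queued behind it.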
> > > >
> > > > Rather than reform rte_ring to fit this scenario, it would make
> > > > more sense to me to introduce another primitive.
>
> I don't see many advantages it would bring us.
> As disadvantages: for developers and maintainers - code duplication;
> for end users - extra code churn and the removed ability to mix and
> match different sync modes in one ring.
>
> > > > The current lockless ring performs very well for the isolated
> > > > thread model that DPDK was built around. This looks like a case
> > > > of customers violating the usage model of the DPDK and then
> > > > being surprised at the fallout.
>
> For customers using the isolated thread model, nothing should change
> (both in terms of API and performance).
> Existing sync modes MP/MC, SP/SC are kept untouched, set up in the
> same way (via flags and _init_), and MP/MC remains the default one.
> From the other side, I don't see why we should ignore customers that
> want to use their DPDK apps in different deployment scenarios.
>
> > >
> > > I agree with Stephen here.
> > >
> > > I think, adding more runtime check in the enqueue() and dequeue()
> > > will have a bad effect on the low-end cores too.
>
> We do have a run-time check in our current enqueue()/dequeue()
> implementation.
> In fact we support both modes: we have the generic
> rte_ring_enqueue(/dequeue)_bulk(/burst),
> where sync behaviour is determined at runtime by the value of
> prod(/cons).single.
> Or the user can call the rte_ring_(mp/sp)_enqueue_* functions directly.
> This RFC follows exactly the same paradigm:
> rte_ring_enqueue(/dequeue)_bulk(/burst) is kept generic and its
> behaviour is determined at runtime, by the value of prod(/cons).sync_type.
> Or the user can call enqueue/dequeue with a particular sync mode directly:
> rte_ring_(mp/sp/rts/hts)_enqueue_(bulk/burst)*.
> The only thing that changed:
> the format of prod/cons can now differ depending on the mode selected
> at _init_.
> So you can't create a ring for, let's say, SP mode and then in the
> middle of the data-path change your mind and start using MP_RTS mode.
> For the existing modes (SP/MP, SC/MC) the format remains the same and
> the user can still use them interchangeably, though of course that is
> an error-prone practice.
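[Editorial note: the runtime-dispatch pattern described above can be sketched as follows. The names (`sync_type`, the `RING_SYNC_*` values, the `toy_ring` type and stub handlers) follow the RFC's description but are hypothetical stand-ins, not the real rte_ring layout or API.]

```c
#include <assert.h>

/* sync mode recorded in the ring at _init_ time */
enum ring_sync_type {
	RING_SYNC_MT,     /* multi-thread (classic MP/MC) */
	RING_SYNC_ST,     /* single-thread (SP/SC) */
	RING_SYNC_MT_RTS, /* relaxed tail sync */
	RING_SYNC_MT_HTS  /* head/tail sync */
};

struct toy_ring {
	enum ring_sync_type sync_type;
	/* ... head/tail state omitted ... */
};

/* stub per-mode implementations; each would contain the real logic */
static unsigned int enq_mt(struct toy_ring *r, unsigned int n)  { (void)r; return n; }
static unsigned int enq_st(struct toy_ring *r, unsigned int n)  { (void)r; return n; }
static unsigned int enq_rts(struct toy_ring *r, unsigned int n) { (void)r; return n; }
static unsigned int enq_hts(struct toy_ring *r, unsigned int n) { (void)r; return n; }

/* generic entry point: behaviour determined at runtime by sync_type,
 * via an inlined switch rather than an indirect function call */
static unsigned int
ring_enqueue(struct toy_ring *r, unsigned int n)
{
	switch (r->sync_type) {
	case RING_SYNC_MT:     return enq_mt(r, n);
	case RING_SYNC_ST:     return enq_st(r, n);
	case RING_SYNC_MT_RTS: return enq_rts(r, n);
	case RING_SYNC_MT_HTS: return enq_hts(r, n);
	}
	return 0;
}
```

Because the mode lives in shared ring data rather than in per-process function pointers, this shape also survives the DPDK multi-process model mentioned below.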
>
> > > But I agree with the problem statement that in the virtualization
> > > use case, It may be possible to have N virtual cores runs on a
> > > physical core.
> > >
> > > IMO, The best solution would be keeping the ring API same and have
> > > a different flavor in "compile-time". Something like liburcu did
> > > for accommodating different flavors.
> > >
> > > i.e. urcu-qsbr.h and urcu-bp.h will have identical definitions of
> > > the API.
> > > The application can simply include ONE header file in a C file
> > > based on the flavor.
>
> I don't think it is a flexible enough approach.
> In one app user might need to have several rings with different sync modes.
> Or even user might need a ring with different sync modes for
> enqueue/dequeue.
>
> > > If both are needed at runtime, have a function pointer or similar
> > > in the application and define the function in a different C file
> > > by including the appropriate flavor in that C file.
>
> Big issue with function pointers here would be DPDK MP model.
> AFAIK, rte_ring is quite popular mechanism for IPC between DPDK apps.
> To support such model, we'll need to split rte_ring data into 'shared'
> and 'private' and initialize private one for every process that is going to use it.
> That sounds like a massive change, and I am not sure the required
> effort will be worth it.
> BTW, if the user just calls API functions without trying to access
> structure internals directly, I don't think it would make a big
> difference to them what is inside:
> an indirect function call or an inlined switch(...) {}.
>
> > This would also be a good time to consider the tradeoffs of the
> > heavy use of inlining that is done in rte_ring vs the impact that
> > has on API/ABI stability.
>
> Yes, hiding rte_ring implementation inside .c would help a lot in
> terms of ABI maintenance and would make our future life easier.
> The question is what is the price for it in terms of performance, and
> are we ready to pay it. Not to mention that it would cause changes in
> many other libs/apps...
> So I think it should be a subject for a separate discussion.
> But, agree it would be good at least to measure the performance impact
> of such change.
> If I'll have some spare cycles, will give it a try.
> Meanwhile, can I ask Jerin and the other guys to repeat the tests from
> this RFC on their HW? Before continuing the discussion it would
> probably be good to know whether the suggested patch works as expected
> across different platforms.
> Thanks
> Konstantin
>
>
> ------------------------------
>
> Message: 5
> Date: Tue, 25 Feb 2020 13:49:38 +0000
> From: Ray Kinsella <mdr@ashroe.eu>
> To: Anatoly Burakov <anatoly.burakov@intel.com>, dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] vfio: map contiguous areas in one go
> Message-ID: <276c85ed-ac2f-52c9-dffc-18ce41ab0f35@ashroe.eu>
> Content-Type: text/plain; charset=utf-8
>
> Hi Anatoly,
>
> On 25/02/2020 13:24, Anatoly Burakov wrote:
> > Currently, when we are creating DMA mappings for memory that's
> > either external or is backed by hugepages in IOVA as PA mode, we
> > assume that each page is necessarily discontiguous. This may not
> > actually be the case, especially for external memory, where the user
> > is able to create their own IOVA table and make it contiguous. This
> > is a problem because VFIO has a limited number of DMA mappings, and
> > it does not appear to concatenate them and treats each mapping as
> > separate, even when they cover adjacent areas.
> >
> > Fix this so that we always map contiguous memory in a single
> > chunk, as opposed to mapping each segment separately.
>
> Can I confirm my understanding?
>
> We are essentially correcting errant user behavior, trading off
> startup/mapping time to save IOMMU resources?
>
> >
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > ---
> > [diff snipped; identical to the patch in Message 3 above]
>
>
> End of dev Digest, Vol 288, Issue 27
> ************************************