Subject: Re: [dpdk-dev] dev Digest, Vol 288, Issue 27
From: Zhang, XiX @ 2020-03-10  2:55 UTC
  To: dev

[PATCH] vfio: map contiguous areas in one go
Tested-by: xix.zhang@intel.com
Thanks

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of dev-request@dpdk.org
> Sent: February 25, 2020 21:50
> To: dev@dpdk.org
> Subject: dev Digest, Vol 288, Issue 27
> 
> Send dev mailing list submissions to
> 	dev@dpdk.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	https://mails.dpdk.org/listinfo/dev
> or, via email, send a message with subject or body 'help' to
> 	dev-request@dpdk.org
> 
> You can reach the person managing the list at
> 	dev-owner@dpdk.org
> 
> When replying, please edit your Subject line so it is more specific than "Re:
> Contents of dev digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: [PATCH] doc/mlx5: update mlx5 guide (Thomas Monjalon)
>    2. Re: [PATCH] doc: describe the pktmbuf pool with pinned
>       external memory (Thomas Monjalon)
>    3. [PATCH] vfio: map contiguous areas in one go (Anatoly Burakov)
>    4. Re: [RFC 0/6] New sync modes for ring (Ananyev, Konstantin)
>    5. Re: [PATCH] vfio: map contiguous areas in one go (Ray Kinsella)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Tue, 25 Feb 2020 14:19:10 +0100
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] doc/mlx5: update mlx5 guide
> Message-ID: <2125420.IFkqi6BYcA@xps>
> Content-Type: text/plain; charset="us-ascii"
> 
> 24/02/2020 18:57, Viacheslav Ovsiienko:
> > - metadata limitation is described
> > - no inline hint flag is described
> >
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> 
> Split into 2 patches and applied, thanks
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Tue, 25 Feb 2020 14:22:13 +0100
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> Cc: dev@dpdk.org, olivier.matz@6wind.com
> Subject: Re: [dpdk-dev] [PATCH] doc: describe the pktmbuf pool with
> 	pinned external memory
> Message-ID: <13189717.lVVuGzaMjS@xps>
> Content-Type: text/plain; charset="us-ascii"
> 
> 24/02/2020 18:58, Viacheslav Ovsiienko:
> > Document the new mbuf pools with external pinned buffers.
> >
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> >  doc/guides/prog_guide/mbuf_lib.rst     | 34 ++++++++++++++++++++++++++++++++++
> 
> Please submit this documentation separately for review by Olivier.
> 
> >  doc/guides/rel_notes/release_20_02.rst |  5 +++++
> 
> Applying only the release notes in 20.02, thanks
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Tue, 25 Feb 2020 13:24:48 +0000
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] vfio: map contiguous areas in one go
> Message-ID:
> 	<c6550c8e66dbc7a54a0b495ecda58b0eb79c07ca.1582637079.git.anatoly.burakov@intel.com>
> 
> 
> Currently, when we are creating DMA mappings for memory that's either 
> external or is backed by hugepages in IOVA as PA mode, we assume that 
> each page is necessarily discontiguous. This may not actually be the 
> case, especially for external memory, where the user is able to create 
> their own IOVA table and make it contiguous. This is a problem because 
> VFIO has a limited number of DMA mappings, and it does not appear to 
> concatenate them and treats each mapping as separate, even when they 
> cover adjacent areas.
> 
> Fix this so that we always map contiguous memory in a single chunk, as 
> opposed to mapping each segment separately.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_eal/linux/eal/eal_vfio.c | 59 +++++++++++++++++++++++++----
>  1 file changed, 51 insertions(+), 8 deletions(-)
> 
> diff --git a/lib/librte_eal/linux/eal/eal_vfio.c b/lib/librte_eal/linux/eal/eal_vfio.c
> index 01b5ef3f42..4502aefed3 100644
> --- a/lib/librte_eal/linux/eal/eal_vfio.c
> +++ b/lib/librte_eal/linux/eal/eal_vfio.c
> @@ -514,9 +514,11 @@ static void
>  vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>  		void *arg __rte_unused)
>  {
> +	rte_iova_t iova_start, iova_expected;
>  	struct rte_memseg_list *msl;
>  	struct rte_memseg *ms;
>  	size_t cur_len = 0;
> +	uint64_t va_start;
> 
>  	msl = rte_mem_virt2memseg_list(addr);
> 
> @@ -545,22 +547,63 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
>  #endif
>  	/* memsegs are contiguous in memory */
>  	ms = rte_mem_virt2memseg(addr, msl);
> +
> +	/*
> +	 * This memory is not guaranteed to be contiguous, but it still could
> +	 * be, or it could have some small contiguous chunks. Since the number
> +	 * of VFIO mappings is limited, and VFIO appears to not concatenate
> +	 * adjacent mappings, we have to do this ourselves.
> +	 *
> +	 * So, find contiguous chunks, then map them.
> +	 */
> +	va_start = ms->addr_64;
> +	iova_start = iova_expected = ms->iova;
>  	while (cur_len < len) {
> +		bool new_contig_area = ms->iova != iova_expected;
> +		bool last_seg = (len - cur_len) == ms->len;
> +		bool skip_last = false;
> +
> +		/* only do mappings when current contiguous area ends */
> +		if (new_contig_area) {
> +			if (type == RTE_MEM_EVENT_ALLOC)
> +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> +						iova_start,
> +						iova_expected - iova_start, 1);
> +			else
> +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> +						iova_start,
> +						iova_expected - iova_start, 0);
> +			va_start = ms->addr_64;
> +			iova_start = ms->iova;
> +		}
>  		/* some memory segments may have invalid IOVA */
>  		if (ms->iova == RTE_BAD_IOVA) {
>  			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
>  					ms->addr);
> -			goto next;
> +			skip_last = true;
>  		}
> -		if (type == RTE_MEM_EVENT_ALLOC)
> -			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> -					ms->iova, ms->len, 1);
> -		else
> -			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> -					ms->iova, ms->len, 0);
> -next:
> +		iova_expected = ms->iova + ms->len;
>  		cur_len += ms->len;
>  		++ms;
> +
> +		/*
> +		 * don't count previous segment, and don't attempt to
> +		 * dereference a potentially invalid pointer.
> +		 */
> +		if (skip_last && !last_seg) {
> +			iova_expected = iova_start = ms->iova;
> +			va_start = ms->addr_64;
> +		} else if (!skip_last && last_seg) {
> +			/* this is the last segment and we're not skipping */
> +			if (type == RTE_MEM_EVENT_ALLOC)
> +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> +						iova_start,
> +						iova_expected - iova_start, 1);
> +			else
> +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> +						iova_start,
> +						iova_expected - iova_start, 0);
> +		}
>  	}
>  #ifdef RTE_ARCH_PPC_64
>  	cur_len = 0;
> --
> 2.17.1
> 
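> As a standalone illustration of the coalescing logic above (a minimal 
> sketch; the types are hypothetical, map_chunk() stands in for the real 
> vfio_dma_mem_map(), and the RTE_BAD_IOVA handling is omitted):
> 
> 	#include <inttypes.h>
> 	#include <stdio.h>
> 
> 	struct seg { uint64_t va, iova, len; };
> 
> 	/* stand-in for vfio_dma_mem_map() */
> 	static void map_chunk(uint64_t va, uint64_t iova, uint64_t len)
> 	{
> 		printf("map va=0x%" PRIx64 " iova=0x%" PRIx64
> 				" len=0x%" PRIx64 "\n", va, iova, len);
> 	}
> 
> 	static void map_coalesced(const struct seg *s, int n)
> 	{
> 		uint64_t va_start = s[0].va;
> 		uint64_t iova_start = s[0].iova;
> 		uint64_t iova_expected = s[0].iova;
> 		int i;
> 
> 		for (i = 0; i < n; i++) {
> 			/* current contiguous run ends here: flush it */
> 			if (s[i].iova != iova_expected) {
> 				map_chunk(va_start, iova_start,
> 						iova_expected - iova_start);
> 				va_start = s[i].va;
> 				iova_start = s[i].iova;
> 			}
> 			iova_expected = s[i].iova + s[i].len;
> 		}
> 		/* flush the final run */
> 		map_chunk(va_start, iova_start, iova_expected - iova_start);
> 	}
> 
> 	int main(void)
> 	{
> 		/* two IOVA-contiguous pages, a gap, then one more page */
> 		struct seg segs[] = {
> 			{ 0x1000, 0x10000, 0x1000 },
> 			{ 0x2000, 0x11000, 0x1000 },
> 			{ 0x3000, 0x20000, 0x1000 },
> 		};
> 		map_coalesced(segs, 3);	/* prints 2 mappings, not 3 */
> 		return 0;
> 	}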
> 
> ------------------------------
> 
> Message: 4
> Date: Tue, 25 Feb 2020 13:41:14 +0000
> From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
> To: Stephen Hemminger <stephen@networkplumber.org>, Jerin Jacob
> 	<jerinjacobk@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>, Olivier Matz <olivier.matz@6wind.com>,
> 	"drc@linux.vnet.ibm.com" <drc@linux.vnet.ibm.com>
> Subject: Re: [dpdk-dev] [RFC 0/6] New sync modes for ring
> Message-ID:
> 	<SN6PR11MB255845FB6665C2A2D5D76D229AED0@SN6PR11MB2558.namprd11.prod.outlook.com>
> 
> Content-Type: text/plain; charset="us-ascii"
> 
> > > > > Upfront note: this RFC is not a complete patch.
> > > > > It introduces an ABI breakage, plus it doesn't update the 
> > > > > ring_elem code properly, etc.
> > > > > I plan to deal with all these things in later versions.
> > > > > Right now I am seeking initial feedback on the proposed ideas.
> > > > > I would also ask people to repeat the performance tests (see 
> > > > > below) on their platforms to confirm the impact.
> > > > >
> > > > > More and more customers use (or try to use) DPDK-based apps 
> > > > > within overcommitted systems (multiple active threads over 
> > > > > the same physical cores): VM, container deployments, etc.
> > > > > One quite common problem they hit: Lock-Holder-Preemption with 
> > > > > rte_ring.
> > > > > LHP is quite a common problem for spin-based sync primitives 
> > > > > (spin-locks, etc.) on overcommitted systems.
> > > > > The situation gets much worse when some sort of fair-locking 
> > > > > technique is used (ticket-lock, etc.), as then the scheduling 
> > > > > order of not only the lock owner but also the lock waiters 
> > > > > matters a lot.
> > > > > This is a well-known problem for kernel within VMs:
> > > > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > > > The problem with rte_ring is that while head acquisition is 
> > > > > sort of un-fair locking, waiting on the tail is very similar 
> > > > > to the ticket-lock schema - the tail has to be updated in a 
> > > > > particular order.
> > > > > That makes the current rte_ring implementation perform really 
> > > > > poorly in some overcommitted scenarios.
> > > >
> > > > Rather than reform rte_ring to fit this scenario, it would make 
> > > > more sense to me to introduce another primitive.
> 
> I don't see many advantages it would bring us.
> As disadvantages: for developers and maintainers, code duplication; 
> for end users, extra code churn and the loss of the ability to mix 
> and match different sync modes in one ring.
> 
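> For background, the "tail has to be updated in a particular order" 
> behaviour quoted above can be pictured roughly like this (a simplified 
> model of rte_ring's internal tail update, not the actual code):
> 
> 	#include <stdint.h>
> 
> 	/*
> 	 * Each enqueuer must wait until all earlier enqueuers have
> 	 * moved the tail, i.e. completion happens in head-acquisition
> 	 * order, like tickets being served in order. A preempted
> 	 * earlier enqueuer therefore stalls everyone behind it.
> 	 */
> 	static void
> 	update_tail(volatile uint32_t *tail, uint32_t old_head,
> 			uint32_t new_head)
> 	{
> 		while (*tail != old_head)
> 			;	/* spin; rte_pause() in the real code */
> 		*tail = new_head;
> 	}
> 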
> > > > The current lockless ring performs very well for the isolated 
> > > > thread model that DPDK was built around. This looks like a case 
> > > > of customers violating the usage model of the DPDK and then 
> > > > being surprised at the fallout.
> 
> For customers using the isolated thread model, nothing should change 
> (in terms of either API or performance).
> Existing sync modes MP/MC and SP/SC are kept untouched, set up in the 
> same way (via flags and _init_), and MP/MC remains the default.
> On the other hand, I don't see why we should ignore customers who 
> want to use their DPDK apps in different deployment scenarios.
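> 
> For reference, this is the existing setup path (a minimal sketch 
> assuming a standard DPDK environment; the ring names and sizes are 
> arbitrary):
> 
> 	#include <rte_lcore.h>
> 	#include <rte_ring.h>
> 
> 	static void
> 	ring_setup_example(void)
> 	{
> 		/* flags == 0: multi-producer/multi-consumer, the default */
> 		struct rte_ring *mpmc = rte_ring_create("r_mpmc", 1024,
> 				rte_socket_id(), 0);
> 		/* SP/SC is selected once, at creation time */
> 		struct rte_ring *spsc = rte_ring_create("r_spsc", 1024,
> 				rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);
> 		static int x = 42;
> 
> 		if (mpmc != NULL)
> 			(void)rte_ring_enqueue(mpmc, &x);	/* generic entry point */
> 		if (spsc != NULL)
> 			(void)rte_ring_sp_enqueue(spsc, &x);	/* explicit SP variant */
> 	}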
> 
> > >
> > > I agree with Stephen here.
> > >
> > > I think adding more runtime checks in the enqueue() and dequeue() 
> > > will have a bad effect on low-end cores too.
> 
> We do have a run-time check in our current enqueue()/dequeue() 
> implementation.
> In fact we support both modes: we have the generic 
> rte_ring_enqueue(/dequeue)_bulk(/burst), where sync behaviour is 
> determined at runtime by the value of prod(/cons).single.
> Or the user can call the rte_ring_(mp/sp)_enqueue_* functions directly.
> This RFC follows exactly the same paradigm:
> rte_ring_enqueue(/dequeue)_bulk(/burst) is kept generic, and its 
> behaviour is determined at runtime by the value of prod(/cons).sync_type.
> Or the user can call enqueue/dequeue with a particular sync mode 
> directly: rte_ring_(mp/sp/rts/hts)_enqueue_(bulk/burst)*.
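> 
> In other words, the generic entry point dispatches on sync_type 
> roughly like this (an illustrative sketch only; the enum, struct, and 
> function names are placeholders following the RFC's 
> prod(/cons).sync_type wording, not the finished API):
> 
> 	enum sync_type { SYNC_SP, SYNC_MP, SYNC_RTS, SYNC_HTS };
> 
> 	struct ring { struct { enum sync_type sync_type; } prod; };
> 
> 	/* stubs standing in for the per-mode implementations */
> 	static unsigned int sp_enq(struct ring *r, void *const *o,
> 			unsigned int n) { (void)r; (void)o; return n; }
> 	static unsigned int mp_enq(struct ring *r, void *const *o,
> 			unsigned int n) { (void)r; (void)o; return n; }
> 	static unsigned int rts_enq(struct ring *r, void *const *o,
> 			unsigned int n) { (void)r; (void)o; return n; }
> 	static unsigned int hts_enq(struct ring *r, void *const *o,
> 			unsigned int n) { (void)r; (void)o; return n; }
> 
> 	/* one generic entry point; the branch is on state fixed at _init_ */
> 	static unsigned int
> 	generic_enqueue(struct ring *r, void *const *objs, unsigned int n)
> 	{
> 		switch (r->prod.sync_type) {
> 		case SYNC_SP:	return sp_enq(r, objs, n);
> 		case SYNC_MP:	return mp_enq(r, objs, n);
> 		case SYNC_RTS:	return rts_enq(r, objs, n);
> 		case SYNC_HTS:	return hts_enq(r, objs, n);
> 		}
> 		return 0;
> 	}
> 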
> The only thing that changed: the format of prod/cons can now differ 
> depending on the mode selected at _init_.
> So you can't create a ring for, let's say, SP mode and then in the 
> middle of the data path change your mind and start using MP_RTS mode.
> For the existing modes (SP/MP, SC/MC) the format remains the same, and 
> the user can still use them interchangeably, though of course that is 
> an error-prone practice.
> 
> > > But I agree with the problem statement that in the virtualization 
> > > use case, it may be possible to have N virtual cores running on a 
> > > physical core.
> > >
> > > IMO, the best solution would be to keep the ring API the same and 
> > > have different flavors at compile time, something like liburcu did 
> > > to accommodate different flavors.
> > >
> > > i.e. urcu-qsbr.h and urcu-bp.h have identical definitions of the 
> > > API. The application can simply include ONE header file in a C 
> > > file based on the flavor.
> 
> I don't think that is a flexible enough approach.
> In one app a user might need several rings with different sync modes.
> Or a user might even need a ring with different sync modes for 
> enqueue and dequeue.
> 
> > > If both are needed at runtime, you need a function pointer or 
> > > similar in the application, and to define the function in a 
> > > different C file by including the appropriate flavor in that file.
> 
> A big issue with function pointers here would be the DPDK 
> multi-process model.
> AFAIK, rte_ring is quite a popular mechanism for IPC between DPDK apps.
> To support such a model, we'd need to split rte_ring data into 
> 'shared' and 'private' parts and initialize the private one for every 
> process that is going to use it.
> That sounds like a massive change, and I am not sure the required 
> effort would be worth it.
> BTW, if the user just calls API functions without trying to access 
> structure internals directly, I don't think it makes a big difference 
> to them what is inside:
> an indirect function call or an inlined switch(...) {}.
> 
> > This would also be a good time to consider the tradeoffs of the 
> > heavy use of inlining that is done in rte_ring vs the impact that 
> > has on API/ABI stability.
> 
> Yes, hiding the rte_ring implementation inside a .c file would help a 
> lot in terms of ABI maintenance and would make our future life easier.
> The question is what the price is in terms of performance, and whether 
> we are ready to pay it. Not to mention that it would cause changes in 
> many other libs/apps...
> So I think it should be the subject of a separate discussion.
> But I agree it would be good at least to measure the performance 
> impact of such a change.
> If I have some spare cycles, I will give it a try.
> Meanwhile, can I ask Jerin and the other guys to repeat the tests from 
> this RFC on their HW? Before continuing the discussion it would 
> probably be good to know whether the suggested patch works as expected 
> across different platforms.
> Thanks
> Konstantin
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Tue, 25 Feb 2020 13:49:38 +0000
> From: Ray Kinsella <mdr@ashroe.eu>
> To: Anatoly Burakov <anatoly.burakov@intel.com>, dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] vfio: map contiguous areas in one go
> Message-ID: <276c85ed-ac2f-52c9-dffc-18ce41ab0f35@ashroe.eu>
> Content-Type: text/plain; charset=utf-8
> 
> Hi Anatoly,
> 
> On 25/02/2020 13:24, Anatoly Burakov wrote:
> > Currently, when we are creating DMA mappings for memory that's 
> > either external or is backed by hugepages in IOVA as PA mode, we 
> > assume that each page is necessarily discontiguous. This may not 
> > actually be the case, especially for external memory, where the user 
> > is able to create their own IOVA table and make it contiguous. This 
> > is a problem because VFIO has a limited number of DMA mappings, and 
> > it does not appear to concatenate them and treats each mapping as 
> > separate, even when they cover adjacent areas.
> >
> > Fix this so that we always map contiguous memory in a single
> > chunk, as opposed to mapping each segment separately.
> 
> Can I confirm my understanding?
> 
> We are essentially correcting errant user behavior, trading off 
> startup/mapping time to save IOMMU resources?
> 
> >
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > ---
> >  lib/librte_eal/linux/eal/eal_vfio.c | 59 +++++++++++++++++++++++++----
> >  1 file changed, 51 insertions(+), 8 deletions(-)
> >
> > diff --git a/lib/librte_eal/linux/eal/eal_vfio.c b/lib/librte_eal/linux/eal/eal_vfio.c
> > index 01b5ef3f42..4502aefed3 100644
> > --- a/lib/librte_eal/linux/eal/eal_vfio.c
> > +++ b/lib/librte_eal/linux/eal/eal_vfio.c
> > @@ -514,9 +514,11 @@ static void
> >  vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
> >  		void *arg __rte_unused)
> >  {
> > +	rte_iova_t iova_start, iova_expected;
> >  	struct rte_memseg_list *msl;
> >  	struct rte_memseg *ms;
> >  	size_t cur_len = 0;
> > +	uint64_t va_start;
> >
> >  	msl = rte_mem_virt2memseg_list(addr);
> >
> > @@ -545,22 +547,63 @@ vfio_mem_event_callback(enum rte_mem_event type, const void *addr, size_t len,
> >  #endif
> >  	/* memsegs are contiguous in memory */
> >  	ms = rte_mem_virt2memseg(addr, msl);
> > +
> > +	/*
> > +	 * This memory is not guaranteed to be contiguous, but it still could
> > +	 * be, or it could have some small contiguous chunks. Since the number
> > +	 * of VFIO mappings is limited, and VFIO appears to not concatenate
> > +	 * adjacent mappings, we have to do this ourselves.
> > +	 *
> > +	 * So, find contiguous chunks, then map them.
> > +	 */
> > +	va_start = ms->addr_64;
> > +	iova_start = iova_expected = ms->iova;
> >  	while (cur_len < len) {
> > +		bool new_contig_area = ms->iova != iova_expected;
> > +		bool last_seg = (len - cur_len) == ms->len;
> > +		bool skip_last = false;
> > +
> > +		/* only do mappings when current contiguous area ends */
> > +		if (new_contig_area) {
> > +			if (type == RTE_MEM_EVENT_ALLOC)
> > +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> > +						iova_start,
> > +						iova_expected - iova_start, 1);
> > +			else
> > +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> > +						iova_start,
> > +						iova_expected - iova_start, 0);
> > +			va_start = ms->addr_64;
> > +			iova_start = ms->iova;
> > +		}
> >  		/* some memory segments may have invalid IOVA */
> >  		if (ms->iova == RTE_BAD_IOVA) {
> >  			RTE_LOG(DEBUG, EAL, "Memory segment at %p has bad IOVA, skipping\n",
> >  					ms->addr);
> > -			goto next;
> > +			skip_last = true;
> >  		}
> > -		if (type == RTE_MEM_EVENT_ALLOC)
> > -			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> > -					ms->iova, ms->len, 1);
> > -		else
> > -			vfio_dma_mem_map(default_vfio_cfg, ms->addr_64,
> > -					ms->iova, ms->len, 0);
> > -next:
> > +		iova_expected = ms->iova + ms->len;
> >  		cur_len += ms->len;
> >  		++ms;
> > +
> > +		/*
> > +		 * don't count previous segment, and don't attempt to
> > +		 * dereference a potentially invalid pointer.
> > +		 */
> > +		if (skip_last && !last_seg) {
> > +			iova_expected = iova_start = ms->iova;
> > +			va_start = ms->addr_64;
> > +		} else if (!skip_last && last_seg) {
> > +			/* this is the last segment and we're not skipping */
> > +			if (type == RTE_MEM_EVENT_ALLOC)
> > +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> > +						iova_start,
> > +						iova_expected - iova_start, 1);
> > +			else
> > +				vfio_dma_mem_map(default_vfio_cfg, va_start,
> > +						iova_start,
> > +						iova_expected - iova_start, 0);
> > +		}
> >  	}
> >  #ifdef RTE_ARCH_PPC_64
> >  	cur_len = 0;
> >
> 
> 
> End of dev Digest, Vol 288, Issue 27
> ************************************
