From: "Varghese, Vipin" <vipin.varghese@intel.com>
To: "Harris, James R" <james.r.harris@intel.com>,
"Howell, Seth" <seth.howell@intel.com>,
"dev@dpdk.org" <dev@dpdk.org>
Subject: Re: [dpdk-dev] Aligned rte_mempool for storage applications
Date: Wed, 27 Mar 2019 02:33:04 +0000
Message-ID: <4C9E0AB70F954A408CC4ADDBF0F8FA7D4D3265CB@BGSMSX101.gar.corp.intel.com>
In-Reply-To: <06C0B6E8-83E6-4513-BF5C-7EDAB50D1E1E@intel.com>
Thanks, Jim, for the consideration.
I humbly suggested the ideas since we had a similar issue when creating the AF_XDP_ZC PMD. Happy to share more ideas.
Thanks
Vipin Varghese
> -----Original Message-----
> From: Harris, James R
> Sent: Wednesday, March 27, 2019 12:29 AM
> To: Howell, Seth <seth.howell@intel.com>; Varghese, Vipin
> <vipin.varghese@intel.com>; dev@dpdk.org
> Subject: Re: Aligned rte_mempool for storage applications
>
>
>
> On 3/26/19, 11:34 AM, "Howell, Seth" <seth.howell@intel.com> wrote:
>
> Hi Vipin,
>
> Thanks for your quick reply. I will respond to your queries in order.
> 1. Yes, in at least one case we have buffers of size 4096 bytes. Some of our
> other buffers are much larger (>64KiB).
> 2. These buffers are used in the I/O path, so performance is very important.
> Allocating and freeing a buffer each time we use it could be pretty costly.
>
> I think Vipin may have been suggesting allocating one (or multiple) very large
> buffers, and then splitting that buffer on 4KB boundaries in SPDK. If so, that
> would still require SPDK to develop its own mempool-like feature to hold those
> buffers. We'd really like to use the DPDK rte_mempool implementation rather
> than inventing our own.
>
> 3. Could you describe the idea of an indirect buffer in more detail? I don't think
> I quite understand that concept. I know we couldn't use mbufs because we often
> have buffers that are larger than 64k. I think there are more reasons we don't use
> the mbuf structure in our use case, but am not familiar with all of them. Maybe
> Jim can explain those in more detail.
>
> SPDK doesn't use rte_mbufs (except when absolutely required for things like
> DPDK cryptodev/compressdev). Most of that data structure is filled with network
> packet related fields that would never be used for storage. We could create our
> own very small data structure and do something similar to Vipin's indirect mbuf
> suggestion. And I think this is what Vipin was starting to allude to in query #2.
>
> It would be less optimal than a native aligned mempool because we'd be adding
> an extra pointer dereference on every get from the mempool - but probably only
> slightly less optimal. Seth - let's sync up offline and see if we can quickly collect
> some benchmarking data to measure the performance impact of this extra
> dereference.
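>
> (A hypothetical sketch of that small indirect structure, just to illustrate
> the extra dereference; desc_pool and the names here are made up:)
>
>     struct buf_desc {                   /* tiny descriptor kept in the mempool */
>             void *aligned_buf;          /* points at a separate 4K-aligned buffer */
>     };
>
>     struct buf_desc *d;
>     void *payload = NULL;
>     if (rte_mempool_get(desc_pool, (void **)&d) == 0)
>             payload = d->aligned_buf;   /* the extra pointer dereference */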
>
> Thanks Vipin - this definitely gives us an alternative direction to investigate that
> we hadn't considered.
>
> -Jim
>
>
>
> Thanks,
>
> Seth
> -----Original Message-----
> From: Varghese, Vipin
> Sent: Monday, March 25, 2019 7:53 PM
> To: Harris, James R <james.r.harris@intel.com>; Howell, Seth
> <seth.howell@intel.com>; dev@dpdk.org
> Subject: RE: Aligned rte_mempool for storage applications
>
> Hi Seth,
>
> If I may I would like to suggest and ask a query on the mempool alignment
> details. Please find my suggestion and query inline to the email.
>
> Snipped
> >
> > In SPDK, we use the rte_mempool struct for many internal structure
> > collections. The per-thread cache and ease of allocation of mempools
> > are very useful features.
> > Some of the collections we store in SPDK are pools of I/O buffers.
> > Typically, these pools contain elements of at least 4096 bytes, and we
> > would like them to be aligned to 4k for performance reasons.
> Query-1> Is the total memory requirement only 4096 bytes (the data portion)?
>
> >
> > [Jim] Just to clarify Seth's point - the performance reasons are
> > specifically to avoid wasteful memcopies. The vast majority of NVMe
> > SSDs in the market today do not have full scatter/gather support -
> > rather they only support something called PRP (Physical Region Pages)
> > which require all scatter gather elements except the first to be 4KB
> > aligned. There are other storage interfaces such as Linux AIO that also impose
> > alignment restrictions.
> >
> > -Jim
> >
> >
> > Currently, the rte_mempool API doesn't support aligned mempool
> > objects. This means that when we allocate a 4k buffer and want it
> > aligned to 4k, we actually need to allocate an 8k buffer and calculate
> > an offset into it each time we want to use it.
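> >
> > (A minimal sketch of that manual-alignment workaround, assuming an 8 KiB
> > element size; RTE_PTR_ALIGN_CEIL is from rte_common.h:)
> >
> >     void *raw, *buf = NULL;
> >     if (rte_mempool_get(mp, &raw) == 0)
> >             buf = RTE_PTR_ALIGN_CEIL(raw, 4096);  /* 4K-aligned window inside the 8K element */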
> Query-2> Why not create contiguous 4K-aligned memory with rte_malloc?
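>
> (For reference, a single 4K-aligned allocation with rte_malloc would look
> like this; the type string is illustrative:)
>
>     void *buf = rte_malloc("io_buf", 4096, 4096);   /* type, size, align */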
>
> > We recently did a proof of concept using the rte_mempool_ops hook
> > where we allocated a mempool and populated it with aligned entries.
> > This allowed us to retrieve aligned addresses directly from
> > rte_mempool_get(), but didn't help with the allocation size.
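> >
> > (A rough sketch in the spirit of that proof of concept: creating an empty
> > mempool and populating it from a 4K-aligned memzone, rather than through a
> > custom ops hook; names, counts, and error handling are illustrative:)
> >
> >     struct rte_mempool *mp = rte_mempool_create_empty("io_bufs", 1024, 4096,
> >                     256, 0, SOCKET_ID_ANY, MEMPOOL_F_NO_SPREAD);
> >     rte_mempool_set_ops_byname(mp, "ring_mp_mc", NULL);
> >     const struct rte_memzone *mz = rte_memzone_reserve_aligned("io_bufs_mz",
> >                     1024 * 4096, SOCKET_ID_ANY, 0, 4096);
> >     rte_mempool_populate_iova(mp, mz->addr, mz->iova, mz->len, NULL, NULL);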
> > Because the rte_mempool struct assumes that each element has a
> > header attached to it, we still need to live up to that assumption for
> > each object we create in a mempool. This means that the actual size of
> > a buffer becomes 4k + 24 bytes. In order to get to our next aligned
> > address, we need to add about 4k of padding to each element.
> > Modifying the current rte_mempool struct to allow entries without
> > headers seems impossible since it would break rte_mempool_obj_iter
> > and rte_mempool_from_obj. However, I still think there is a lot of
> > benefit to be gained from a mempool structure that supports aligned objects
> > without headers.
> > I am wondering if DPDK would be open to us introducing an
> > rte_mempool_aligned structure. This structure would essentially be a
> > wrapper around a regular mempool struct. However, it would not require
> > headers or trailers for each object in the pool.
> Query-3> Using a mempool with a 0-size data portion, we can either create an
> indirect mbuf or use the external mbuf feature to attach the mbuf to
> 4K-aligned rte_malloc areas.
>
> Note: we did something similar in the prototype for the AF_XDP_ZC PMD
> (presented at the BLR summit 2019).
>
> Advantage: no changes to the mempool library, mbuf library, or rte_malloc.
> The application works with zero changes.
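>
> (A minimal sketch of that external-buffer idea; free_cb and the header-only
> mbuf pool hdr_pool are assumed to exist, and the names are illustrative:)
>
>     uint16_t buf_len = 4096 + sizeof(struct rte_mbuf_ext_shared_info);
>     void *buf = rte_malloc(NULL, buf_len, 4096);          /* 4K-aligned data area */
>     struct rte_mbuf_ext_shared_info *shinfo =
>             rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len, free_cb, NULL);
>     struct rte_mbuf *m = rte_pktmbuf_alloc(hdr_pool);     /* mbuf with no data room */
>     rte_pktmbuf_attach_extbuf(m, buf, rte_malloc_virt2iova(buf), buf_len, shinfo);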
>
> >
> > This structure would only be applicable to a subset of mempools
> > with the following characteristics:
> > 1. mempools for which the following flags were set:
> > MEMPOOL_F_NO_CACHE_ALIGN, MEMPOOL_F_NO_IOVA_CONTIG,
> > MEMPOOL_F_NO_SPREAD (a creation call combining these is sketched after this list)
> > 2. mempools that do not require the use of the following
> > functions rte_mempool_from_obj (requires a pointer to the mp in the
> > header of each obj), rte_mempool_obj_iter.
> > 3. Any attempt to create this object when
> > RTE_LIBRTE_MEMPOOL_DEBUG was enabled would necessarily fail since we
> > can't check the header cookies.
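> >
> > (Sketch of a hypothetical creation call combining those flags; the counts
> > and sizes are illustrative:)
> >
> >     struct rte_mempool *mp = rte_mempool_create("aligned_bufs", 1024, 4096,
> >                     256, 0, NULL, NULL, NULL, NULL, SOCKET_ID_ANY,
> >                     MEMPOOL_F_NO_CACHE_ALIGN | MEMPOOL_F_NO_IOVA_CONTIG |
> >                     MEMPOOL_F_NO_SPREAD);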
> >
> > My thought would be that we could implement this data structure in
> > a header and it would look something like this:
> >
> > struct rte_mempool_aligned {
> >         struct rte_mempool mp;
> >         size_t obj_alignment;
> > };
> >
> > The rest of the functions in the header would primarily be
> > wrappers around the original functions. Most functions
> > (rte_mempool_alloc, rte_mempool_free, rte_mempool_enqueue/dequeue,
> > rte_mempool_get_count, etc.) could be implemented directly as
> > wrappers, and others such as rte_mempool_create and the populate
> > functions would have to be re-implemented to some degree in the new
> > header. The remaining functions (check_cookies, obj_iter) would not be
> > implemented in the rte_mempool_aligned.h file.
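> >
> > (A hypothetical wrapper, just to illustrate how thin most of them could be:)
> >
> >     static inline int
> >     rte_mempool_aligned_get(struct rte_mempool_aligned *amp, void **obj)
> >     {
> >             return rte_mempool_get(&amp->mp, obj);
> >     }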
> >
> > Would the community be welcoming of a new rte_mempool_aligned
> > struct? If you don't feel like this would be the way to go, are there
> > other options in DPDK for creating a pool of pre-allocated aligned objects?
> >
> > Thank you,
> >
> > Seth Howell
> >
> >
> >
>
>