Sharing a possible design for 4K-aligned addresses for objects.

> -----Original Message-----
> From: Harris, James R
> Sent: Wednesday, March 27, 2019 12:29 AM
> To: Howell, Seth; Varghese, Vipin; dev@dpdk.org
> Subject: Re: Aligned rte_mempool for storage applications
>
> On 3/26/19, 11:34 AM, "Howell, Seth" wrote:
>
> Hi Vipin,
>
> Thanks for your quick reply. I will respond to your queries in order.
> 1. Yes, in at least one case we have buffers of size 4096 bytes. Some of
> our other buffers are much larger (>64KiB).
> 2. These buffers are used in the I/O path, so performance is very
> important. Allocating and freeing a buffer each time we use it could be
> pretty costly.
>
> I think Vipin may have been suggesting allocating one (or multiple) very
> large buffers, and then splitting that buffer on 4KB boundaries in SPDK.
> If so, that would still require SPDK to develop its own mempool-like
> feature to hold those buffers. We'd really like to use the DPDK
> rte_mempool implementation rather than inventing our own.
>
> 3. Could you describe the idea of an indirect buffer in more detail? I
> don't think I quite understand that concept. I know we couldn't use mbufs
> because we often have buffers that are larger than 64k. I think there are
> more reasons we don't use the mbuf structure in our use case, but am not
> familiar with all of them. Maybe Jim can explain those in more detail.
>
> SPDK doesn't use rte_mbufs (except when absolutely required for things
> like DPDK cryptodev/compressdev). Most of that data structure is filled
> with network packet related fields that would never be used for storage.
> We could create our own very small data structure and do something
> similar to Vipin's indirect mbuf suggestion. And I think this is what
> Vipin was starting to allude to in query #2.
>
> It would be less optimal than a native aligned mempool because we'd be
> adding an extra pointer dereference on every get from the mempool - but
> probably only slightly less optimal. Seth - let's sync up offline and see
> if we can quickly collect some benchmarking data to measure the
> performance impact of this extra dereference.
>
> Thanks Vipin - this definitely gives us an alternative direction to
> investigate that we hadn't considered.
>
> -Jim
>
> Thanks,
>
> Seth
> -----Original Message-----
> From: Varghese, Vipin
> Sent: Monday, March 25, 2019 7:53 PM
> To: Harris, James R; Howell, Seth; dev@dpdk.org
> Subject: RE: Aligned rte_mempool for storage applications
>
> Hi Seth,
>
> If I may I would like to suggest and ask a query on the mempool alignment
> details. Please find my suggestion and query inline to the email.
>
> Snipped
>
> > In SPDK, we use the rte_mempool struct for many internal structure
> > collections. The per-thread cache and ease of allocation of mempools
> > are very useful features.
> > Some of the collections we store in SPDK are pools of I/O buffers.
> > Typically, these pools contain elements of at least 4096 bytes, and we
> > would like them to be aligned to 4k for performance reasons.
> Query-1> is the total memory required to be 4096 only (data portion)?
> >
> > [Jim] Just to clarify Seth's point - the performance reasons are
> > specifically to avoid wasteful memcopies. The vast majority of NVMe
> > SSDs in the market today do not have full scatter/gather support -
> > rather they only support something called PRP (Physical Region Pages)
> > which require all scatter gather elements except the first to be 4KB
> > aligned. There are other storage interfaces such as Linux AIO that
> > also impose alignment restrictions.
> >
> > -Jim
> >
> > Currently, the rte_mempool API doesn't support aligned mempool
> > objects.
> > This means that when we allocate a 4k buffer and want it
> > aligned to 4k, we actually need to allocate an 8k buffer and calculate
> > an offset into it each time we want to use it.
> Query-2> why not create contiguous 4K aligned memory with rte_malloc?
>
> > We recently did a proof of concept using the rte_mempool_ops hook
> > where we allocated a mempool and populated it with aligned entries.
> > This allowed us to retrieve aligned addresses directly from
> > rte_mempool_get(), but didn't help with the allocation size.
> > Because the rte_mempool struct assumes that each element has a
> > header attached to it, we still need to live up to that assumption for
> > each object we create in a mempool. This means that the actual size of
> > a buffer becomes 4k + 24 bytes. In order to get to our next aligned
> > address, we need to add about 4k of padding to each element.
> > Modifying the current rte_mempool struct to allow entries without
> > headers seems impossible since it would break rte_mempool_for_obj_iter
> > and rte_mempool_from_obj. However I still think there is a lot of
> > benefit to be gained from a mempool structure that supports aligned
> > objects without headers.
> > I am wondering if DPDK would be open to us introducing an
> > rte_mempool_aligned structure. This structure would essentially be a
> > wrapper around a regular mempool struct. However, it would not require
> > headers or trailers for each object in the pool.
> Query-3> using mempool with 0 size for data portion we can either create
> an indirect buffer or use external mbuf to attach MBUF to 4K aligned
> rte_malloc areas.
>
> Note: we did similar to the prototype for AF_XDP_ZC_PMD (presented in BLR
> summit 2019).
>
> Advantage: no change in mempool library, mbuf library, or rte_malloc.
> Application works with zero change.
>
> > This structure would only be applicable to a subset of mempools
> > with the following characteristics:
> > 1.
> > mempools for which the following flags were set:
> > MEMPOOL_F_NO_CACHE_ALIGN, MEMPOOL_F_NO_IOVA_CONTIG,
> > MEMPOOL_F_NO_SPREAD
> > 2. mempools that do not require the use of the following
> > functions rte_mempool_from_obj (requires a pointer to the mp in the
> > header of each obj), rte_mempool_for_obj_iter.
> > 3. Any attempt to create this object when
> > RTE_LIBRTE_MEMPOOL_DEBUG was enabled would necessarily fail since we
> > can't check the header cookies.
> >
> > My thought would be that we could implement this data structure in
> > a header and it would look something like this:
> >
> > struct rte_mempool_aligned {
> >     struct rte_mempool mp;
> >     size_t obj_alignment;
> > };
> >
> > The rest of the functions in the header would primarily be
> > wrappers around the original functions. Most functions
> > (rte_mempool_alloc, rte_mempool_free, rte_mempool_enqueue/dequeue,
> > rte_mempool_get_count, etc.) could be implemented directly as
> > wrappers, and others such as rte_mempool_create and the populate
> > functions would have to be re-implemented to some degree in the new
> > header. The remaining functions (check_cookies, for_obj_iter) would
> > not be implemented in the rte_mempool_aligned.h file.
> >
> > Would the community be welcoming of a new rte_mempool_aligned
> > struct? If you don't feel like this would be the way to go, are there
> > other options in DPDK for creating a pool of pre-allocated aligned
> > objects?
> >
> > Thank you,
> >
> > Seth Howell
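
For illustration only, here is a rough C sketch of the indirect-buffer idea from Query-3, together with the padding arithmetic Seth describes (4k + 24 byte header). It is not DPDK code and makes several assumptions: aligned_alloc() stands in for rte_malloc() with an alignment argument, a plain LIFO array stands in for rte_mempool, and the names io_buf_ref, ref_pool_* and pad_to_align are hypothetical. The pool elements are small references; the 4K-aligned data lives in one contiguous region split on 4K boundaries, so each get costs one extra pointer dereference.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_SIZE  4096u  /* data buffer size discussed in the thread */
#define BUF_ALIGN 4096u  /* alignment required for PRP-style I/O */
#define POOL_SIZE 8u     /* hypothetical pool depth, demo only */

/* Hypothetical pool element: only a pointer to an aligned data buffer. */
struct io_buf_ref {
    void *buf;
};

/* Stand-in for a mempool: a LIFO stack of free element pointers. */
struct ref_pool {
    struct io_buf_ref elems[POOL_SIZE];
    struct io_buf_ref *free_list[POOL_SIZE];
    unsigned int count;
    void *backing; /* one contiguous 4K-aligned region */
};

/* Allocate the backing region and point each element at its 4K slice. */
static int ref_pool_init(struct ref_pool *p)
{
    /* Assumption: aligned_alloc() plays the role of an aligned rte_malloc(). */
    p->backing = aligned_alloc(BUF_ALIGN, (size_t)BUF_SIZE * POOL_SIZE);
    if (p->backing == NULL)
        return -1;
    for (unsigned int i = 0; i < POOL_SIZE; i++) {
        p->elems[i].buf = (uint8_t *)p->backing + (size_t)i * BUF_SIZE;
        p->free_list[i] = &p->elems[i];
    }
    p->count = POOL_SIZE;
    return 0;
}

/* Take one reference from the pool, or NULL if the pool is empty. */
static struct io_buf_ref *ref_pool_get(struct ref_pool *p)
{
    return p->count ? p->free_list[--p->count] : NULL;
}

/* Return a reference to the pool. */
static void ref_pool_put(struct ref_pool *p, struct io_buf_ref *r)
{
    assert(p->count < POOL_SIZE);
    p->free_list[p->count++] = r;
}

static void ref_pool_destroy(struct ref_pool *p)
{
    free(p->backing);
    p->backing = NULL;
    p->count = 0;
}

/* Padding needed after len bytes to reach the next multiple of align
 * (align must be a power of two). */
static size_t pad_to_align(size_t len, size_t align)
{
    return (align - (len & (align - 1))) & (align - 1);
}
```

A caller would run ref_pool_init() once, then use ref_pool_get() and ref_pool_put() in the I/O path and read the data address through r->buf. For the header case in the thread, pad_to_align(4096 + 24, 4096) gives 4072 bytes, which is the "about 4k of padding" per element that the indirect layout avoids.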